Re: Should enforce the uniqueness of field name in DataFrame ?

Michael Armbrust Thu, 15 Oct 2015 11:41:52 -0700

>
>  In Hive, an ambiguous name can be resolved by using the table name as a
> prefix, but it seems DataFrame doesn't support this (I mean the DataFrame
> API rather than Spark SQL)



You can do the same using pure DataFrames.

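// Assumes spark-shell on Spark 1.x with sqlContext.implicits._ already imported.
import sqlContext.table

Seq((1,2)).toDF("a", "b").registerTempTable("y")
Seq((1,4)).toDF("a", "b").registerTempTable("x")

// Qualify each column with its table name to avoid the ambiguity.
table("x").join(table("y"), $"x.a" === $"y.a").select("y.b", "x.b").show()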
+---+---+
|  b|  b|
+---+---+
|  2|  4|
+---+---+
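
If you would rather not register temp tables at all, a column can also be referenced through the DataFrame it came from. The following is only a sketch, written against the 1.x SQLContext API used in this thread; the object name, the `joinOnA` helper, and the local context setup are assumptions made so the example stands alone.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object JoinByParentFrame {
  // Joins x and y on "a" and returns (y.b, x.b). Each column is looked up
  // through its own DataFrame, so the shared names are never ambiguous.
  def joinOnA(x: DataFrame, y: DataFrame): DataFrame =
    x.join(y, x("a") === y("a")).select(y("b"), x("b"))

  def main(args: Array[String]): Unit = {
    // Local context for illustration; spark-shell already provides these.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[1]").setAppName("join-by-parent-frame"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val y = Seq((1, 2)).toDF("a", "b")
    val x = Seq((1, 4)).toDF("a", "b")

    joinOnA(x, y).show()
    sc.stop()
  }
}
```

This works here because x and y are distinct DataFrames. For a self-join the two sides share the same underlying columns, and an alias on each side is the usual way to tell them apart.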

> DataFrame did check for duplicate column names until Sep 2014, but then the
> check got pushed into the SQL planner making DataFrame standalone (so
> without SQL) less useful as an API.


The check in question was removed because it made it impossible to even
reason about a schema that had duplicate column names.  In general, it
seems restrictive to throw an error if duplicate column names exist in an
intermediate schema even when they aren't referenced ambiguously.  We could
consider adding an option to throw an error during analysis for this case,
but it certainly shouldn't be in the constructor of StructType.  My guess
is an option to rename as Reynold suggests would be more popular (though
this could probably not be the default without breaking things).
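
To make the two options concrete, here is a minimal sketch in plain Scala that works on a list of column names rather than on a real StructType. `duplicateNames`, `requireUnique` and `renameDuplicates` are hypothetical helpers, not Spark APIs, and the `_1`, `_2` suffix scheme is only one assumption about how a rename option might behave.

```scala
object DuplicateColumns {
  // Names that occur more than once, in order of first appearance.
  // Comparison is case-sensitive here, which is a simplification.
  def duplicateNames(names: Seq[String]): Seq[String] =
    names.distinct.filter(n => names.count(_ == n) > 1)

  // Opt-in check: throws IllegalArgumentException only when the caller asks
  // for strictness, instead of failing when the schema is constructed.
  def requireUnique(names: Seq[String]): Unit = {
    val dups = duplicateNames(names)
    require(dups.isEmpty, s"Duplicate column names: ${dups.mkString(", ")}")
  }

  // Rename option: keep the first occurrence and suffix later ones
  // (b, b_1, b_2, ...). It does not guard against colliding with an
  // existing column that is already called b_1.
  def renameDuplicates(names: Seq[String]): Seq[String] = {
    val seen = scala.collection.mutable.Map.empty[String, Int]
    names.map { n =>
      val k = seen.getOrElse(n, 0)
      seen(n) = k + 1
      if (k == 0) n else s"${n}_$k"
    }
  }
}
```

With the join result from earlier in this message, `DuplicateColumns.renameDuplicates(Seq("b", "b"))` gives `Seq("b", "b_1")`, while `DuplicateColumns.requireUnique(Seq("b", "b"))` throws. Keeping the check separate from schema construction is what lets a duplicate-named intermediate schema exist at all.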

Another option that seems nice to me is to always add default qualifiers of
left/right when doing a join.  So you could always do:

df.join(df).where("left.a = right.a")

This would work even when you didn't manually specify left/right.  The
default qualifiers would be added only when there is not already a qualifier
called left or right.
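
Until something like that exists, the effect can be approximated by aliasing both sides explicitly. The sketch below assumes the 1.x SQLContext API; `joinQualified` is a hypothetical helper, not part of Spark, and unlike the proposal it does not check whether a qualifier named left or right is already present.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Column, DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

object DefaultQualifiers {
  // Hypothetical helper: tags the two sides with the fixed aliases "left"
  // and "right", so the join condition can refer to them by qualifier.
  def joinQualified(l: DataFrame, r: DataFrame, cond: Column): DataFrame =
    l.as("left").join(r.as("right"), cond)

  def main(args: Array[String]): Unit = {
    // Local context for illustration; spark-shell already provides these.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[1]").setAppName("default-qualifiers"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = Seq((1, 2), (3, 4)).toDF("a", "b")

    // Self-join: without the aliases, "a" would be ambiguous on both sides.
    joinQualified(df, df, col("left.a") === col("right.a"))
      .select("left.a", "left.b", "right.b")
      .show()

    sc.stop()
  }
}
```

The condition is built with `col(...)` rather than a string expression because `left` and `right` are also SQL keywords, and how a string expression would parse them is not confirmed here.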
