Re: Should enforce the uniqueness of field name in DataFrame ?

Xiao Li Thu, 15 Oct 2015 00:06:07 -0700

True. As long as we can ensure that the correct message is printed out, users
can correct their apps easily. For example: Reference 'name' is ambiguous,
could be: name#1, name#5.;


Thanks,

Xiao Li

2015-10-14 23:58 GMT-07:00 Reynold Xin <r...@databricks.com>:

> That could break a lot of applications. In particular, a lot of input data
> sources (csv, json) don't have a clean schema, and can have duplicate column
> names.
>
> For the case of join, maybe a better solution is to ask for a left/right
> prefix/suffix in the user code, similar to what Pandas does.
>
> On Wed, Oct 14, 2015 at 7:26 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>>
>> Currently it seems DataFrame doesn't enforce the uniqueness of field names,
>> so it is possible to have fields with the same name in a DataFrame. This
>> usually happens after a join, especially a self-join. Users can rename the
>> columns before the join, or rename them after the join (though
>> DataFrame#withColumnRenamed is not sufficient for now). In Hive, an
>> ambiguous name can be resolved by using the table name as a prefix, but it
>> seems DataFrame doesn't support that (I mean the DataFrame API rather than
>> SparkSQL). I think we have 2 options here:
>> 1. Enforce the uniqueness of field names in DataFrame, so that
>> subsequent operations would not cause ambiguous column references
>> 2. Provide DataFrame#withColumnsRenamed(oldColumns:Seq[String],
>> newColumns:Seq[String]) to allow changing schema names
>>
>> For now, I would prefer option 2, which is easier to implement and
>> keeps compatibility.
>>
>>
>> val df = ...        // schema (name, age)
>> val df2 = df.join(df, "name")   // schema (name, age, age)
>> df2.select("age")   // ambiguous column reference.
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
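
To make the left/right suffix idea from the thread concrete, here is a minimal
sketch in plain Scala of how the output column names of a join could be
computed. It is only an illustration: JoinColumnNames and disambiguate are
hypothetical names, not part of the Spark API, and the default suffixes "_x"
and "_y" are borrowed from Pandas rather than from anything Spark defines.

```scala
// Hypothetical helper, NOT part of the Spark API: computes Pandas-style
// suffixed names for the output columns of a join on `keys`.
object JoinColumnNames {

  /** Output column names: the key columns once, then the remaining left
    * columns, then the remaining right columns. A non-key name that occurs
    * on both sides gets `leftSuffix` or `rightSuffix` appended.
    */
  def disambiguate(
      left: Seq[String],
      right: Seq[String],
      keys: Seq[String],
      leftSuffix: String = "_x",   // assumption: mirrors the Pandas default
      rightSuffix: String = "_y"   // assumption: mirrors the Pandas default
  ): Seq[String] = {
    val keySet = keys.toSet
    val leftRest = left.filterNot(keySet)
    val rightRest = right.filterNot(keySet)

    // Names present on both sides (excluding join keys) are the ambiguous ones.
    val clashing = leftRest.toSet.intersect(rightRest.toSet)

    def suffixed(names: Seq[String], suffix: String): Seq[String] =
      names.map(n => if (clashing(n)) n + suffix else n)

    keys ++ suffixed(leftRest, leftSuffix) ++ suffixed(rightRest, rightSuffix)
  }
}
```

For the example in the thread, disambiguate(Seq("name", "age"),
Seq("name", "age"), Seq("name")) gives Seq("name", "age_x", "age_y"). The
resulting names could then presumably be applied by position, for instance
with df2.toDF(names: _*), which avoids looking columns up by their
(duplicated) names.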
