Re: Should enforce the uniqueness of field name in DataFrame ?

Koert Kuipers Thu, 15 Oct 2015 00:28:56 -0700

if DataFrame aspires to be more than a vehicle for SQL then i think it
would be a mistake to allow duplicate column names. it is very confusing.
pandas indeed allows this and it has led to many bugs. R does not allow it
for data.frame (it renames the duplicate names).


i would consider a csv with duplicate column names invalid and it should
not be loaded, or if it is loaded dupes should be renamed (e.g. append a
"1" to the name).
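
a minimal sketch of the renaming idea, in plain scala so it does not depend on any Spark internals. the helper name dedupeColumnNames and the suffix scheme (a running counter appended to each repeated name) are assumptions for illustration only, not an existing Spark API:

```scala
// Hypothetical helper, not part of Spark: make column names unique by
// appending a running counter to every repeated occurrence of a name.
def dedupeColumnNames(names: Seq[String]): Seq[String] = {
  // How many times each name has been seen so far.
  val seen = scala.collection.mutable.Map.empty[String, Int]
  names.map { name =>
    val count = seen.getOrElse(name, 0)
    seen(name) = count + 1
    // The first occurrence keeps its name; later ones get "1", "2", ...
    if (count == 0) name else s"$name$count"
  }
}
```

with this, (name, age, age) would become (name, age, age1), and the result could be applied with something like df.toDF(dedupeColumnNames(df.columns): _*). note the sketch does not guard against a generated name colliding with a column that is already called "age1".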

DataFrame did check for duplicate column names until Sep 2014, but then the
check got pushed into the SQL planner making DataFrame standalone (so
without SQL) less useful as an API.
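
for illustration, an eager check of this kind could look roughly like the sketch below. this is not the code that was removed from Spark; the function names are made up, and it compares names exactly (case-sensitive), which is an assumption:

```scala
// Hypothetical sketch, not Spark's actual check: find names that occur
// more than once in a list of column names.
def duplicateColumnNames(names: Seq[String]): Seq[String] =
  names.groupBy(identity)
    .collect { case (name, occurrences) if occurrences.size > 1 => name }
    .toSeq
    .sorted

// Fail fast with an IllegalArgumentException if any name is duplicated.
def requireUniqueColumnNames(names: Seq[String]): Unit = {
  val dupes = duplicateColumnNames(names)
  require(dupes.isEmpty, s"Duplicate column name(s): ${dupes.mkString(", ")}")
}
```

running a check like this when a DataFrame is constructed would surface the problem at the point where the duplicate is introduced, rather than later when a column is referenced.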

i filed a jira about this a while ago here:
https://issues.apache.org/jira/browse/SPARK-8817



On Thu, Oct 15, 2015 at 3:05 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> True. As long as we can ensure the correct message is printed out, users
> can correct their app easily. For example: Reference 'name' is ambiguous,
> could be: name#1, name#5.;
>
> Thanks,
>
> Xiao Li
>
> 2015-10-14 23:58 GMT-07:00 Reynold Xin <r...@databricks.com>:
>
>> That could break a lot of applications. In particular, a lot of input
>> data sources (csv, json) don't have clean schema, and can have duplicate
>> column names.
>>
>> For the case of join, maybe a better solution is to ask for a left/right
>> prefix/suffix in the user code, similar to what Pandas does.
>>
>> On Wed, Oct 14, 2015 at 7:26 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>>
>>> Currently it seems DataFrame doesn't enforce the uniqueness of field names,
>>> so it is possible to have duplicate fields in a DataFrame. This usually
>>> happens after a join, especially a self-join. The user can rename the
>>> columns before or after the join, although
>>> DataFrame#withColumnRenamed is not sufficient for now. In Hive, the
>>> ambiguous name can be resolved by using the table name as a prefix, but it
>>> seems DataFrame doesn't support this (I mean the DataFrame API rather than
>>> SparkSQL). I think we have 2 options here:
>>> 1. Enforce the uniqueness of field name in DataFrame, so that the
>>> following operations would not cause ambiguous column reference
>>> 2. Provide DataFrame#withColumnsRenamed(oldColumns: Seq[String],
>>> newColumns: Seq[String]) to allow changing schema names
>>>
>>> For now, I would prefer option 2, which is easier to implement and
>>> keeps compatibility.
>>>
>>>
>>> val df = ...        // schema (name, age)
>>> val df2 = df.join(df, "name")   // schema (name, age, age)
>>> df2.select("age")   // ambiguous column reference.
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
