Re: Should enforce the uniqueness of field name in DataFrame ?

Reynold Xin Thu, 15 Oct 2015 00:00:04 -0700

That could break a lot of applications. In particular, a lot of input data
sources (csv, json) don't have clean schemas and can contain duplicate column
names.


For the case of join, maybe a better solution is to ask for a left/right
prefix/suffix in the user code, similar to what Pandas does.
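
To make the suffix idea concrete, here is a rough sketch in plain Scala of how
output column names could be resolved for a join. It is only an illustration:
resolveNames, lsuffix and rsuffix are hypothetical names, not part of the
DataFrame API, and the "_x"/"_y" defaults simply mirror the defaults of
Pandas' merge. Only names that appear on both sides (other than the join
keys) get a suffix.

```scala
// Hypothetical helper, not part of Spark: computes the output column names
// of a join, suffixing only the non-key names that appear on both sides.
object JoinSuffixSketch {
  def resolveNames(
      left: Seq[String],
      right: Seq[String],
      keys: Seq[String],
      lsuffix: String = "_x",
      rsuffix: String = "_y"): Seq[String] = {
    val keySet = keys.toSet
    // Names present on both sides, excluding the join keys.
    val clashes = left.toSet.intersect(right.toSet) -- keySet
    // Left columns keep their order; clashing names get the left suffix.
    val leftOut = left.map(n => if (clashes(n)) n + lsuffix else n)
    // Join keys appear once (from the left), so drop them on the right.
    val rightOut = right
      .filterNot(keySet)
      .map(n => if (clashes(n)) n + rsuffix else n)
    leftOut ++ rightOut
  }
}
```

For the self-join in the quoted example,
JoinSuffixSketch.resolveNames(Seq("name", "age"), Seq("name", "age"), Seq("name"))
yields Seq("name", "age_x", "age_y"), so a later select on either age column
would no longer be ambiguous.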

On Wed, Oct 14, 2015 at 7:26 PM, Jeff Zhang <[email protected]> wrote:

>
> Currently it seems DataFrame doesn't enforce the uniqueness of field names,
> so it is possible to have identically named fields in a DataFrame. This
> usually happens after a join, especially a self-join. The user can rename
> the columns before or after the join, but DataFrame#withColumnRenamed is
> not sufficient for now. In Hive, the ambiguous name can be resolved by
> using the table name as a prefix, but it seems DataFrame doesn't support
> that (I mean the DataFrame API rather than Spark SQL). I think we have 2
> options here:
> 1. Enforce the uniqueness of field names in DataFrame, so that
> subsequent operations would not cause ambiguous column references
> 2. Provide DataFrame#withColumnsRenamed(oldColumns:Seq[String],
> newColumns:Seq[String]) to allow changing schema names
>
> For now, I would prefer option 2, which is easier to implement and
> keeps compatibility.
>
>
> val df = ...        // schema (name, age)
> val df2 = df.join(df, "name")   // schema (name, age, age)
> df2.select("age")   // ambiguous column reference.
>
> --
> Best Regards
>
> Jeff Zhang
>
