That could break a lot of applications. In particular, many input data sources (CSV, JSON) don't have a clean schema and can contain duplicate column names.
For joins, a better solution may be to ask for left/right prefixes or suffixes in the user code, similar to what Pandas does.

On Wed, Oct 14, 2015 at 7:26 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> Currently DataFrame doesn't seem to enforce the uniqueness of field names,
> so it is possible to have identical field names in a DataFrame. This usually
> happens after a join, especially a self-join. The user can rename the
> columns before or after the join, but DataFrame#withColumnRenamed is not
> sufficient for now. In Hive, an ambiguous name can be resolved by using
> the table name as a prefix, but DataFrame doesn't seem to support this (I
> mean the DataFrame API rather than SparkSQL). I think we have 2 options here:
>
> 1. Enforce the uniqueness of field names in DataFrame, so that subsequent
>    operations do not cause ambiguous column references
> 2. Provide DataFrame#withColumnsRenamed(oldColumns: Seq[String],
>    newColumns: Seq[String]) to allow changing schema names
>
> For now, I would prefer option 2, which is easier to implement and
> keeps compatibility.
>
> val df = ... // schema (name, age)
> val df2 = df.join(df, "name") // schema (name, age, age)
> df2.select("age") // ambiguous column reference
>
> --
> Best Regards
>
> Jeff Zhang
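
To make the suffix idea from the reply concrete, here is a rough sketch in plain Scala with no Spark dependency. It only computes what the output column names of a join could look like if clashing names were suffixed. `JoinNames`, `suffixedColumns`, and its parameters are hypothetical names for illustration, not an existing DataFrame API; the default suffixes mirror the defaults of Pandas' merge.

```scala
// Hypothetical helper, not part of the DataFrame API: computes the output
// column names of a join, adding suffixes only where names would collide.
object JoinNames {
  def suffixedColumns(
      left: Seq[String],
      right: Seq[String],
      joinKeys: Seq[String],
      suffixes: (String, String) = ("_x", "_y")): Seq[String] = {
    val keys = joinKeys.toSet
    // Non-key names that appear on both sides need disambiguation.
    val clashing: Set[String] =
      left.filterNot(keys).toSet.intersect(right.filterNot(keys).toSet)
    val leftOut = left.map(c => if (clashing(c)) c + suffixes._1 else c)
    // Join keys are emitted once (from the left side), as in df.join(df, "name").
    val rightOut = right.filterNot(keys).map(c => if (clashing(c)) c + suffixes._2 else c)
    leftOut ++ rightOut
  }

  def main(args: Array[String]): Unit = {
    // Self-join from the quoted example: schema (name, age) joined on "name".
    val cols = suffixedColumns(Seq("name", "age"), Seq("name", "age"), Seq("name"))
    println(cols.mkString(", ")) // prints "name, age_x, age_y"
  }
}
```

Names computed this way could then be applied to the joined result in one step, for example with `df2.toDF(cols: _*)`, so that `select("age_x")` is no longer ambiguous. Whether this belongs in the join API itself or in user code is the open question in this thread.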