True. As long as we can ensure the correct messages are printed, users can fix their applications easily. For example: Reference 'name' is ambiguous, could be: name#1, name#5.;
Thanks,

Xiao Li

2015-10-14 23:58 GMT-07:00 Reynold Xin <r...@databricks.com>:

> That could break a lot of applications. In particular, a lot of input data
> sources (csv, json) don't have a clean schema, and can have duplicate column
> names.
>
> For the case of join, maybe a better solution is to ask for a left/right
> prefix/suffix in the user code, similar to what Pandas does.
>
> On Wed, Oct 14, 2015 at 7:26 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Currently it seems DataFrame doesn't enforce the uniqueness of field names,
>> so it is possible to have the same field name more than once in a
>> DataFrame. This usually happens after a join, especially a self-join. The
>> user can rename the columns before the join, or rename them after the join
>> (although DataFrame#withColumnRenamed is not sufficient for now). In Hive,
>> the ambiguous name can be resolved by using the table name as a prefix, but
>> it seems DataFrame doesn't support this (I mean the DataFrame API rather
>> than SparkSQL). I think we have 2 options here:
>> 1. Enforce the uniqueness of field names in DataFrame, so that the
>> following operations would not cause an ambiguous column reference
>> 2. Provide DataFrame#withColumnsRenamed(oldColumns:Seq[String],
>> newColumns:Seq[String]) to allow changing schema names
>>
>> For now, I would prefer option 2, which is easier to implement and
>> keeps compatibility.
>>
>>
>> val df = ... // schema (name, age)
>> val df2 = df.join(df, "name") // schema (name, age, age)
>> df2.select("age") // ambiguous column reference.
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
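
For reference, the Pandas-style prefix/suffix idea mentioned in the thread could be sketched roughly as below. This is only an illustration in plain Scala, not an existing or proposed Spark API: the object name `JoinSuffixes`, the helper `suffixOverlaps`, and the default suffixes `_x` / `_y` (borrowed from the Pandas convention) are all hypothetical. It only computes the new column names for each side of a join; it does not touch a DataFrame.

```scala
// Hypothetical helper, NOT part of the Spark API: computes Pandas-style
// suffixed column names for the two sides of a join.
object JoinSuffixes {
  def suffixOverlaps(
      left: Seq[String],
      right: Seq[String],
      joinKeys: Set[String],
      leftSuffix: String = "_x",   // assumed default, mirrors Pandas
      rightSuffix: String = "_y"   // assumed default, mirrors Pandas
  ): (Seq[String], Seq[String]) = {
    // Names present on both sides that are not join keys would be
    // ambiguous after the join, so only those get a suffix.
    val clashing = left.toSet.intersect(right.toSet) -- joinKeys

    def rename(names: Seq[String], suffix: String): Seq[String] =
      names.map(n => if (clashing.contains(n)) n + suffix else n)

    (rename(left, leftSuffix), rename(right, rightSuffix))
  }

  def main(args: Array[String]): Unit = {
    // Same schema as the self-join example in the thread: (name, age)
    val schema = Seq("name", "age")
    val (l, r) = suffixOverlaps(schema, schema, joinKeys = Set("name"))
    println(l.mkString(", ")) // name, age_x
    println(r.mkString(", ")) // name, age_y
  }
}
```

Assuming the two name lists are applied to each side before the join (for example with `toDF(names: _*)`), the joined schema in the example would be (name, age_x, age_y), and selecting either age column would no longer be ambiguous.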