DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table
aliases.

Aliases are essential when joining DataFrames that have identically named
columns.

>>> # PySpark 1.3.1
>>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
>>> df12 = df1.join(df2, df1['a'] == df2['a'])
>>> df12
DataFrame[a: bigint, other: string, a: bigint, other: string]
>>> df12.printSchema()
root
 |-- a: long (nullable = true)
 |-- other: string (nullable = true)
 |-- a: long (nullable = true)
 |-- other: string (nullable = true)

Now, trying any one of the following:

df12.select('a')
df12['a']
df12.a

yields this:

org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
could be: a#360L, a#358L.;

The same goes for accessing the 'other' field.

This is good, but what are we supposed to do in this case?
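
One workaround that seems to work, at least when the two sides come from
distinct DataFrame objects, is to keep referencing columns through the
original DataFrames rather than through the joined result (just a sketch; I
haven't thought through all the cases):

# Columns pulled from the parent DataFrames carry distinct expression IDs
# (a#358L vs a#360L above), so the analyzer can tell them apart even though
# the names collide.
df12.select(df1['a'], df2['other'])

But that can't disambiguate a self-join, where both sides are the same
DataFrame, which brings us back to aliasing.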

SQL solves this by fully qualifying the column name with the table name,
and also offering table aliasing <http://dba.stackexchange.com/a/5991/2660>
in the case where you are joining a table to itself.
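
For comparison, that SQL route is already reachable from PySpark by
registering the DataFrames as temp tables (the table names here are made up):

# Fully qualified column names work once the tables have names to qualify by.
df1.registerTempTable("t1")
df2.registerTempTable("t2")
sqlContext.sql("SELECT t1.a, t2.other FROM t1 JOIN t2 ON t1.a = t2.a")

But dropping out of the DataFrame API just to name the two sides of a join
feels like a step backwards.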

If we translate this directly into DataFrames lingo, perhaps it would look
something like:

df12['df1.a']
df12['df2.other']

But I’m not sure how this fits into the larger API. This certainly isn’t
backwards compatible with how joins are done now.

So what’s the recommended course of action here?

Having to unique-ify all your column names before joining doesn’t sound
like a nice solution.
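
For concreteness, that renaming workaround would look roughly like this
(assuming withColumnRenamed; the new column names are arbitrary):

# Rename the colliding columns on one side before joining, so every column
# in the joined result has a unique name.
df2_renamed = (df2
    .withColumnRenamed('a', 'a2')
    .withColumnRenamed('other', 'other2'))
df12 = df1.join(df2_renamed, df1['a'] == df2_renamed['a2'])
df12.select('a', 'other2')  # unambiguous, but at the cost of extra noise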

Nick
