Adrian Ionescu created SPARK-21538:
--------------------------------------
Summary: Attribute resolution inconsistency in Dataset API
Key: SPARK-21538
URL: https://issues.apache.org/jira/browse/SPARK-21538
Project: Spark
Issue Type: Story
Components: SQL
Affects Versions: 3.0.0
Reporter: Adrian Ionescu
{code}
spark.range(1).withColumnRenamed("id", "x").sort(col("id")) // works
spark.range(1).withColumnRenamed("id", "x").sort($"id") // works
spark.range(1).withColumnRenamed("id", "x").sort('id) // works
spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
...
{code}
It looks like the Dataset API functions taking {{String}} use the basic
resolver, which only looks at the columns available at that level, whereas all
the other means of expressing an attribute are resolved lazily during analysis.
The reason why the first 3 calls work is explained in the docs for
{{object ResolveMissingReferences}}:
{code}
/**
 * In many dialects of SQL it is valid to sort by attributes that are not present in the SELECT
 * clause. This rule detects such queries and adds the required attributes to the original
 * projection, so that they will be available during sorting. Another projection is added to
 * remove these attributes after sorting.
 *
 * The HAVING clause could also used a grouping columns that is not presented in the SELECT.
 */
{code}
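To make the difference concrete, here is a minimal toy model of the two lookup strategies. It is only a sketch: {{Plan}}, {{Relation}}, {{Project}}, {{Sort}}, {{sortEager}} and {{sortDeferred}} are hypothetical names invented for illustration and are not Spark's Catalyst classes or the actual rule implementation. It assumes a plan is just a list of visible column names, and that a rename is a projection of (source name, output name) pairs.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's Catalyst classes.
sealed trait Plan { def output: Seq[String] }

// A leaf that exposes a fixed set of column names.
final case class Relation(output: Seq[String]) extends Plan

// Keeps columns of the child, given as (sourceName -> outputName) pairs.
final case class Project(columns: Seq[(String, String)], child: Plan) extends Plan {
  def output: Seq[String] = columns.map(_._2)
}

// Sorting does not change which columns are visible.
final case class Sort(key: String, child: Plan) extends Plan {
  def output: Seq[String] = child.output
}

object ResolutionSketch {
  // Eager lookup: only the columns visible at the top of the plan are considered.
  def sortEager(plan: Plan, key: String): Either[String, Plan] =
    if (plan.output.contains(key)) Right(Sort(key, plan))
    else Left(s"""Cannot resolve column name "$key" among (${plan.output.mkString(", ")})""")

  // Deferred lookup: if the key is missing from a projection but present in its
  // child, widen the projection, sort, then project the extra column away again.
  def sortDeferred(plan: Plan, key: String): Either[String, Plan] = plan match {
    case p if p.output.contains(key) =>
      Right(Sort(key, p))
    case Project(cols, child) if child.output.contains(key) =>
      val widened = Project(cols :+ (key -> key), child)
      val restore = cols.map { case (_, out) => out -> out }
      Right(Project(restore, Sort(key, widened)))
    case p =>
      sortEager(p, key) // nothing to recover from, so report the same error
  }

  def main(args: Array[String]): Unit = {
    // Models spark.range(1).withColumnRenamed("id", "x")
    val renamed = Project(Seq("id" -> "x"), Relation(Seq("id")))
    println(sortEager(renamed, "id"))
    println(sortDeferred(renamed, "id"))
  }
}
```

In this model the eager path rejects {{"id"}} because only {{x}} is visible after the rename, while the deferred path succeeds and still exposes only {{x}} in its final output. That mirrors the asymmetry reported above, under the stated simplifications.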
For consistency, it would be good to use the same attribute resolution
mechanism everywhere.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)