Adrian Ionescu created SPARK-21538:
--------------------------------------

             Summary: Attribute resolution inconsistency in Dataset API
                 Key: SPARK-21538
                 URL: https://issues.apache.org/jira/browse/SPARK-21538
             Project: Spark
          Issue Type: Story
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Adrian Ionescu


{code}
spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
spark.range(1).withColumnRenamed("id", "x").sort('id) // works
spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
...
{code}

It looks like the Dataset API functions that take a {{String}} use the basic resolver, which only looks at the columns present at that level, whereas all the other ways of expressing an attribute are resolved lazily by the analyzer.

The reason the first three calls work is explained in the Scaladoc for {{object ResolveMissingReferences}}:
{code}
  /**
   * In many dialects of SQL it is valid to sort by attributes that are not present in the SELECT
   * clause.  This rule detects such queries and adds the required attributes to the original
   * projection, so that they will be available during sorting. Another projection is added to
   * remove these attributes after sorting.
   *
   * The HAVING clause could also used a grouping columns that is not presented in the SELECT.
   */
{code}
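For reference, the SQL behaviour this rule describes can be seen with a query whose ORDER BY column is dropped by the SELECT clause (a small sketch, again assuming a spark-shell session):

{code}
// Illustrative only: "id" is not in the projection, but the analyzer adds it
// for the sort and projects it away afterwards, so the query is valid.
spark.range(1).createOrReplaceTempView("t")
spark.sql("SELECT id AS x FROM t ORDER BY id").show()
{code}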

For consistency, it would be good to use the same attribute resolution 
mechanism everywhere.
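As a purely hypothetical illustration of what unifying the two paths could look like from the caller's side (not actual Spark source; {{sortByNames}} is an invented helper), the {{String}} overloads could delegate to the lazily-resolved {{Column}} path:

{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col

// Hypothetical helper, not Spark source: route String column names through
// functions.col so they become unresolved attributes handled by the analyzer,
// matching the behaviour of the col("...") / $"..." / 'sym variants.
def sortByNames[T](ds: Dataset[T], names: String*): Dataset[T] =
  ds.sort(names.map(col): _*)
{code}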


