[GitHub] spark pull request: [SPARK-12725] [SQL] Resolving Name Conflicts i...

gatorsmile Wed, 03 Feb 2016 06:34:43 -0800

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/11050


    [SPARK-12725] [SQL] Resolving Name Conflicts in SQL Generation by Adding a flag `isGenerated` to Alias and AttributeReference

    Some analysis rules generate auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.
    
    This is OK for normal query execution, since these attribute references are told apart by their expression IDs. However, it is troublesome when converting resolved query plans back to SQL query strings, since expression IDs are erased in that conversion.
    
    Here's an example Spark 1.6.0 snippet for illustration:
    ```scala
    sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
    sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
    ```
    The above code produces the following resolved plan:
    ```
    == Analyzed Logical Plan ==
    _c0: bigint
    Project [_c0#101L]
    +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
       +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
          +- Subquery t
             +- Project [id#46L AS a#47L,id#46L AS b#48L]
                +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
    ```
    Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.
    
    The solution is for SQL generation to automatically add the expression IDs to the `Alias` and `AttributeReference` instances that are generated by the Analyzer.
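    
    As a rough, self-contained illustration of the idea (this is not the actual Catalyst code: `Attr`, `sqlName`, and `NameConflictSketch` are hypothetical names, and the exact output format `name#id` is an assumption), a generated attribute could render its name together with its expression ID, while user-defined names stay unchanged:
    ```scala
    // Hypothetical stand-in for Catalyst's AttributeReference; NOT the real class.
    case class Attr(name: String, exprId: Long, isGenerated: Boolean = false) {
      // Assumed rendering: generated attributes carry their expression ID in the
      // SQL name so they stay unique; user-defined names are left untouched.
      def sqlName: String =
        if (isGenerated) s"`${name}#${exprId}`" else s"`${name}`"
    }

    object NameConflictSketch {
      // Attributes loosely mirroring the analyzed plan above.
      val attrs: Seq[Attr] = Seq(
        Attr("a", 47L),
        Attr("aggOrder", 102L, isGenerated = true),
        Attr("aggOrder", 103L, isGenerated = true))

      // Names with expression IDs erased: the two generated attributes collide.
      val plainNames: Seq[String] = attrs.map(a => s"`${a.name}`")

      // Names with the flag honored: every rendered name is distinct.
      val sqlNames: Seq[String] = attrs.map(_.sqlName)

      def main(args: Array[String]): Unit = {
        println(plainNames.mkString(", "))
        println(sqlNames.mkString(", "))
      }
    }
    ```
    In this sketch only attributes flagged as generated change their rendered name, so user-visible column names in the generated SQL are not affected.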
    
    Could you review the solution? @marmbrus @liancheng 
    
    I did not set the newly added flag for every alias and attribute reference generated by the Analyzer. Please let me know if I should do so. Thank you!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark namingConflicts

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11050.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11050
    
----
commit 7937d2be2e163736ce90857bf6eb4209001e32e5
Author: gatorsmile <[email protected]>
Date:   2016-02-02T07:35:06Z

    turn on the test.

commit 82bb46fefd69a74f699ff97be5e63a866c318a80
Author: gatorsmile <[email protected]>
Date:   2016-02-03T14:20:09Z

    added a flag isGenerated to Alias and AttributeReference

----


