[GitHub] spark pull request: [SPARK-9241] [SQL] [WIP] Supporting multiple D...

hvanhovell Mon, 26 Oct 2015 08:25:14 -0700

GitHub user hvanhovell opened a pull request:

    https://github.com/apache/spark/pull/9280


    [SPARK-9241] [SQL] [WIP] Supporting multiple DISTINCT columns

    This PR adds support for multiple distinct columns to the new aggregation 
code path. 
    
    The implementation uses the ```OpenHashSet``` class and set expressions. As 
a result, we can only use the slower sort-based aggregation code path. This also 
means the code will probably be slower than the old hash aggregation.
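
To illustrate the set-based idea, here is a minimal plain-Scala sketch; it is not the code from this PR. It keeps one set per DISTINCT column list for each group and reports each count as the size of that set. The names `DistinctSketch`, `Employee`, `DistinctBuffers`, and `distinctCounts` are hypothetical, `scala.collection.mutable.HashSet` stands in for Spark's `OpenHashSet`, and groups are kept in a map for brevity, so the sort-based path and NULL handling are not modeled.

```scala
import scala.collection.mutable

object DistinctSketch {
  // Hypothetical row type mirroring the columns used in the example query.
  final case class Employee(departmentId: Int, gender: String, educationLevel: Int)

  // One set per DISTINCT column list; each distinct count is the set's size.
  final class DistinctBuffers {
    private val genderAndEducation = mutable.HashSet.empty[(String, Int)]
    private val gender = mutable.HashSet.empty[String]
    private val education = mutable.HashSet.empty[Int]

    // Add the row's values to every set; duplicates are ignored by the sets.
    def update(e: Employee): Unit = {
      genderAndEducation += ((e.gender, e.educationLevel))
      gender += e.gender
      education += e.educationLevel
    }

    // (count(distinct gender, education_level), count(distinct gender),
    //  count(distinct education_level))
    def counts: (Int, Int, Int) =
      (genderAndEducation.size, gender.size, education.size)
  }

  // Group rows by department and compute the three distinct counts per group.
  def distinctCounts(rows: Seq[Employee]): Map[Int, (Int, Int, Int)] = {
    val buffers = mutable.Map.empty[Int, DistinctBuffers]
    rows.foreach { e =>
      buffers.getOrElseUpdate(e.departmentId, new DistinctBuffers).update(e)
    }
    buffers.map { case (dept, b) => dept -> b.counts }.toMap
  }
}
```

Because every group carries a full set for each distinct column list, memory use grows with the number of distinct values per group, which is one reason a set-based approach can be slower than a plain hash aggregation.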
    
    The PR is currently in the proof-of-concept phase, and I have submitted it 
to get some feedback on whether I am headed in the right direction. I'll add 
more tests if this is considered to be the way to go.
    
    An example using the new code path:
    
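        // Assumes a Spark 1.x spark-shell session, where sqlContext is predefined.
        import org.apache.spark.sql.functions.{rand, when}
        import sqlContext.implicits._
        import sqlContext.sql
    
        val df = sqlContext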
          .range(1 << 25)
          .select(
            $"id".as("employee_id"),
            (rand(6321782L) * 4 + 1).cast("int").as("department_id"),
            when(rand(981293L) >= 0.5, "M").otherwise("F").as("gender"),
            (rand(7123L) * 3 + 1).cast("int").as("education_level")
          )
    
        df.registerTempTable("employee")
    
        // Regular query.
        sql("""
        select   department_id as d,
                 count(distinct gender, education_level) as c0,
                 count(distinct gender) as c1,
                 count(distinct education_level) as c2
        from     employee
        group by department_id
        """).show()
    
    cc @yhuai 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark SPARK-9241

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9280
    
----
commit 256e1f6902b8adbc304c6e287d7cfdf2ef97b12b
Author: Herman van Hovell <hvanhov...@questtec.nl>
Date:   2015-10-26T12:46:33Z

    Created distinct fallback mechanism.

commit 6a87384de8d934327ead72daf7210e29be8687b6
Author: Herman van Hovell <hvanhov...@questtec.nl>
Date:   2015-10-26T13:35:01Z

    Added fallback distinct creation to aggregate conversion.

commit 3bd6db5390dee044ab4673e38329f584b0436a66
Author: Herman van Hovell <hvanhov...@questtec.nl>
Date:   2015-10-26T15:07:22Z

    Fix style. Fix CG for OpenHashSetUDT. Fix bug.

----


