GitHub user hvanhovell opened a pull request: https://github.com/apache/spark/pull/9280
[SPARK-9241] [SQL] [WIP] Supporting multiple DISTINCT columns

This PR adds support for multiple distinct columns to the new aggregation code path.

The implementation uses the ```OpenHashSet``` class and set expressions. As a result we can only use the slower sort-based aggregation code path. This also means the code will probably be slower than the old hash aggregation.

The PR is currently in the proof of concept phase, and I have submitted it to get some feedback to see if I am headed in the right direction. I'll add more tests if this is considered to be the way to go.

An example using the new code path:

    // Imports assumed for a spark-shell session with a sqlContext in scope.
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    val df = sqlContext
      .range(1 << 25)
      .select(
        $"id".as("employee_id"),
        (rand(6321782L) * 4 + 1).cast("int").as("department_id"),
        when(rand(981293L) >= 0.5, "M").otherwise("F").as("gender"),
        (rand(7123L) * 3 + 1).cast("int").as("education_level")
      )
    df.registerTempTable("employee")

    // Regular query.
    sqlContext.sql("""
      select department_id as d,
             count(distinct gender, education_level) as c0,
             count(distinct gender) as c1,
             count(distinct education_level) as c2
      from employee
      group by department_id
    """).show()

cc @yhuai

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark SPARK-9241

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9280

----
commit 256e1f6902b8adbc304c6e287d7cfdf2ef97b12b
Author: Herman van Hovell <hvanhov...@questtec.nl>
Date:   2015-10-26T12:46:33Z

    Created distinct fallback mechanism.

commit 6a87384de8d934327ead72daf7210e29be8687b6
Author: Herman van Hovell <hvanhov...@questtec.nl>
Date:   2015-10-26T13:35:01Z

    Added fallback distinct creation to aggregate conversion.

commit 3bd6db5390dee044ab4673e38329f584b0436a66
Author: Herman van Hovell <hvanhov...@questtec.nl>
Date:   2015-10-26T15:07:22Z

    Fix style.
    Fix CG for OpenHashSetUDT. Fix bug.

----
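
To illustrate the set-based idea described in the PR text (one set per DISTINCT aggregate within each group, with the count taken as the set size), here is a minimal sketch in plain Scala. It is not the PR's code: plain Scala `Set` stands in for Spark's `OpenHashSet`, the `Employee` case class and the `countDistincts` helper are hypothetical names made up for this example, and null handling is omitted.

```scala
// Illustrative only: Employee and countDistincts are hypothetical names,
// and scala.collection Set is used instead of Spark's OpenHashSet.
final case class Employee(departmentId: Int, gender: String, educationLevel: Int)

object MultiDistinctSketch {
  // For each department, build one set per DISTINCT aggregate and
  // report the set sizes as (c0, c1, c2), mirroring the example query.
  def countDistincts(rows: Seq[Employee]): Map[Int, (Int, Int, Int)] =
    rows.groupBy(_.departmentId).map { case (dept, group) =>
      val pairs   = group.map(e => (e.gender, e.educationLevel)).toSet // count(distinct gender, education_level)
      val genders = group.map(_.gender).toSet                          // count(distinct gender)
      val levels  = group.map(_.educationLevel).toSet                  // count(distinct education_level)
      dept -> ((pairs.size, genders.size, levels.size))
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Employee(1, "M", 1),
      Employee(1, "M", 2),
      Employee(1, "F", 1),
      Employee(1, "M", 1), // duplicate row, absorbed by the sets
      Employee(2, "F", 3)
    )
    countDistincts(rows).toSeq.sortBy(_._1).foreach { case (d, (c0, c1, c2)) =>
      println(s"d=$d c0=$c0 c1=$c1 c2=$c2")
    }
  }
}
```

Each group in this sketch holds all of its sets in memory at once. That is one plausible reason a set-based aggregate is harder to fit into a fixed-width hash aggregation buffer, though the PR text itself only states that the sort-based path must be used.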