Yash Datta created SPARK-6006:
---------------------------------

             Summary: Optimize count distinct in case of high-cardinality columns
                 Key: SPARK-6006
                 URL: https://issues.apache.org/jira/browse/SPARK-6006
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.2.1, 1.1.1
            Reporter: Yash Datta
            Priority: Minor
             Fix For: 1.3.0


When a column has many distinct values, count distinct becomes slow because all 
partial results are hashed into a single map. This can be improved by adding an 
intermediate stage that creates buckets (partial maps), where the same key from 
the partial maps of the first stage always hashes to the same bucket. The total 
distinct count is then simply the sum of the sizes of these buckets.
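A minimal sketch of the proposed idea in plain Scala (no Spark dependency; the object and method names are hypothetical, and partitions are simulated as local sequences rather than the actual Spark SQL execution plan):

```scala
object DistinctCountSketch {
  def countDistinct(partitions: Seq[Seq[Int]], numBuckets: Int = 4): Long = {
    // Stage 1: each partition builds its own partial map (here a Set)
    // of the distinct keys it has seen.
    val partialSets: Seq[Set[Int]] = partitions.map(_.toSet)

    // Intermediate stage: route every key to a bucket by hash, so the
    // same key coming from different partial maps lands in the same
    // bucket and is deduplicated there.
    val buckets: Map[Int, Set[Int]] =
      partialSets.flatten
        .groupBy(k => math.floorMod(k.hashCode, numBuckets))
        .map { case (b, ks) => b -> ks.toSet }

    // Final stage: no single global map is needed; the total distinct
    // count is just the sum of the bucket sizes.
    buckets.values.map(_.size.toLong).sum
  }

  def main(args: Array[String]): Unit = {
    // Three simulated partitions with overlapping keys 1..6.
    val parts = Seq(Seq(1, 2, 2, 3), Seq(3, 4, 5), Seq(5, 6, 1))
    println(DistinctCountSketch.countDistinct(parts)) // 6
  }
}
```

Because each bucket can be built and sized independently, the intermediate stage parallelizes across buckets instead of funneling everything into one map.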



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
