Yash Datta created SPARK-6006:
---------------------------------
Summary: Optimize count distinct in case of high cardinality columns
Key: SPARK-6006
URL: https://issues.apache.org/jira/browse/SPARK-6006
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.1, 1.1.1
Reporter: Yash Datta
Priority: Minor
Fix For: 1.3.0
When a column has a large number of distinct values, count distinct becomes very
slow because all partial results are hashed into a single map. It can be improved
by creating buckets (partial maps) in an intermediate stage, where the same key
from the partial maps of the first stage always hashes to the same bucket. The
total distinct count can then be obtained by summing the sizes of these buckets.
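A rough sketch of the idea using the RDD API, not the actual Spark SQL operator;
the bucket count and sample data below are illustrative assumptions:

{code:scala}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object TwoStageCountDistinct {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("two-stage-count-distinct").setMaster("local[*]"))

    // A high-cardinality column: ~750k distinct values across 1M rows, 8 partitions.
    val column = sc.parallelize(1L to 1000000L, 8).map(_ % 750000L)

    val numBuckets = 64 // number of intermediate buckets / partial maps (assumed)

    val totalDistinct = column
      // Stage 1: each partition builds a partial set of its locally distinct keys.
      .mapPartitions(iter => iter.toSet.iterator)
      // Stage 2: hash-partition so the same key from different partial sets
      // always lands in the same bucket, then deduplicate within each bucket.
      .map(key => (key, ()))
      .partitionBy(new HashPartitioner(numBuckets))
      .mapPartitions(iter => Iterator(iter.map(_._1).toSet.size.toLong))
      // Buckets hold disjoint sets of keys, so summing their sizes gives the total.
      .reduce(_ + _)

    println(s"count(distinct col) = $totalDistinct")
    sc.stop()
  }
}
{code}

Because no single map ever has to hold all distinct keys, the work of deduplication
is spread across the buckets instead of concentrating on one reducer.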