[
https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-16484.
----------------------------------
Fix Version/s: 3.5.0
Resolution: Fixed
Issue resolved by pull request 40615
[https://github.com/apache/spark/pull/40615]
> Incremental Cardinality estimation operations with Hyperloglog
> --------------------------------------------------------------
>
> Key: SPARK-16484
> URL: https://issues.apache.org/jira/browse/SPARK-16484
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Yongjia Wang
> Assignee: Ryan Berti
> Priority: Major
> Labels: bulk-closed
> Fix For: 3.5.0
>
>
> Efficient cardinality estimation is very important, and Spark SQL has offered
> approxCountDistinct, based on HyperLogLog, for quite some time. However, there
> is no way to do the estimation incrementally. For example, to keep an
> up-to-date distinct count over the last 90 days, we must re-run the
> aggregation over the entire window every time. A more efficient approach is to
> serialize a counter per smaller time window (such as hourly), so that the
> count for any larger window can be updated incrementally by merging the
> stored counters.
> With support for custom UDAFs, the Binary DataType, and the
> HyperLogLogPlusPlus implementation in the current Spark version, it is
> straightforward to extend this functionality to incremental counting, and
> even to other general set operations such as intersection and set difference.
> The Spark API is already as elegant as it can be, but a custom implementation
> of the aforementioned operations, which should be in high demand, still takes
> considerable effort. I have searched but failed to find a usable existing
> solution or any ongoing effort in this direction. The closest I found is the
> following, but it does not work with Spark 1.6 due to API changes:
> https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala
> I wonder whether it is worth integrating such operations into Spark SQL. The
> only problem I see is that it depends on the serialization format of a
> specific HLL implementation, which introduces compatibility concerns. But as
> long as users are aware of this issue, it should be fine.
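The incremental scheme described above (serialized per-window sketches merged on demand) can be sketched in a few lines. This is a minimal illustrative HyperLogLog, not Spark's HyperLogLogPlusPlus or the sketch functions the resolving PR added; the class name, parameters, and hash choice are assumptions for the example. The point it demonstrates is the merge property: taking the element-wise max of two sketches' registers yields exactly the sketch of the union, so hourly sketches can be stored (e.g. as BinaryType) and combined for any window without rescanning the data.

```python
import hashlib
import math

# Minimal illustrative HyperLogLog sketch -- NOT Spark's implementation;
# names and parameters here are assumptions for the example.
class HLLSketch:
    def __init__(self, p=12):
        self.p = p                 # 2^p registers; larger p -> lower error
        self.m = 1 << p
        self.regs = [0] * self.m

    def add(self, item):
        # 64-bit hash: top p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.regs[idx] = max(self.regs[idx], rank)

    def merge(self, other):
        # Key property: element-wise max of registers equals the sketch of
        # the union, so per-hour sketches combine into any larger window.
        out = HLLSketch(self.p)
        out.regs = [max(a, b) for a, b in zip(self.regs, other.regs)]
        return out

    def serialize(self):
        # Each register fits in one byte, so a sketch can be stored as binary.
        return bytes(self.regs)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.regs)
        zeros = self.regs.count(0)
        if raw <= 2.5 * self.m and zeros:              # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw

# Per-hour sketches, merged on demand for a two-hour window.
hour1, hour2 = HLLSketch(), HLLSketch()
for u in range(1000):
    hour1.add(u)
for u in range(500, 1500):                             # 500 users overlap
    hour2.add(u)
window = hour1.merge(hour2)                            # ~1500 distinct users
```

Union is the only operation the sketch supports natively; an intersection can then be approximated by inclusion-exclusion (|A∩B| ≈ est(A) + est(B) − est(A∪B)), which is the usual route to the other set operations mentioned above, at the cost of compounding error.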
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]