[ https://issues.apache.org/jira/browse/SPARK-53947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gengliang Wang updated SPARK-53947:
-----------------------------------
Parent: SPARK-53885
Issue Type: Sub-task (was: Improvement)
> Let approx_top_k handle NULLs
> -----------------------------
>
> Key: SPARK-53947
> URL: https://issues.apache.org/jira/browse/SPARK-53947
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.1.0
> Reporter: Yuchuan Huang
> Priority: Critical
> Labels: pull-request-available
>
> Spark implements the approx_top_k function with the FrequentItemsSketch
> from Apache DataSketches, which does not count NULL values on its own
> (https://github.com/apache/datasketches-java/blob/main/src/main/java/org/apache/datasketches/frequencies/FrequentItemsSketch.java#L587).
> However, NULL values can be meaningful in some use cases, and users might
> want to include NULL in the approx_top_k output. Therefore, this ticket aims
> to add a nullCounter alongside the FrequentItemsSketch to count NULLs in
> the approx_top_k aggregation.
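The idea in the description can be illustrated with a minimal sketch. This is not Spark's implementation: the class name NullAwareTopK and its methods are hypothetical, and an exact collections.Counter stands in for the approximate FrequentItemsSketch. It only shows how a separate null counter might be kept next to the frequency summary and folded back in when the top-k result is produced.

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple


class NullAwareTopK:
    """Hypothetical wrapper: a frequency summary plus a separate NULL counter."""

    def __init__(self) -> None:
        # Assumption: an exact Counter stands in for the approximate sketch.
        self.items: Counter = Counter()
        # Mirrors the nullCounter proposed in the ticket.
        self.null_count: int = 0

    def update(self, value: Optional[Hashable]) -> None:
        # NULLs never reach the frequency summary; they are tallied separately.
        if value is None:
            self.null_count += 1
        else:
            self.items[value] += 1

    def merge(self, other: "NullAwareTopK") -> None:
        # Partial aggregates combine by adding both the item counts
        # and the null counters.
        self.items.update(other.items)
        self.null_count += other.null_count

    def top_k(self, k: int) -> List[Tuple[Optional[Hashable], int]]:
        # Treat NULL as one more candidate, then rank by count (descending).
        candidates = list(self.items.items())
        if self.null_count > 0:
            candidates.append((None, self.null_count))
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        return candidates[:k]


agg = NullAwareTopK()
for v in ["a", None, "b", None, "a", None]:
    agg.update(v)
print(agg.top_k(2))  # [(None, 3), ('a', 2)]
```

Keeping the null count outside the summary means the sketch itself needs no changes; the cost is that the merge and result steps each have to account for the extra counter.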