[
https://issues.apache.org/jira/browse/SPARK-53947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuchuan Huang updated SPARK-53947:
----------------------------------
Description: Spark uses the FrequentItemsSketch from Apache DataSketches in the
approx_top_k function, and the sketch does not itself count NULL values
(https://github.com/apache/datasketches-java/blob/main/src/main/java/org/apache/datasketches/frequencies/FrequentItemsSketch.java#L587).
However, NULL values can be meaningful in some use cases, and users might want
to include NULL in the approx_top_k output. Therefore, this ticket aims to add
a nullCounter alongside the FrequentItemsSketch to count NULLs in the
approx_top_k aggregation. (was: Spark uses FrequentItemsSketch of Apache
DataSketches in the `approx_top_k` function, which does not consider NULL
values by itself
(https://github.com/apache/datasketches-java/blob/main/src/main/java/org/apache/datasketches/frequencies/FrequentItemsSketch.java#L587).
However, NULL value could be meaningful in some use cases and users might want
to include NULL in the `approx_top_k` output. Therefore, this ticket aims to
add a nullCounter associated with the FrequentItemsSketch in the `approx_top_k`
aggregation. )
> Let approx_top_k handle NULLs
> -----------------------------
>
> Key: SPARK-53947
> URL: https://issues.apache.org/jira/browse/SPARK-53947
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.0
> Reporter: Yuchuan Huang
> Priority: Critical
>
> Spark uses the FrequentItemsSketch from Apache DataSketches in the
> approx_top_k function, and the sketch does not itself count NULL values
> (https://github.com/apache/datasketches-java/blob/main/src/main/java/org/apache/datasketches/frequencies/FrequentItemsSketch.java#L587).
> However, NULL values can be meaningful in some use cases, and users might
> want to include NULL in the approx_top_k output. Therefore, this ticket aims
> to add a nullCounter alongside the FrequentItemsSketch to count NULLs in
> the approx_top_k aggregation.
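
For illustration only, the idea of pairing a frequency summary with a separate
NULL counter could be sketched as below. This is not Spark's implementation.
The class name NullAwareTopK and its methods are hypothetical, and an exact
collections.Counter stands in for FrequentItemsSketch, so the counts here are
exact rather than approximate.

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple


class NullAwareTopK:
    """Hypothetical wrapper: a frequency summary plus a separate NULL counter."""

    def __init__(self) -> None:
        # Stand-in for FrequentItemsSketch (assumption: exact counts, no error bounds).
        self._items: Counter = Counter()
        # NULLs are never passed to the summary, so they are tallied here.
        self._null_count = 0

    def update(self, value: Optional[Hashable]) -> None:
        # Route NULL (None) to the side counter and everything else to the summary.
        if value is None:
            self._null_count += 1
        else:
            self._items[value] += 1

    def merge(self, other: "NullAwareTopK") -> None:
        # Combining partial aggregates must add both the summaries and the NULL counts.
        self._items.update(other._items)
        self._null_count += other._null_count

    def top_k(self, k: int) -> List[Tuple[Optional[Hashable], int]]:
        # Treat NULL as one more candidate, then rank all candidates by count.
        candidates = list(self._items.items())
        if self._null_count > 0:
            candidates.append((None, self._null_count))
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        return candidates[:k]


agg = NullAwareTopK()
for v in ["a", None, "b", None, "a", None]:
    agg.update(v)
print(agg.top_k(2))  # [(None, 3), ('a', 2)]
```

Keeping the NULL count outside the summary means the summary's own update path
is untouched. The cost is that merge and the final ranking both have to
remember the extra counter.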