Re: [PR] [SPARK-55939][SQL] Add built-in DataSketches ItemsSketch (Frequent Items) functions [spark]

via GitHub Fri, 13 Mar 2026 09:23:16 -0700


xiongbo-sjtu commented on PR #54745:
URL: https://github.com/apache/spark/pull/54745#issuecomment-4056345783


   The main motivation is architectural consistency. Every other sketch family 
in Spark (HLL, Theta, Tuple, KLL) follows the *_sketch_agg / *_merge_agg / 
scalar query pattern with opaque BinaryType output designed for table storage 
and multi-level rollup. The approx_top_k family uses a different pattern 
(struct-based state, separate accumulate/combine/estimate) that predates this 
convention. The items_sketch functions bring frequent items into the same 
API shape, with a self-describing binary wire format suitable for 
persisting in tables and merging across time horizons without re-scanning raw 
data.
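   
   To make the three-stage shape concrete, here is a minimal toy sketch in 
plain Python. It is not the DataSketches ItemsSketch algorithm and not the code 
in this PR: it swaps in a simple Misra-Gries summary, and the names 
`sketch_agg`, `merge_agg`, `estimate`, the `k` parameter, and the JSON byte 
encoding are all assumptions chosen for illustration. The only point is that 
the aggregate returns opaque bytes that can be stored, merged later, and 
queried by a scalar function.

```python
import json
from collections import Counter

# Hypothetical toy format tag; the real wire format is defined by DataSketches.
_FORMAT = "toy-misra-gries-v1"


def _encode(k, counts):
    """Serialize the summary to self-describing bytes (format tag + k + counts)."""
    payload = {"format": _FORMAT, "k": k, "counts": counts}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def _decode(blob):
    """Parse bytes produced by _encode and check the format tag."""
    state = json.loads(blob.decode("utf-8"))
    if state.get("format") != _FORMAT:
        raise ValueError("unrecognized sketch format")
    return state


def sketch_agg(items, k=8):
    """Aggregate raw string items into a summary with at most k counters."""
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Misra-Gries step: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return _encode(k, counters)


def merge_agg(blobs):
    """Merge stored summaries without touching the raw data again."""
    states = [_decode(b) for b in blobs]
    k = min(s["k"] for s in states)
    total = Counter()
    for s in states:
        total.update(s["counts"])
    if len(total) > k:
        # Subtract the (k+1)-th largest count and keep only positive counters.
        cut = sorted(total.values(), reverse=True)[k]
        total = Counter({i: c - cut for i, c in total.items() if c > cut})
    return _encode(k, dict(total))


def estimate(blob, item):
    """Scalar query: lower-bound frequency estimate for one item."""
    return _decode(blob)["counts"].get(item, 0)


# Two "daily" sketches persisted as bytes, then rolled up.
day1 = sketch_agg(["a"] * 5 + ["b"] * 3 + ["c"])
day2 = sketch_agg(["a"] * 4 + ["d"] * 2)
merged = merge_agg([day1, day2])
print(estimate(merged, "a"))  # 9 here, since no counters were evicted
```

   In this toy version the estimate is exact only while the number of distinct 
items stays within `k`; once counters are evicted it becomes a lower bound. 
The real sketch has its own error guarantees, which this example does not try 
to reproduce.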
   
   That said, if reviewers feel strongly that we should enhance the existing 
approx_top_k functions instead (e.g., adding point frequency queries, binary 
output format, etc.), I'm open to that direction. Happy to discuss which 
approach the community prefers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

