xiongbo-sjtu commented on PR #54745: URL: https://github.com/apache/spark/pull/54745#issuecomment-4056345783
The main motivation is architectural consistency. Every other sketch family in Spark (HLL, Theta, Tuple, KLL) follows the *_sketch_agg / *_merge_agg / scalar query pattern with opaque BinaryType output designed for table storage and multi-level rollup. The approx_top_k family uses a different pattern (struct-based state, separate accumulate/combine/estimate) that predates this convention. The items_sketch functions bring frequent-items into the same consistent API shape, with a self-describing binary wire format suitable for persisting in tables and merging across time horizons without re-scanning raw data. That said, if reviewers feel strongly that we should enhance the existing approx_top_k functions instead (e.g., adding point frequency queries, binary output format, etc.), I'm open to that direction. Happy to discuss which approach the community prefers.
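To make the three-stage shape concrete, here is a toy, pure-Python illustration of the aggregate / merge / scalar-query pattern with an opaque binary state. It is only a sketch under stated assumptions: the names toy_sketch_agg, toy_merge_agg and toy_estimate are hypothetical, the JSON encoding is a stand-in for a real wire format, and the counts are exact, whereas a real frequent-items sketch is bounded in size and approximate. It is not the implementation in this PR.

```python
import json
from collections import Counter
from typing import Iterable


def toy_sketch_agg(items: Iterable[str]) -> bytes:
    """Aggregate raw items into an opaque binary state.

    Hypothetical stand-in for a *_sketch_agg function; the JSON bytes
    play the role of the BinaryType value stored in a table.
    """
    counts = Counter(items)
    return json.dumps(counts, sort_keys=True).encode("utf-8")


def toy_merge_agg(sketches: Iterable[bytes]) -> bytes:
    """Merge previously stored binary states without touching raw data.

    Hypothetical stand-in for a *_merge_agg function.
    """
    merged = Counter()
    for blob in sketches:
        # Counter.update adds the counts of the decoded mapping.
        merged.update(json.loads(blob.decode("utf-8")))
    return json.dumps(merged, sort_keys=True).encode("utf-8")


def toy_estimate(sketch: bytes, item: str) -> int:
    """Scalar point-frequency query against a binary state."""
    return json.loads(sketch.decode("utf-8")).get(item, 0)


# Two "daily" states are built once and persisted as bytes ...
day1 = toy_sketch_agg(["a", "b", "a"])
day2 = toy_sketch_agg(["a", "c"])

# ... and later rolled up into a "weekly" state from the bytes alone.
week = toy_merge_agg([day1, day2])
```

The point of the shape is that the rollup step consumes only the stored bytes, so a query such as toy_estimate(week, "a") never needs the original rows.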
