cboumalh commented on PR #51298: URL: https://github.com/apache/spark/pull/51298#issuecomment-3146708401
Hi @HyukjinKwon @cloud-fan @gengliangwang, hope you're doing well.

I wanted to resurface this PR (#51298), which adds Theta Sketch support to Spark SQL. It extends the existing HyperLogLog functionality by enabling set operations such as intersection and difference, with full SQL and Python API support. We've been using this heavily at Amazon in production Spark pipelines for scalable set analytics (such as segmentation and churn analysis). The implementation includes tests and benchmarks.

I proposed the idea on the dev mailing list, and it received positive feedback from the original HLL contributors and the DataSketches founder. I noticed this was targeted to Spark 4.1.0 in JIRA, so I wanted to check in and see if there's anything I can do to help move it forward or address concerns.

Happy to walk through any part of the code or design if that would be helpful. Thanks again for your time and for maintaining the project!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

For additional commands, e-mail: reviews-h...@spark.apache.org