Yizhou-Yang commented on PR #11651: URL: https://github.com/apache/gluten/pull/11651#issuecomment-4088781439
ptal again... @jinchengchenghh > > @Yizhou-Yang Thanks for your implementation! Recently I am cherrying pick your PR and I find that the KLL implementation in Gluten has only one level and discards items in odd position when compacting. I am wondering if this implementation can meet the `accuracy` requirement. Will the relative error rate be too high? > > @jiangjiangtian Hi, the code has been fixed. Here's what changed: > > **Changelog:** > > 1. Rewrote KLL sketch with proper multi-level compaction — the old implementation only had one level and discarded odd-position items, which is essentially random downsampling with no error bound guarantee. Now it uses the standard KLL algorithm: items across multiple levels, level-0 inserts, sort-and-halve compaction promoting to higher levels, with each level-_i_ item representing 2^_i_ original values. > 2. Merge correctly combines multi-level sketches and re-compacts, instead of simple concatenation + truncation. > > **Why some tests are disabled:** > > KLL and Spark's native GK (Greenwald-Khanna) are fundamentally different algorithms. Both satisfy the `1/accuracy` error bound, but they produce **different concrete values** at the same percentile boundary. For example, given integers 1–1000, the exact 25th percentile is 250.25 — GK returns 250, KLL may return 251. Both are correct within the error bound, but a strict equality assertion like `assert(result == 250)` will fail. > > For those tests I added TestGluten in the approxpercentile suite. The off-by one problem can't be simply solved by adding more layers in partial result, I tried to not disable any test when developing, but the current version is the best I can do for now. > > The 4 disabled tests (`approx_percentile`, `summary`, `SPARK-35480`, `SPARK-32908`) all use exact-match assertions against GK's specific output. Rather than modifying upstream Spark tests, we excluded them and added a dedicated `GlutenApproximatePercentileQuerySuite` with tolerance-based assertions that validate correctness for both algorithms. > > There are some other tests in collect_list that assumes approx_percentile will fallback, changed to percentile. > > I tested only spark35/spark35smj and their slow versions, hopefully it also passes spark33 etc... PTAL again... @jinchengchenghh -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
