Yizhou-Yang commented on PR #11651:
URL: https://github.com/apache/gluten/pull/11651#issuecomment-4088781439

   ptal again... @jinchengchenghh 
   
   > > @Yizhou-Yang Thanks for your implementation! Recently I am cherrying 
pick your PR and I find that the KLL implementation in Gluten has only one 
level and discards items in odd position when compacting. I am wondering if 
this implementation can meet the `accuracy` requirement. Will the relative 
error rate be too high?
   > 
   > @jiangjiangtian Hi, the code has been fixed. Here's what changed:
   > 
   > **Changelog:**
   > 
   > 1. Rewrote KLL sketch with proper multi-level compaction — the old 
implementation only had one level and discarded odd-position items, which is 
essentially random downsampling with no error bound guarantee. Now it uses the 
standard KLL algorithm: items across multiple levels, level-0 inserts, 
sort-and-halve compaction promoting to higher levels, with each level-_i_ item 
representing 2^_i_ original values.
   > 2. Merge correctly combines multi-level sketches and re-compacts, instead 
of simple concatenation + truncation.
   > 
   > **Why some tests are disabled:**
   > 
   > KLL and Spark's native GK (Greenwald-Khanna) are fundamentally different 
algorithms. Both satisfy the `1/accuracy` error bound, but they produce 
**different concrete values** at the same percentile boundary. For example, 
given integers 1–1000, the exact 25th percentile is 250.25 — GK returns 250, 
KLL may return 251. Both are correct within the error bound, but a strict 
equality assertion like `assert(result == 250)` will fail.
   > 
   > For those tests I added TestGluten in the approxpercentile suite. The 
off-by one problem can't be simply solved by adding more layers in partial 
result, I tried to not disable any test when developing, but the current 
version is the best I can do for now.
   > 
   > The 4 disabled tests (`approx_percentile`, `summary`, `SPARK-35480`, 
`SPARK-32908`) all use exact-match assertions against GK's specific output. 
Rather than modifying upstream Spark tests, we excluded them and added a 
dedicated `GlutenApproximatePercentileQuerySuite` with tolerance-based 
assertions that validate correctness for both algorithms.
   > 
   > There are some other tests in collect_list that assumes approx_percentile 
will fallback, changed to percentile.
   > 
   > I tested only spark35/spark35smj and their slow versions, hopefully it 
also passes spark33 etc...
   
   PTAL again... @jinchengchenghh 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to