ctubbsii commented on issue #5733: URL: https://github.com/apache/accumulo/issues/5733#issuecomment-3189677627
> Maybe the current implementation can still benefit from using Datasketches by using it over the single pass?

Or just benefit from using Datasketches instead of our custom code, is what I was thinking, even if the current implementation is already a single pass.

> After some rough timing calculations averaged over 5 runs for the current approach and an approach using Datasketches, I found that a pass over the keys using Datasketches (updating the `ItemsSketch` with `update()` each iteration) takes over 2x the time of the current single-pass implementation on average. And overall, the computation time of `findSplits` using the new approach takes about 20% longer on average due to this.

Were you passing all the keys through the sketch or just going through the indexes, like in our current implementation? If you used all the data, I wonder if using only the indexes would help.

> There are also other problems with using Datasketches here. For example, shortening the row becomes trickier using Datasketches compared to the current approach. The current approach gets the longest common length of the previous row and the current row, this is harder to do with Datasketches (as Datasketches only uses estimations, no real simple way to get the previous row of the split point we calculate).

I wonder if we could write a custom sketch for this purpose, that just tracks the shortest common length for anything it chooses as a midpoint candidate.

> Another issue is not all split candidates will be added as a split--they need to pass the `rowPredicate`. This is also complicated when using Datasketches.

What is the `rowPredicate` used for? I wonder if that's still needed with a Datasketches approach.
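To make the row-shortening point concrete, here is a minimal sketch of the idea as described in the quote: take the longest common prefix of the previous row and the candidate row, and keep one byte past it. This is not the Accumulo code; `RowShortener`, `commonPrefixLength`, and `shorten` are hypothetical names, and it assumes rows arrive in sorted order so that the previous row sorts before the candidate.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Accumulo or Datasketches.
final class RowShortener {

  // Number of leading bytes that the two rows have in common.
  static int commonPrefixLength(byte[] a, byte[] b) {
    int max = Math.min(a.length, b.length);
    int i = 0;
    while (i < max && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  // Shortest prefix of row that still differs from prevRow:
  // the common prefix plus one more byte, capped at the full row length.
  // Assumes prevRow sorts before row.
  static byte[] shorten(byte[] prevRow, byte[] row) {
    int common = commonPrefixLength(prevRow, row);
    return Arrays.copyOf(row, Math.min(common + 1, row.length));
  }

  // Convenience overload for UTF-8 strings, used only for the example below.
  static String shorten(String prevRow, String row) {
    byte[] shortened = shorten(prevRow.getBytes(StandardCharsets.UTF_8),
        row.getBytes(StandardCharsets.UTF_8));
    return new String(shortened, StandardCharsets.UTF_8);
  }
}
```

For example, with previous row `apple_0001` and candidate `apricot_9`, the common prefix is `ap`, so the shortened split would be `apr`. The catch raised above is that this needs the actual previous row, which a quantile estimate does not readily provide.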