ctubbsii commented on issue #5733:
URL: https://github.com/apache/accumulo/issues/5733#issuecomment-3189677627

   > Maybe the current implementation can still benefit from using Datasketches 
by using it over the single pass?
   
   What I had in mind was simply replacing our custom code with Datasketches, 
even if the current implementation is already a single pass.
   
   > After some rough timing calculations averaged over 5 runs for the current 
approach and an approach using Datasketches, I found that a pass over the keys 
using Datasketches (updating the `ItemsSketch` with `update()` each iteration) 
takes over 2x the time of the current single-pass implementation on average. 
And overall, the computation time of `findSplits` using the new approach takes 
about 20% longer on average due to this.
   
   Were you passing all the keys through the sketch or just going through the 
indexes, like in our current implementation? If you used all the data, I wonder 
if using only the indexes would help.
   
   > There are also other problems with using Datasketches here. For example, 
shortening the row becomes trickier using Datasketches compared to the current 
approach. The current approach gets the longest common length of the previous 
row and the current row, this is harder to do with Datasketches (as 
Datasketches only uses estimations, no real simple way to get the previous row 
of the split point we calculate).
   
   I wonder if we could write a custom sketch for this purpose, one that just 
tracks the common prefix length with the preceding row for anything it chooses 
as a midpoint candidate.
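
   To make the idea concrete, here is a minimal sketch of the row-shortening 
step such a sketch would need to support. It assumes rows are byte arrays seen 
in sorted order and that the previous row is available alongside each 
candidate. The names `RowShortener`, `commonPrefixLength`, and `shorten` are 
hypothetical and are not taken from the Accumulo or Datasketches code.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class RowShortener {

  // Number of leading bytes that the two rows share.
  static int commonPrefixLength(byte[] a, byte[] b) {
    int limit = Math.min(a.length, b.length);
    int i = 0;
    while (i < limit && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  // Keeps the shared prefix plus one more byte of the current row.
  // Assumes prev sorts at or before current, so the result still sorts
  // after prev whenever the two rows differ.
  static byte[] shorten(byte[] prev, byte[] current) {
    int common = commonPrefixLength(prev, current);
    int length = Math.min(common + 1, current.length);
    return Arrays.copyOf(current, length);
  }
}
```

   With a previous row of `apple` and a candidate row of `apricot`, the shared 
prefix is `ap`, so the candidate would be shortened to `apr`. A custom sketch 
would need to retain the previous row (or just this prefix length) for each 
item it keeps.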
   
   > Another issue is not all split candidates will be added as a split--they 
need to pass the `rowPredicate`. This is also complicated when using 
Datasketches.
   
   What is the `rowPredicate` used for? I wonder if that's still needed with a 
Datasketches approach.


