[I] Idea: Use datasketches to compute a "mean row key" when writing an RFile to compute split points [accumulo]

via GitHub Wed, 09 Jul 2025 14:54:20 -0700


ctubbsii opened a new issue, #5733:
URL: https://github.com/apache/accumulo/issues/5733


   This is just a rough idea, so it isn't fully fleshed out, but I wanted to 
outline it here, in case there's interest in pursuing it.
   
   Datasketches is a library we use for the GenerateSplits utility, to compute 
split points over a number of files. We could also leverage it to pre-compute 
each file's midpoint row key as the file is written, to make splits faster.
   
   1. Use Datasketches library to compute the midpoint row key as a file is 
written (using only index data is fine)
   2. Store the midpoint row key in the RFile metadata, or in the metadata 
table with the file entry along with the file size
   3. When splitting a tablet, read the midpoints and compute a new weighted 
mean midpoint (weighted by the file sizes)
   4. Use the weighted mean midpoint as the new split point
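
   A rough sketch of steps 3 and 4, under some loud assumptions: `FileMidpoint` 
and `weightedMedian` are hypothetical names, not Accumulo APIs; row keys are 
modeled as `String` rather than raw bytes; and since an arithmetic mean over 
row keys is not defined above, this swaps in a weighted median (the stored 
midpoint at which the cumulative file size reaches half the total).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SplitPointSketch {

  // Hypothetical holder for what step 2 would store per file; not an Accumulo type.
  static final class FileMidpoint {
    final String midRow;   // precomputed midpoint row key (String for simplicity)
    final long sizeBytes;  // file size, used as the weight

    FileMidpoint(String midRow, long sizeBytes) {
      this.midRow = midRow;
      this.sizeBytes = sizeBytes;
    }
  }

  // Weighted median: sort the per-file midpoints, then return the first one at
  // which the running total of file sizes reaches half of the overall total.
  static String weightedMedian(List<FileMidpoint> files) {
    if (files.isEmpty()) {
      throw new IllegalArgumentException("no files to split on");
    }
    List<FileMidpoint> sorted = new ArrayList<>(files);
    sorted.sort(Comparator.comparing((FileMidpoint f) -> f.midRow));

    long total = 0;
    for (FileMidpoint f : sorted) {
      total += f.sizeBytes;
    }

    long running = 0;
    for (FileMidpoint f : sorted) {
      running += f.sizeBytes;
      if (2 * running >= total) {
        return f.midRow;
      }
    }
    // Not reached for non-negative sizes; kept as a safe fallback.
    return sorted.get(sorted.size() - 1).midRow;
  }

  public static void main(String[] args) {
    List<FileMidpoint> files = List.of(
        new FileMidpoint("row_m", 100),
        new FileMidpoint("row_c", 10),
        new FileMidpoint("row_x", 20));
    // The large file dominates, so its midpoint is chosen as the split point.
    System.out.println(weightedMedian(files));
  }
}
```

   Only the stored midpoints and sizes are read here, which is the point of the 
idea: no file or index needs to be opened at split time. A real version would 
compare rows in byte order rather than `String` order.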
   
   With this, there is no need to read any files to compute a new split point, 
which could make splitting much faster.
   
   One downside is that pre-computed midpoints would not account for fencing: 
a fenced file contributes only part of its range to the tablet, so its stored 
midpoint and full file size could skew the weights.
   
   An alternative would be to keep computing the midpoints at the last moment, 
instead of precomputing them and storing them in the metadata, but to make 
that more efficient by using datasketches to do it in fewer passes over the 
indexes.


