ctubbsii opened a new issue, #5733: URL: https://github.com/apache/accumulo/issues/5733
This is just a rough idea, so it isn't fully fleshed out, but I wanted to outline it here in case there's interest in pursuing it.

Datasketches is a library we use for the GenerateSplits utility to compute split points over a number of files. We could also leverage it to pre-compute a midpoint for each file as we write it, to make splits faster.

1. Use the Datasketches library to compute the midpoint row key as a file is written (using only index data is fine)
2. Store the midpoint row key in the RFile metadata, or in the metadata table with the file entry, along with the file size
3. When splitting a tablet, read the midpoints and compute a new weighted mean midpoint (weighted by the file sizes)
4. Use the weighted mean midpoint as the new split point

With this, there is no need to read any files to compute a new split point, which could make splitting much faster. One downside is that pre-computed midpoints would not account for fencing, and that could skew the weights.

An alternative would be to keep computing the midpoints at the last moment, instead of precomputing them and storing them in the metadata, but to make that more efficient by using Datasketches to do it in fewer passes over the indexes.
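
To make steps 3 and 4 a little more concrete, here is a minimal sketch of combining per-file midpoints into one split point. It is only an illustration under several assumptions: `SplitPointSketch`, `FileMidpoint`, and `weightedMidpoint` are hypothetical names, not existing Accumulo classes; rows are modeled as Java `String`s, whereas real row keys are byte sequences; and because row keys have no arithmetic mean, the sketch substitutes a size-weighted median for the "weighted mean" described above. It does not use Datasketches and does not account for fencing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SplitPointSketch {

  /** Hypothetical per-file entry: a pre-computed midpoint row and the file size in bytes. */
  static final class FileMidpoint {
    final String midRow;
    final long sizeBytes;

    FileMidpoint(String midRow, long sizeBytes) {
      this.midRow = midRow;
      this.sizeBytes = sizeBytes;
    }
  }

  /**
   * Returns the size-weighted median of the per-file midpoints: the smallest midpoint row
   * at which the cumulative file size reaches at least half of the total size.
   */
  static String weightedMidpoint(List<FileMidpoint> files) {
    if (files.isEmpty()) {
      throw new IllegalArgumentException("no files to compute a split point from");
    }

    // Sort a copy by row so that cumulative size follows row order.
    // Assumption: String ordering stands in for Accumulo's byte-wise row ordering.
    List<FileMidpoint> sorted = new ArrayList<>(files);
    sorted.sort(Comparator.comparing((FileMidpoint f) -> f.midRow));

    long total = 0;
    for (FileMidpoint f : sorted) {
      total += f.sizeBytes;
    }

    long running = 0;
    for (FileMidpoint f : sorted) {
      running += f.sizeBytes;
      if (2 * running >= total) {
        return f.midRow;
      }
    }
    // Not reached for non-negative sizes; kept as a defensive fallback.
    return sorted.get(sorted.size() - 1).midRow;
  }

  public static void main(String[] args) {
    List<FileMidpoint> files = List.of(
        new FileMidpoint("row_m", 300),
        new FileMidpoint("row_c", 200),
        new FileMidpoint("row_t", 300));
    System.out.println(weightedMidpoint(files)); // prints row_m
  }
}
```

The only inputs are the stored midpoint and size for each file, so nothing here needs to open an RFile, which is the point of the proposal. How well such a combined value approximates the true tablet midpoint depends on how the files' row ranges overlap, and that is not modeled here.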