kevinrr888 commented on issue #5733: URL: https://github.com/apache/accumulo/issues/5733#issuecomment-3133483165
I no longer think this is a workable ticket.

Let's look at the first approach. We:
* Use Datasketches to compute the midpoint of an RFile as it is written
* Use these midpoints, weighted by RFile size, to compute split points at automatic split point computation time

A detrimental downside of this is that the midpoints and weights become outdated with subsequent splits and merges, and it doesn't seem like there is a way to address this without opening and reading RFiles again (which is what we were trying to avoid). Here is an example to outline this:

```
------------------------
Tablet0 Range: (-inf, inf)
rf1 with rows 1, 2, 3
  mid point of rf1 = 2
rf2 with rows 4, 5, 6
  mid point of rf2 = 5
rf3 with rows 7, 8, 9
  mid point of rf3 = 8
------------------------
          |
          | split at 5
          V
------------------------
Tablet1 Range: (-inf, 5]
rf1 with rows 1, 2, 3
  mid point of rf1 = 2
rf2 with rows 4, 5, 6   <--- NOTE: this rfile has keys which lie beyond the range
  mid point of rf2 = 5
------------------------
Tablet2 Range: (5, inf)
rf2 with rows 4, 5, 6   <--- NOTE: this rfile has keys which lie beyond the range
  mid point of rf2 = 5
rf3 with rows 7, 8, 9
  mid point of rf3 = 8
------------------------
```

When we split a tablet in the middle of some RFile, both resulting tablets are assigned that RFile (rf2 in this case). Now suppose an automatic split is calculated for Tablet2 at this point, and assume rf2 is the larger of rf2 and rf3. The midpoint of rf2 would then be the preferred split, so we would split Tablet2 at 5. This is wrong, as 5 lies outside Tablet2's range of (5, inf).

To address this, after the split at 5 we would have to open rf2 for each tablet, compute a new midpoint restricted to the tablet's range, and also estimate how much of rf2 lies within that range so we can assign it an appropriate weight. This added complexity, while still needing to open RFiles for splits, may be a deal-breaker for the approach...
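The stale-midpoint problem above can be shown in a few lines. This is a minimal sketch, not Accumulo code: `FileInfo`, `pickSplit`, and the hard-coded weights are all illustrative, assuming the split candidate is simply the precomputed midpoint of the heaviest file.

```java
import java.util.Comparator;
import java.util.List;

public class MidpointStaleness {

    // Illustrative stand-in for per-file split metadata: a precomputed
    // midpoint row and a weight equal to the file's size.
    public record FileInfo(String name, String midpoint, long weight) {}

    // Pick the midpoint of the heaviest file as the split candidate.
    public static String pickSplit(List<FileInfo> files) {
        return files.stream()
                .max(Comparator.comparingLong(FileInfo::weight))
                .map(FileInfo::midpoint)
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Tablet2 covers (5, +inf) after the earlier split at row 5, but it
        // still references rf2 because rf2 straddles the split point.
        FileInfo rf2 = new FileInfo("rf2", "5", 300); // largest file
        FileInfo rf3 = new FileInfo("rf3", "8", 200);

        String split = pickSplit(List.of(rf2, rf3));
        System.out.println("proposed split: " + split);

        // Row "5" does not sort after the tablet's low bound of "5", so the
        // precomputed midpoint proposes a split outside the tablet's range.
        System.out.println("inside tablet range: " + (split.compareTo("5") > 0));
    }
}
```

Fixing this would require recomputing rf2's midpoint and in-range size for each child tablet, which is exactly the file-reading work the approach was meant to avoid.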
The entire benefit of this approach was avoiding opening and reading RFiles at split time.

Let's look at the second approach.

> An alternative to this would be simply to continue to compute the midpoints at the last moment instead of precomputing them and storing them in the metadata, but make it more efficient by using datasketches to do it in fewer passes over the indexes.

The current approach already does only one pass over the indexes: see `SplitUtils.findSplits`. Maybe the current implementation could still benefit from applying Datasketches within that single pass?

After some rough timing measurements averaged over 5 runs for the current approach and an approach using Datasketches, I found that a pass over the keys using Datasketches (updating the `ItemsSketch` with `update()` on each iteration) takes over 2x the time of the current single-pass implementation on average. Overall, the computation time of `findSplits` with the Datasketches approach is about 20% longer on average as a result.

There are also other problems with using Datasketches here. For example, shortening the row becomes trickier than in the current approach. The current approach computes the longest common length of the previous row and the current row; this is harder to do with Datasketches, since Datasketches only provides estimates and there is no simple way to get the previous row of the split point we calculate. Another issue is that not all split candidates will be added as splits: they must pass the `rowPredicate`, which is also complicated to apply when using Datasketches.

For these reasons, I don't think this ticket is workable, and it may be closed unless anyone has other ideas or suggestions.
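To illustrate the shortening point: in a sequential pass the previous row is always at hand, so a split row can be cut down to the longest common prefix with the previous row plus one character. A quantile sketch returns only an approximate split row with no notion of "the row before it", so this step is lost. The following is a hedged sketch of the idea, not `SplitUtils.findSplits` itself; the method name and byte-vs-char handling are simplified for illustration.

```java
public class RowShortening {

    // Shorten splitRow to the longest common prefix it shares with prevRow,
    // plus one character. The result still sorts after prevRow, so it remains
    // a valid split point, but it is as short as possible.
    public static String shorten(String prevRow, String splitRow) {
        int common = 0;
        int max = Math.min(prevRow.length(), splitRow.length());
        while (common < max && prevRow.charAt(common) == splitRow.charAt(common)) {
            common++;
        }
        return splitRow.substring(0, Math.min(common + 1, splitRow.length()));
    }

    public static void main(String[] args) {
        String prev = "accumulo-row-000199";
        String split = "accumulo-row-000200";
        // The long common prefix collapses: only one character past it is kept.
        System.out.println(shorten(prev, split)); // prints "accumulo-row-0002"
    }
}
```

With a sketch-produced quantile there is no adjacent predecessor row to compare against, so this optimization (and applying the `rowPredicate` filter per candidate) would need extra bookkeeping that the single-pass approach gets for free.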
-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.