kevinrr888 commented on issue #5733: URL: https://github.com/apache/accumulo/issues/5733#issuecomment-3133483165
I no longer think this is a workable ticket.

Let's look at the first approach. We:
* Use Datasketches to compute the midpoint of an RFile as it is written
* Use these midpoints, weighted by RFile size, to compute split points at automatic split point computation time

A detrimental downside of this is that the midpoints and weights become outdated with subsequent splits and merges, and it doesn't seem like there is a way to address this without opening and reading RFiles again (which is what we were trying to avoid). Here is an example to outline this:

```
------------------------
Tablet0 Range: (-inf, inf)
rf1 with rows 1, 2, 3
  mid point of rf1 = 2
rf2 with rows 4, 5, 6
  mid point of rf2 = 5
rf3 with rows 7, 8, 9
  mid point of rf3 = 8
------------------------
          |
          | split at 5
          V
------------------------
Tablet1 Range: (-inf, 5]
rf1 with rows 1, 2, 3
  mid point of rf1 = 2
rf2 with rows 4, 5, 6   <--- NOTE: this rfile has keys which lie beyond the range
  mid point of rf2 = 5
------------------------
Tablet2 Range: (5, inf)
rf2 with rows 4, 5, 6   <--- NOTE: this rfile has keys which lie beyond the range
  mid point of rf2 = 5
rf3 with rows 7, 8, 9
  mid point of rf3 = 8
------------------------
```

When we split a tablet in the middle of some RFile, both resulting tablets are assigned that RFile (rf2 in this case). Now suppose an automatic split is calculated for Tablet2 at this point, and assume rf2 is the larger of rf2 and rf3. The midpoint of rf2 would then be the preferred split, so we would split Tablet2 at 5. This is wrong, as 5 lies outside Tablet2's range of (5, inf).

To address this, after the split at 5 we would have to open rf2 for each tablet, compute a new midpoint restricted to the tablet's range, and also estimate how much of rf2 lies within that range so we can assign it an appropriate weight. This added complexity, while still needing to open RFiles for splits, may be a deal-breaker for the approach...
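The stale-midpoint problem above can be shown in a few lines. This is a minimal sketch, not Accumulo code: `FileInfo`, `pickSplit`, and the hard-coded weights are all illustrative, assuming the split candidate is simply the precomputed midpoint of the heaviest file.

```java
import java.util.Comparator;
import java.util.List;

public class MidpointStaleness {

    // Illustrative stand-in for per-file split metadata: a precomputed
    // midpoint row and a weight equal to the file's size.
    public record FileInfo(String name, String midpoint, long weight) {}

    // Pick the midpoint of the heaviest file as the split candidate.
    public static String pickSplit(List<FileInfo> files) {
        return files.stream()
                .max(Comparator.comparingLong(FileInfo::weight))
                .map(FileInfo::midpoint)
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Tablet2 covers (5, +inf) after the earlier split at row 5, but it
        // still references rf2 because rf2 straddles the split point.
        FileInfo rf2 = new FileInfo("rf2", "5", 300); // largest file
        FileInfo rf3 = new FileInfo("rf3", "8", 200);

        String split = pickSplit(List.of(rf2, rf3));
        System.out.println("proposed split: " + split);

        // Row "5" does not sort after the tablet's low bound of "5", so the
        // precomputed midpoint proposes a split outside the tablet's range.
        System.out.println("inside tablet range: " + (split.compareTo("5") > 0));
    }
}
```

Fixing this would require recomputing rf2's midpoint and in-range size for each child tablet, which is exactly the file-reading work the approach was meant to avoid.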
The entire benefit of this approach was avoiding opening and reading RFiles at split time.

Let's look at the second approach.

> An alternative to this would be simply to continue to compute the midpoints at the last moment instead of precomputing them and storing them in the metadata, but make it more efficient by using datasketches to do it in fewer passes over the indexes.

The current approach already does only one pass over the indexes: see `SplitUtils.findSplits`. Maybe the current implementation could still benefit from applying Datasketches within that single pass?

After some rough timing measurements averaged over 5 runs for the current approach and an approach using Datasketches, I found that a pass over the keys using Datasketches (updating the `ItemsSketch` with `update()` on each iteration) takes over 2x the time of the current single-pass implementation on average. Overall, the computation time of `findSplits` with the Datasketches approach is about 20% longer on average as a result.

There are also other problems with using Datasketches here. For example, shortening the row becomes trickier than in the current approach. The current approach computes the longest common length of the previous row and the current row; this is harder to do with Datasketches, since Datasketches only provides estimates and there is no simple way to get the previous row of the split point we calculate. Another issue is that not all split candidates will be added as splits: they must pass the `rowPredicate`, which is also complicated to apply when using Datasketches.

For these reasons, I don't think this ticket is workable, and it may be closed unless anyone has other ideas or suggestions.
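To illustrate the shortening point: in a sequential pass the previous row is always at hand, so a split row can be cut down to the longest common prefix with the previous row plus one character. A quantile sketch returns only an approximate split row with no notion of "the row before it", so this step is lost. The following is a hedged sketch of the idea, not `SplitUtils.findSplits` itself; the method name and byte-vs-char handling are simplified for illustration.

```java
public class RowShortening {

    // Shorten splitRow to the longest common prefix it shares with prevRow,
    // plus one character. The result still sorts after prevRow, so it remains
    // a valid split point, but it is as short as possible.
    public static String shorten(String prevRow, String splitRow) {
        int common = 0;
        int max = Math.min(prevRow.length(), splitRow.length());
        while (common < max && prevRow.charAt(common) == splitRow.charAt(common)) {
            common++;
        }
        return splitRow.substring(0, Math.min(common + 1, splitRow.length()));
    }

    public static void main(String[] args) {
        String prev = "accumulo-row-000199";
        String split = "accumulo-row-000200";
        // The long common prefix collapses: only one character past it is kept.
        System.out.println(shorten(prev, split)); // prints "accumulo-row-0002"
    }
}
```

With a sketch-produced quantile there is no adjacent predecessor row to compare against, so this optimization (and applying the `rowPredicate` filter per candidate) would need extra bookkeeping that the single-pass approach gets for free.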
-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.