keith-turner commented on issue #3823:
URL: https://github.com/apache/accumulo/issues/3823#issuecomment-1781940214
Trying to work through the possibility of allowing bulk import and merge to
run concurrently, and it seems like it would be complex and not worth the
effort. For example, consider the situation where a bulk import wants to load
file F1 into tablets T1, T2, and T3. Concurrently, a merge wants to merge
tablets T1, T2, and T3.
* bulk load F1 into T1 adding a loaded marker
* merge code sets an operation id on tablet T2
* bulk load F1 into T3 adding a loaded marker
Since the merge operation set an opid on T2, the bulk operation cannot load
into it. If the merge operation were to wait on the loaded markers on T1 and
T3, then deadlock would happen, because the bulk import cannot finish until it
loads into T2. If the merge operation were to continue and
set operation ids on T1 and T3 and then proceed with the merge, then it would need
to merge the loaded markers. If T1, T2, and T3 were merged into tablet T4, then T4
would need load markers with ranges so that the bulk import code can reason
about the missing T2 range that is needed to complete the bulk import.
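
A minimal sketch of the interleaving above, using a made-up in-memory model. The `Tablet` class, `bulk_load`, and `set_opid` are hypothetical names for illustration only and are not Accumulo APIs. It only shows why each side ends up waiting on the other if merge waits for loaded markers to go away.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Tablet:
    """Hypothetical stand-in for a tablet's metadata (not an Accumulo class)."""
    name: str
    loaded: Set[str] = field(default_factory=set)  # loaded markers, by file name
    opid: Optional[str] = None                     # operation id, if one is set


def bulk_load(tablet: Tablet, file_name: str) -> bool:
    """Add a loaded marker, unless another operation has set an opid."""
    if tablet.opid is not None:
        return False
    tablet.loaded.add(file_name)
    return True


def set_opid(tablet: Tablet, opid: str) -> bool:
    """Set an operation id, unless one is already present."""
    if tablet.opid is not None:
        return False
    tablet.opid = opid
    return True


t1, t2, t3 = Tablet("T1"), Tablet("T2"), Tablet("T3")

steps = [
    bulk_load(t1, "F1"),    # bulk load F1 into T1, adding a loaded marker
    set_opid(t2, "merge"),  # merge sets an operation id on T2
    bulk_load(t3, "F1"),    # bulk load F1 into T3, adding a loaded marker
    bulk_load(t2, "F1"),    # rejected: T2 has an opid
]

# Bulk import cannot finish until T2 accepts F1.
bulk_blocked = t2.opid is not None and "F1" not in t2.loaded
# Assumption: merge waits until T1 and T3 have no loaded markers.
merge_blocked = any(t.loaded for t in (t1, t3))
deadlock = bulk_blocked and merge_blocked
```

Here `deadlock` ends up true: the bulk import is stuck on T2 and the merge is stuck on the markers in T1 and T3.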
Adding ranges to the load markers introduces a lot of complexity into the
bulk import code when there is not a strong need for it. It would be easier to
somehow prevent bulk import and merge from running concurrently. Both are
quick metadata operations (unlike compactions), so there is no performance
penalty to preventing them from running concurrently. Without table locks, a
possible way to do this is to add a new metadata column that prevents merges
from starting. Bulk import could then do the following.
* For each tablet in the bulk import range set a new column that prevents
merge from starting.
* If the merge block was successfully set, then proceed with the bulk
import w/o having to worry about partial load of data and a merge happening.
If the merge block and merge opid are both observed, then one operation
needs to back out its changes and wait to let the other operation start.
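
A rough sketch of how the merge block could behave, again with a hypothetical in-memory model. `TabletMeta`, `merge_block`, `start_bulk_import`, and `start_merge` are invented names, not existing Accumulo code. It models a single bulk import and a single merge, and assumes that whichever operation observes the conflict is the one that backs out. A real implementation would need each check-and-set to be atomic per tablet, for example via conditional mutations.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TabletMeta:
    """Hypothetical per-tablet metadata (not an Accumulo class)."""
    merge_block: Optional[str] = None  # assumed new column: id of the blocking bulk import
    opid: Optional[str] = None         # operation id set by a merge


def start_bulk_import(tablets: Dict[str, TabletMeta], names: List[str], bulk_id: str) -> bool:
    """Set the merge block on every tablet, backing out if a merge opid is seen."""
    placed = []
    for name in names:
        tablet = tablets[name]
        if tablet.opid is not None:
            for done in placed:  # back out so the merge can proceed
                tablets[done].merge_block = None
            return False
        tablet.merge_block = bulk_id
        placed.append(name)
    return True


def start_merge(tablets: Dict[str, TabletMeta], names: List[str], merge_id: str) -> bool:
    """Set the opid on every tablet, backing out if a merge block is seen."""
    reserved = []
    for name in names:
        tablet = tablets[name]
        if tablet.merge_block is not None:
            for done in reserved:  # back out so the bulk import can proceed
                tablets[done].opid = None
            return False
        tablet.opid = merge_id
        reserved.append(name)
    return True


names = ["T1", "T2", "T3"]
tablets = {name: TabletMeta() for name in names}

bulk_started = start_bulk_import(tablets, names, "bulk-1")
merge_started = start_merge(tablets, names, "merge-1")   # sees the block, backs out
opids_after_backout = [tablets[n].opid for n in names]

for name in names:                                       # bulk import finishes
    tablets[name].merge_block = None
merge_retry = start_merge(tablets, names, "merge-1")     # now allowed to start
```

With the block in place, the bulk import never has to reason about a merge happening between its per-tablet loads; the merge simply retries once the blocks are gone.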