dlmarion commented on PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#issuecomment-1665542495

   I looked into this path where the minor compaction thread was dying because 
an invalid classloader context could not load the volume chooser to create the 
minor compaction output file.  The issue is more serious than I initially 
thought, but thankfully the fix is easy.
   
   TLDR - A failed minor compaction thread likely means that subsequent 
minor compactions for that Tablet will **not** occur.  The fix is to catch 
`Exception` rather than `IOException` in `MinorCompactionTask` when creating 
the output file, because in this scenario a `RuntimeException` is thrown.
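
   To illustrate why the catch type matters, here is a minimal standalone 
sketch. It is not the Accumulo code: `CatchScopeSketch`, `createOutputFile`, 
`narrowCatch` and `broadCatch` are hypothetical names, and the assumption is 
only that the invalid context surfaces as an unchecked exception.

```java
import java.io.IOException;

class CatchScopeSketch {

  // Hypothetical stand-in for the call that creates the minor compaction output file.
  // Assumption: an invalid context surfaces as an unchecked exception.
  static String createOutputFile(boolean invalidContext) throws IOException {
    if (invalidContext) {
      throw new IllegalStateException("could not load volume chooser");
    }
    return "hypothetical-output-file";
  }

  // Catches only IOException, so the unchecked exception escapes to the caller
  // (in a pool thread, this is what ends the task).
  static String narrowCatch(boolean invalidContext) {
    try {
      return createOutputFile(invalidContext);
    } catch (IOException e) {
      return "handled: " + e.getMessage();
    }
  }

  // Catches Exception, so both checked and unchecked failures are handled here.
  static String broadCatch(boolean invalidContext) {
    try {
      return createOutputFile(invalidContext);
    } catch (Exception e) {
      return "handled: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(broadCatch(true));
    try {
      narrowCatch(true);
    } catch (RuntimeException e) {
      System.out.println("escaped: " + e.getMessage());
    }
  }
}
```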
   
   
   Longer analysis
   ---
   
   `TableOperationsImpl.flush` calls 
`ManagerClientServiceHandler.initiateFlush` which increments
   the value of the table's `ZTABLE_FLUSH_ID` node in ZooKeeper, and returns the 
new value. Then
   `TableOperationsImpl.flush` calls `ManagerClientServiceHandler.waitForFlush` 
passing in the flushId.
   `ManagerClientServiceHandler.waitForFlush` initially calls 
`TabletClientHandler.flush` on *every tablet server in the cluster*. If the 
client is not waiting for the flush to complete, then this returns and 
`TableOperationsImpl.flush` is done. If the client is waiting, then 
`ManagerClientServiceHandler` looks for tablets that are hosted or have walogs 
and whose flushId in the tablet metadata is less than the flushId returned from 
`initiateFlush`. Then `waitForFlush` continues to call 
`TabletClientHandler.flush` on the hosts that have tablets meeting these 
criteria, for up to `Long.MAX_VALUE` iterations.
   
   `TabletClientHandler.flush` does nothing if the TabletServer does not have 
any tablets for the table. Otherwise it gets the current value of the table's 
`ZTABLE_FLUSH_ID`, which should be the same as the flushId returned at the 
beginning of this process. `Tablet.flush` is called on each of the Tablets for 
the Table on this TabletServer.
   
   If the Tablet has no entries in memory, then the tablet metadata is updated 
with the flushId and the variable `lastFlushID` is set to the flushId. If 
`Tablet.flush()` was called again on this Tablet with the same flushID, then it 
would do nothing.
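
   The empty-memory path above can be sketched as a toy model. This is not 
the real `Tablet` class: `FlushIdSketch` and `metadataUpdates` are 
hypothetical, and treating any flushId at or below `lastFlushID` as already 
handled is an assumption (the analysis only covers the same flushId).

```java
class FlushIdSketch {
  // Mirrors the lastFlushID variable described above; -1 means "never flushed".
  private long lastFlushID = -1;

  // Hypothetical counter standing in for writes to the tablet metadata.
  int metadataUpdates = 0;

  // Models only a tablet with no entries in memory.
  // Returns true if this call did any work.
  synchronized boolean flush(long tableFlushID) {
    if (lastFlushID >= tableFlushID) {
      // Assumption: a flushId that was already recorded needs no further work.
      return false;
    }
    metadataUpdates++;          // record the flushId in the (simulated) metadata
    lastFlushID = tableFlushID; // remember it so a repeat call does nothing
    return true;
  }

  public static void main(String[] args) {
    FlushIdSketch tablet = new FlushIdSketch();
    System.out.println(tablet.flush(5L));
    System.out.println(tablet.flush(5L));
    System.out.println(tablet.metadataUpdates);
  }
}
```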
   
   If the Tablet has entries in memory, then `Tablet.initiateMinorCompaction` 
is called, which may or may not create a MinorCompactionTask. If one is 
created, then it is executed in the minorCompactionThreadPool. 
`Tablet.initiateMinorCompaction` will not return a MinorCompactionTask if the 
tablet is closing, is closed, if there are no entries in memory, if there is 
another thread updating `lastFlushID`, or if the Tablet's memory is already 
reserved for a minor compaction (meaning if one is already running).
   
   When executed, `MinorCompactionTask` creates a new output file and calls 
`TabletServer.minorCompactionStarted`, then it calls `Tablet.minorCompact()`. 
`Tablet.minorCompact` performs the minor compaction, then brings the minor 
compaction online by calling `DatafileManager.bringMinorCompactionOnline`, then 
calls `TabletMemory.finalizeMinC`. `DatafileManager.bringMinorCompactionOnline` 
sets `lastFlushID` and calls `Tablet.flushComplete` which sets `lastFlushID` 
and calls `TabletMemory.finishedMinC`.
   
   If `TabletMemory.finalizeMinC` and `TabletMemory.finishedMinC` are not 
called, then the tablet memory still appears to be reserved for a minor 
compaction and a subsequent minor compaction will *never* start. This appears 
to be what is happening when there is an invalid context on the table, as the 
call to create a new output file throws a `RuntimeException`. Only 
`IOException` is caught at this location, so the thread dies due to the 
uncaught exception.
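
   A toy model of the stuck reservation described above, assuming the 
reservation is a simple flag that is cleared only when the task finishes 
normally. `ReservationSketch`, `initiate` and `finish` are hypothetical names, 
not Accumulo APIs.

```java
class ReservationSketch {
  // Stands in for "the tablet memory is reserved for a minor compaction".
  private boolean memoryReserved = false;

  // Loose analogue of initiateMinorCompaction: returns null when a
  // compaction already appears to be running.
  synchronized Runnable initiate(boolean outputFileCreationFails) {
    if (memoryReserved) {
      return null;
    }
    memoryReserved = true;
    return () -> {
      if (outputFileCreationFails) {
        // Unchecked failure before any cleanup runs, as with the invalid context.
        throw new IllegalStateException("invalid context");
      }
      finish();
    };
  }

  // Loose analogue of the finalize/finished calls that release the memory.
  synchronized void finish() {
    memoryReserved = false;
  }

  public static void main(String[] args) {
    ReservationSketch tablet = new ReservationSketch();
    Runnable first = tablet.initiate(true);
    try {
      first.run();
    } catch (RuntimeException e) {
      System.out.println("task died: " + e.getMessage());
    }
    // The flag was never cleared, so no new task is handed out.
    System.out.println("next task is null: " + (tablet.initiate(false) == null));
  }
}
```

   In this model the only way out is for the failing path to reach the 
cleanup, which is why where the exception is caught matters.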
   

