keith-turner commented on issue #4251:
URL: https://github.com/apache/accumulo/issues/4251#issuecomment-1939128844

   Created #4255 as a fix for this.  Below are some notes from debugging this.  
After the test got stuck, noticed the tserver could not recover its write ahead 
logs.  The following warn was from the tserver logs
   
   ```
   2024-02-12T10:09:39,395 53 [tserver.AssignmentHandler] WARN : exception 
trying to assign tablet +r<< null
   java.lang.RuntimeException: Error recovering tablet +r<< from log files
           at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:292) 
~[classes/:?]
           at 
org.apache.accumulo.tserver.AssignmentHandler.run(AssignmentHandler.java:140) 
~[classes/:?]
           at 
org.apache.accumulo.tserver.TabletClientHandler.lambda$loadTablet$8(TabletClientHandler.java:948)
 ~[classes/:?]
           at 
org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
 [classes/:?]
           at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
   Caused by: java.io.IOException: Unable to find recovery files for extent 
+r<< logEntry: 
file:/home/accumulo/git/accumulo/test/target/mini-tests/org.apache.accumulo.test.functional.WALSunnyDayIT_test/accumulo/wal/host1+46715/a04ee587-bf44-4667-b0e1-68b4460e369f
           at 
org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:903) 
~[classes/:?]
           at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:258) 
~[classes/:?]
           ... 4 more
   ```
   
   Was wondering if the GC had deleted the walog, took a look in the GC logs 
and did not see evidence of that.
   
   ```
   
accumulo@host1:~/git/accumulo/test/target/mini-tests/org.apache.accumulo.test.functional.WALSunnyDayIT_test/logs$
 grep GarbageCollectWriteAheadLogs *
   SimpleGarbageCollector_426288884.out:2024-02-12T10:09:21,906 31 
[gc.GarbageCollectWriteAheadLogs] INFO : Fetched 3 files for 1 servers in 0.01 
seconds
   SimpleGarbageCollector_426288884.out:2024-02-12T10:09:21,948 31 
[gc.GarbageCollectWriteAheadLogs] INFO : 1 log entries scanned in 0.04 seconds
   SimpleGarbageCollector_426288884.out:2024-02-12T10:09:21,949 31 
[gc.GarbageCollectWriteAheadLogs] DEBUG: Removing UNREFERENCED WAL 
file:/home/accumulo/git/accumulo/test/target/mini-tests/org.apache.accumulo.test.functional.WALSunnyDayIT_test/accumulo/wal/host1+46715/f9934e85-cb37-4f91-8a97-30fd194b4e81
   SimpleGarbageCollector_426288884.out:2024-02-12T10:09:21,952 31 
[gc.GarbageCollectWriteAheadLogs] INFO : 1 total logs removed from 1 servers in 
0.00 seconds
   SimpleGarbageCollector_426288884.out:2024-02-12T10:09:21,952 31 
[gc.GarbageCollectWriteAheadLogs] INFO : 0 recovery logs removed
   SimpleGarbageCollector_426288884.out:2024-02-12T10:09:21,970 31 
[gc.GarbageCollectWriteAheadLogs] INFO : 0 markers removed in 0.02 seconds
   ```
   
   Also looked and found the wals existed in the file system.
   
   ```
   
accumulo@host1:~/git/accumulo/test/target/mini-tests/org.apache.accumulo.test.functional.WALSunnyDayIT_test/accumulo/wal/host1+46715$
 ls -al
   total 164
   drwxr-xr-x 2 accumulo accumulo      4 Feb 12 10:09 .
   drwxr-xr-x 3 accumulo accumulo      3 Feb 12 10:09 ..
   -rwxr-xr-x 1 accumulo accumulo 136688 Feb 12 10:09 
a04ee587-bf44-4667-b0e1-68b4460e369f
   -rwxr-xr-x 1 accumulo accumulo     86 Feb 12 10:09 
bc70a463-1bb4-4659-b38f-25cf11b5bbf0
   ```
   
   So what was missing was the sorted version of the files, those had never 
been created.  For the manager to initiate log sort, the TabletGroupWatcher 
should see the TabletManagementIterator return 
`ManagementAction.NEEDS_VOLUME_REPLACEMENT` for the root tablet.  Looking in 
the manager logs, never saw this.  It was only returning 
`NEEDS_LOCATION_UPDATE`.
   
   ```
   2024-02-12T10:09:37,272 49 [manager.Manager] TRACE: [Root Table] Shutting 
down all Tservers: false, dependentCount: 0 Extent: +r<<, state: HOSTED, goal: 
HOSTED actions:[NEEDS_LOCATION_UPDATE]
   2024-02-12T10:09:38,260 102 [bulkVer2.TabletRefresher] DEBUG: Sleeping for 
1000ms before retrying operation : compaction:+r<< waiting for 1 tservers to 
refresh their tablets metadata 
   2024-02-12T10:09:39,178 49 [manager.Manager] TRACE: [Root Table] Shutting 
down all Tservers: false, dependentCount: 0 Extent: +r<<, state: 
ASSIGNED_TO_DEAD_SERVER, goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
   2024-02-12T10:09:39,195 49 [tablet.location] DEBUG: Unassigned +r<< with 1 
walogs
   2024-02-12T10:09:39,306 49 [manager.Manager] TRACE: [Root Table] Shutting 
down all Tservers: false, dependentCount: 0 Extent: +r<<, state: UNASSIGNED, 
goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
   2024-02-12T10:09:39,310 49 [tablet.location] DEBUG: Assigned +r<< to 
host1:41953[1000019a115001c]
   2024-02-12T10:09:39,414 49 [manager.Manager] TRACE: [Root Table] Shutting 
down all Tservers: false, dependentCount: 0 Extent: +r<<, state: ASSIGNED, 
goal: HOSTED actions:[NEEDS_LOCATION_UPDATE]
   ```
   
   Looked at the TabletManagementIterator source and noticed special code block 
for when manager state is not normal.  The test sets the manager state to 
SAFE_MODE, so this was uncovering the bug.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to