[ https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephen O'Donnell reopened HDFS-16583:
--------------------------------------

Reopening to add a branch-3.2 PR.

DatanodeAdminDefaultMonitor can get stuck in an infinite loop
-------------------------------------------------------------

                Key: HDFS-16583
                URL: https://issues.apache.org/jira/browse/HDFS-16583
            Project: Hadoop HDFS
         Issue Type: Bug
           Reporter: Stephen O'Donnell
           Assignee: Stephen O'Donnell
           Priority: Major
             Labels: pull-request-available
            Fix For: 3.4.0, 3.3.4
         Time Spent: 1h 20m
 Remaining Estimate: 0h

We encountered a case where the decommission monitor in the namenode got stuck for about 6 hours. The logs show:

{code}
2022-05-15 01:09:25,490 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.132:50010
2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
<Thread dump>
2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753665_3428271426
2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753659_3428271420
2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753662_3428271423
2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753663_3428271424
2022-05-15 06:00:57,281 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.34:50010
2022-05-15 06:00:58,105 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 17492614 ms via
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
Number of suppressed write-lock reports: 0
Longest write-lock held interval: 17492614
{code}

We only have the one thread dump, triggered by the FC:

{code}
Thread 80 (DatanodeAdminMonitor-0):
  State: RUNNABLE
  Blocked count: 16
  Waited count: 453693
  Stack:
    org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
    org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:748)
{code}

The stack points at this line of code (marked below):

{code}
  private void check() {
    final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
        it = new CyclicIteration<>(outOfServiceNodeBlocks,
            iterkey).iterator();
    final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();

    while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
        .isRunning()) {
      numNodesChecked++;
      final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
          entry = it.next();
      final DatanodeDescriptor dn = entry.getKey();
      AbstractList<BlockInfo> blocks = entry.getValue();
      boolean fullScan = false;
      if (dn.isMaintenance() && dn.maintenanceExpired()) {
        // If maintenance expires, stop tracking it.
        stopMaintenance(dn);
        toRemove.add(dn);
        continue;
      }
      if (dn.isInMaintenance()) {
        // The dn is IN_MAINTENANCE and the maintenance hasn't expired yet.
        continue; //// >>> This line
      }
{code}

With only one sample we cannot say for sure that it is stuck in an infinite loop, but I suspect it is.

The problem is twofold:

1) When we call stopMaintenance(dn), which we must have done since it logged "Stopping maintenance of dead node", the code looks like:

{code}
      if (dn.isMaintenance() && dn.maintenanceExpired()) {
        // If maintenance expires, stop tracking it.
        stopMaintenance(dn);
        toRemove.add(dn);
        continue;
      }
{code}

Notice that we add the DN to the toRemove list; this is to avoid concurrent modifications of the outOfServiceNodeBlocks map. However, inside stopMaintenance we see:

{code}
    pendingNodes.remove(node);
    outOfServiceNodeBlocks.remove(node);
{code}

This should cause a ConcurrentModificationException, which I think would leave the decommission monitor broken until the NN is restarted (again, I am not sure).
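To make point 1 concrete, here is a standalone sketch (plain JDK collections, hypothetical class name, not HDFS code) of why removing a key from a TreeMap behind a live iterator's back normally fails fast:

{code}
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Minimal illustration of TreeMap's fail-fast iterator behaviour.
public class TreeMapFailFastDemo {
  public static void main(String[] args) {
    TreeMap<Integer, String> map = new TreeMap<>();
    map.put(1, "a");
    map.put(2, "b");
    map.put(3, "c");

    Iterator<Map.Entry<Integer, String>> it = map.entrySet().iterator();
    it.next();      // returns 1=a
    map.remove(2);  // structural modification while the iterator is live
    it.next();      // throws ConcurrentModificationException
  }
}
{code}

Point 2 below explains why the cyclic iterator used by the monitor can dodge this fail-fast check and loop forever instead.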
2) In some cases a ConcurrentModificationException is not thrown, because the iterator comes from a CyclicIteration. If the DN removed from maintenance is the one just after the iterator start key, and it is the second-to-last element in the navigable map, I believe the cyclic iteration will go into an infinite loop. No ConcurrentModificationException is thrown: the underlying TreeMap iterator has no next element, so CyclicIteration creates a new TreeMap iterator, and the "stop key" for the cyclic iteration is never found in the map as it is now gone, so it just cycles forever. I think that is what happened here.

This test in TestCyclicIteration proves the infinite loop is possible:

{code}
  @Test
  public void testInfiniteLoop() {
    final NavigableMap<Integer, Integer> map = new TreeMap<Integer, Integer>();
    map.put(0, 0);
    map.put(2, 2);
    map.put(4, 4);
    map.put(6, 6);
    System.out.println("map=" + map);

    // Create the iterator to start at 2, which means the first element
    // returned should be 4.
    // Inside the TreeMap iterator, next will already be set to 6
    // after 4 is returned.
    final Iterator<Map.Entry<Integer, Integer>> it =
        new CyclicIteration<>(map, 2).iterator();
    int i = 0;
    while (it.hasNext()) {
      Map.Entry<Integer, Integer> e = it.next();
      if (i == 0) {
        map.remove(4);
      }
      System.out.println("Iteration #" + i++);
    }
  }
{code}

The root cause of the decommission monitor problem is the concurrent modification of outOfServiceNodeBlocks, so we should avoid that.
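As a sketch of the deferred-removal pattern that would avoid this (plain JDK collections, hypothetical class name; the actual patch for this Jira may differ), collect the keys to drop while iterating and only modify the map once no iterator over it is in use:

{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal illustration of deferring map removals until iteration is finished.
public class DeferredRemovalDemo {
  public static void main(String[] args) {
    TreeMap<Integer, String> outOfService = new TreeMap<>();
    outOfService.put(0, "dn-0");
    outOfService.put(2, "dn-2");
    outOfService.put(4, "dn-4");

    // Record what should be dropped, but do not touch the map while
    // an iterator over it is still live.
    List<Integer> toRemove = new ArrayList<>();
    Iterator<Map.Entry<Integer, String>> it = outOfService.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<Integer, String> entry = it.next();
      if (entry.getKey() == 2) {       // pretend this node left maintenance
        toRemove.add(entry.getKey());  // defer the removal
      }
    }

    // Safe: no iterator is active when the map is modified.
    for (Integer key : toRemove) {
      outOfService.remove(key);
    }
    System.out.println("remaining=" + outOfService);  // remaining={0=dn-0, 4=dn-4}
  }
}
{code}

In DatanodeAdminManager terms, this is what the existing toRemove list appears intended for; the bug is that stopMaintenance() also removes the node from outOfServiceNodeBlocks directly while the CyclicIteration is still in flight.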
--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org