tomscut opened a new pull request, #6964:
URL: https://github.com/apache/hadoop/pull/6964

   Reverts apache/hadoop#4901
   
   As a result of this change, maintenance can get stuck in two ways:
   1. When blocks have to be moved to satisfy the storage policy.
   2. When more than 2 DataNodes holding blocks of the same EC block group are in the Entering Maintenance state and dfs.namenode.maintenance.ec.replication.min >= 2.
   
   
   Here's a more complex example. We recently performed maintenance on a batch of nodes, including host4 and host8.
   Configuration:
   ```
   dfs.namenode.maintenance.ec.replication.min=1
   storagePolicy=HDD
   ```
   
   hdfs fsck -fs -blockId blk_-9223372035217210640
   ```
   
   [blk_-9223372035217210640:DatanodeInfoWithStorage[host1:50010,DS-b9b2ea24-e69b-4a95-8a36-8b73b32003d3,DISK],
   blk_-9223372035217210639:DatanodeInfoWithStorage[host2:50010,DS-dfc9b308-a493-4d9b-b1c1-a134552f089f,SSD],
   blk_-9223372035217210638:DatanodeInfoWithStorage[host3:50010,DS-67669a8d-57d9-4825-8e1e-0e834d1fd47a,DISK],
   blk_-9223372035217210637:DatanodeInfoWithStorage[host4:50010,DS-6826ff2a-a6e5-4676-ad40-284099652670,DISK], Entering Maintenance
   blk_-9223372035217210636:DatanodeInfoWithStorage[host5:50010,DS-2e042fb1-dbc2-4ccf-ba43-da51a9ef2079,DISK],
   blk_-9223372035217210635:DatanodeInfoWithStorage[host6:50010,DS-005f2bce-eb46-432f-85b0-61919554692f,DISK],
   blk_-9223372035217210633:DatanodeInfoWithStorage[host7:50010,DS-cc11ce37-e121-4602-8688-ec7d45a0f276,DISK],
   blk_-9223372035217210632:DatanodeInfoWithStorage[host8:50010,DS-076891a0-4166-4584-9cea-13c853cbd667,DISK]] Entering Maintenance
   ```
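   To make the layout above easier to check, here is a small, illustrative Java sketch (not Hadoop code) that parses these fsck lines and summarizes the block group. It assumes the HDFS striped-ID convention that the low 4 bits of an internal block ID are its index within the group, and a group width of 9 (RS-6-3, inferred from indices 0 to 8). `FsckBlockGroupSummary` and its methods are hypothetical names.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   public class FsckBlockGroupSummary {

       /** One internal block location as printed in the fsck output. */
       record Location(int index, String host, String storageType, boolean enteringMaintenance) {}

       // Matches "blk_<id>:DatanodeInfoWithStorage[<host:port>,<storageId>,<type>]",
       // optionally followed by "Entering Maintenance" (the layout shown above).
       private static final Pattern ENTRY = Pattern.compile(
               "blk_(-?\\d+):DatanodeInfoWithStorage\\[([^,]+),[^,]+,(\\w+)\\]\\]?,?(\\s*Entering Maintenance)?");

       /** Assumption: the low 4 bits of a striped internal block ID encode its index in the group. */
       static int blockIndex(long internalBlockId) {
           return (int) (internalBlockId & 0xF);
       }

       static List<Location> parse(List<String> lines) {
           List<Location> locations = new ArrayList<>();
           for (String line : lines) {
               Matcher m = ENTRY.matcher(line);
               if (m.find()) {
                   locations.add(new Location(
                           blockIndex(Long.parseLong(m.group(1))),
                           m.group(2),
                           m.group(3),
                           m.group(4) != null));
               }
           }
           return locations;
       }

       /** Indices in [0, groupWidth) for which no location was reported. */
       static List<Integer> missingIndices(List<Location> locations, int groupWidth) {
           boolean[] present = new boolean[groupWidth];
           for (Location loc : locations) {
               if (loc.index() < groupWidth) {
                   present[loc.index()] = true;
               }
           }
           List<Integer> missing = new ArrayList<>();
           for (int i = 0; i < groupWidth; i++) {
               if (!present[i]) {
                   missing.add(i);
               }
           }
           return missing;
       }

       public static void main(String[] args) {
           List<String> fsck = List.of(
               "[blk_-9223372035217210640:DatanodeInfoWithStorage[host1:50010,DS-b9b2ea24-e69b-4a95-8a36-8b73b32003d3,DISK],",
               "blk_-9223372035217210639:DatanodeInfoWithStorage[host2:50010,DS-dfc9b308-a493-4d9b-b1c1-a134552f089f,SSD],",
               "blk_-9223372035217210638:DatanodeInfoWithStorage[host3:50010,DS-67669a8d-57d9-4825-8e1e-0e834d1fd47a,DISK],",
               "blk_-9223372035217210637:DatanodeInfoWithStorage[host4:50010,DS-6826ff2a-a6e5-4676-ad40-284099652670,DISK], Entering Maintenance",
               "blk_-9223372035217210636:DatanodeInfoWithStorage[host5:50010,DS-2e042fb1-dbc2-4ccf-ba43-da51a9ef2079,DISK],",
               "blk_-9223372035217210635:DatanodeInfoWithStorage[host6:50010,DS-005f2bce-eb46-432f-85b0-61919554692f,DISK],",
               "blk_-9223372035217210633:DatanodeInfoWithStorage[host7:50010,DS-cc11ce37-e121-4602-8688-ec7d45a0f276,DISK],",
               "blk_-9223372035217210632:DatanodeInfoWithStorage[host8:50010,DS-076891a0-4166-4584-9cea-13c853cbd667,DISK]] Entering Maintenance");
           List<Location> locs = parse(fsck);
           System.out.println("missing indices: " + missingIndices(locs, 9));
           System.out.println("entering maintenance: "
                   + locs.stream().filter(Location::enteringMaintenance).count());
           System.out.println("on SSD: "
                   + locs.stream().filter(l -> l.storageType().equals("SSD")).count());
       }
   }
   ```

   Run against the eight lines above, it reports index 6 as missing, two locations entering maintenance (host4 and host8), and one location on SSD (host2).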
   
   Datanode log:
   ```
   2024-07-25 12:46:42,680 INFO [Command processor] org.apache.hadoop.hdfs.server.datanode.DataNode: processErasureCodingTasks BlockECReconstructionInfo(
     Recovering BP-1956563710-x.x.x.x-1622796911268:blk_-9223372035217210640_105868369
     From: [host1:50010, host2:50010, host3:50010, host4:50010, host5:50010, host6:50010, host7:50010, host8:50010]
     To: [[host9:50010, host10:50010])
    Block Indices: [0, 1, 2, 3, 4, 5, 7, 8]
   2024-07-25 12:46:42,680 WARN [Command processor] org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block blk_-9223372035217210640_105868369
   java.lang.IllegalArgumentException: Reconstruction work gets too much targets.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.<init>(StripedWriter.java:86)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.<init>(StripedBlockReconstructor.java:47)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker.processErasureCodingTasks(ErasureCodingWorker.java:134)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:797)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1327)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1365)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1301)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1288)
   ```
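   A simplified model of why the DataNode rejects this command: the source indices are [0, 1, 2, 3, 4, 5, 7, 8], so with a group width of 9 (assumed RS-6-3) only index 6 is actually missing, yet two targets (host9 and host10) were assigned. The guard below is a hypothetical reconstruction based on the log message alone, not the actual `StripedWriter` check; `ReconstructionTargetCheck` is an illustrative name.

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class ReconstructionTargetCheck {

       /** Indices in [0, groupWidth) that are absent from the live source indices. */
       static List<Integer> indicesToReconstruct(int[] liveIndices, int groupWidth) {
           boolean[] live = new boolean[groupWidth];
           for (int index : liveIndices) {
               live[index] = true;
           }
           List<Integer> missing = new ArrayList<>();
           for (int i = 0; i < groupWidth; i++) {
               if (!live[i]) {
                   missing.add(i);
               }
           }
           return missing;
       }

       /**
        * Hypothetical guard: reject the task when more targets were assigned than
        * there are missing indices to rebuild. Modeled on the log message only.
        */
       static void checkTargets(int[] liveIndices, int groupWidth, int numTargets) {
           int missing = indicesToReconstruct(liveIndices, groupWidth).size();
           if (numTargets > missing) {
               throw new IllegalArgumentException("Reconstruction work gets too much targets.");
           }
       }

       public static void main(String[] args) {
           int[] liveIndices = {0, 1, 2, 3, 4, 5, 7, 8}; // "Block Indices" from the log
           int groupWidth = 9;                            // assumption: RS-6-3
           System.out.println("to reconstruct: " + indicesToReconstruct(liveIndices, groupWidth));
           try {
               checkTargets(liveIndices, groupWidth, 2); // host9 and host10
           } catch (IllegalArgumentException e) {
               System.out.println("rejected: " + e.getMessage());
           }
       }
   }
   ```

   Under these assumptions it prints `to reconstruct: [6]` followed by the rejection message, mirroring the WARN line above.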
   
   In this block group, one internal block (blk_-9223372035217210639) was written to SSD.
   
   During maintenance, two blocks need to be added: one to migrate the block on SSD to HDD (to satisfy the storage policy), and another to ensure that at least 7 blocks remain available during maintenance.
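   The target count can be reproduced with back-of-the-envelope arithmetic. This sketch assumes an RS-6-3 policy (6 data units) and that the maintenance floor is the number of data units plus `dfs.namenode.maintenance.ec.replication.min`, which gives the 7 mentioned above. `MaintenanceTargetMath` is a hypothetical helper, not NameNode code.

   ```java
   public class MaintenanceTargetMath {

       /** Assumed maintenance floor for a striped group: data units + configured minimum. */
       static int requiredDuringMaintenance(int dataUnits, int maintenanceEcReplicationMin) {
           return dataUnits + maintenanceEcReplicationMin;
       }

       /**
        * Extra targets implied by the description: enough new blocks to reach the
        * maintenance floor, plus one per block that violates the storage policy.
        */
       static int extraTargets(int liveBlocks, int enteringMaintenance, int dataUnits,
                               int maintenanceEcReplicationMin, int blocksViolatingPolicy) {
           int available = liveBlocks - enteringMaintenance;
           int required = requiredDuringMaintenance(dataUnits, maintenanceEcReplicationMin);
           int forMaintenance = Math.max(0, required - available);
           return forMaintenance + blocksViolatingPolicy;
       }

       public static void main(String[] args) {
           // 8 live locations, 2 on hosts entering maintenance (host4, host8),
           // RS-6-3 assumed, min = 1, one block on SSD under an HDD policy.
           System.out.println(extraTargets(8, 2, 6, 1, 1));
       }
   }
   ```

   With 8 live blocks and 2 entering maintenance, 6 remain; reaching 7 needs 1 more, and the SSD block needs 1 migration, giving the 2 targets (host9 and host10) seen in the log.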
   
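   As a result, reconstruction is scheduled with two targets (host9 and host10), the DataNode rejects it with `IllegalArgumentException: Reconstruction work gets too much targets.`, and the maintenance process gets stuck.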


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

