tomscut opened a new pull request, #6964:
URL: https://github.com/apache/hadoop/pull/6964
Reverts apache/hadoop#4901
As a result of the change in #4901, maintenance can get stuck in two cases:
1. An extra target is requested in order to satisfy the storage policy.
2. An EC block group has more than 2 DNs in the Entering Maintenance state
and dfs.namenode.maintenance.ec.replication.min >= 2.
Here's a more complex example. We recently performed maintenance on a batch of
nodes, including host4 and host8.
Configuration:
```
dfs.namenode.maintenance.ec.replication.min=1
storagePolicy=HDD
```
hdfs fsck -fs -blockId blk_-9223372035217210640
```
[blk_-9223372035217210640:DatanodeInfoWithStorage[host1:50010,DS-b9b2ea24-e69b-4a95-8a36-8b73b32003d3,DISK],
blk_-9223372035217210639:DatanodeInfoWithStorage[host2:50010,DS-dfc9b308-a493-4d9b-b1c1-a134552f089f,SSD],
blk_-9223372035217210638:DatanodeInfoWithStorage[host3:50010,DS-67669a8d-57d9-4825-8e1e-0e834d1fd47a,DISK],
blk_-9223372035217210637:DatanodeInfoWithStorage[host4:50010,DS-6826ff2a-a6e5-4676-ad40-284099652670,DISK],
Entering Maintenance
blk_-9223372035217210636:DatanodeInfoWithStorage[host5:50010,DS-2e042fb1-dbc2-4ccf-ba43-da51a9ef2079,DISK],
blk_-9223372035217210635:DatanodeInfoWithStorage[host6:50010,DS-005f2bce-eb46-432f-85b0-61919554692f,DISK],
blk_-9223372035217210633:DatanodeInfoWithStorage[host7:50010,DS-cc11ce37-e121-4602-8688-ec7d45a0f276,DISK],
blk_-9223372035217210632:DatanodeInfoWithStorage[host8:50010,DS-076891a0-4166-4584-9cea-13c853cbd667,DISK]]
Entering Maintenance
```
Datanode log:
```
2024-07-25 12:46:42,680 INFO [Command processor]
org.apache.hadoop.hdfs.server.datanode.DataNode: processErasureCodingTasks
BlockECReconstructionInfo(
Recovering
BP-1956563710-x.x.x.x-1622796911268:blk_-9223372035217210640_105868369
From: [host1:50010, host2:50010, host3:50010, host4:50010, host5:50010,
host6:50010, host7:50010, host8:50010]
To: [[host9:50010, host10:50010])
Block Indices: [0, 1, 2, 3, 4, 5, 7, 8]
2024-07-25 12:46:42,680 WARN [Command processor]
org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped
block blk_-9223372035217210640_105868369
java.lang.IllegalArgumentException: Reconstruction work gets too much
targets.
at
com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
at
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.<init>(StripedWriter.java:86)
at
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.<init>(StripedBlockReconstructor.java:47)
at
org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker.processErasureCodingTasks(ErasureCodingWorker.java:134)
at
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:797)
at
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1327)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1365)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1301)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1288)
```
In this block group, one internal block (blk_-9223372035217210639) was written
to SSD.
During maintenance, two targets are therefore requested: one to migrate the
SSD block to HDD (to satisfy the storage policy), and one to ensure at least
7 live blocks during maintenance.
The reconstruction task then fails with the exception shown in the datanode
log, and the maintenance process gets stuck.
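
The count of two targets can be reproduced with a simplified tally. The sketch below is not the NameNode's actual scheduling logic; the class `MaintenanceTargetSketch`, its `InternalBlock` type, and its helper methods are hypothetical names introduced only for illustration. It also assumes the block group has 6 data blocks, which is inferred from "at least 7 blocks" with dfs.namenode.maintenance.ec.replication.min=1 and is not stated explicitly above.

```java
import java.util.ArrayList;
import java.util.List;

public class MaintenanceTargetSketch {

    /** Hypothetical view of one internal block of the EC block group. */
    static final class InternalBlock {
        final String host;
        final String storageType;
        final boolean enteringMaintenance;

        InternalBlock(String host, String storageType, boolean enteringMaintenance) {
            this.host = host;
            this.storageType = storageType;
            this.enteringMaintenance = enteringMaintenance;
        }
    }

    /** Extra blocks needed so that live blocks >= dataBlocks + maintenanceMin. */
    static int targetsForMaintenance(List<InternalBlock> group, int dataBlocks, int maintenanceMin) {
        int live = 0;
        for (InternalBlock b : group) {
            if (!b.enteringMaintenance) {
                live++;
            }
        }
        return Math.max(0, dataBlocks + maintenanceMin - live);
    }

    /** Live blocks stored on a type other than the one the storage policy wants. */
    static int targetsForStoragePolicy(List<InternalBlock> group, String wantedType) {
        int misplaced = 0;
        for (InternalBlock b : group) {
            if (!b.enteringMaintenance && !b.storageType.equals(wantedType)) {
                misplaced++;
            }
        }
        return misplaced;
    }

    /** The eight internal blocks from the fsck output above. */
    static List<InternalBlock> exampleGroup() {
        List<InternalBlock> group = new ArrayList<>();
        group.add(new InternalBlock("host1", "DISK", false));
        group.add(new InternalBlock("host2", "SSD", false));
        group.add(new InternalBlock("host3", "DISK", false));
        group.add(new InternalBlock("host4", "DISK", true));  // Entering Maintenance
        group.add(new InternalBlock("host5", "DISK", false));
        group.add(new InternalBlock("host6", "DISK", false));
        group.add(new InternalBlock("host7", "DISK", false));
        group.add(new InternalBlock("host8", "DISK", true));  // Entering Maintenance
        return group;
    }

    public static void main(String[] args) {
        List<InternalBlock> group = exampleGroup();
        // Assumption: 6 data blocks, maintenance minimum of 1 (see the configuration above).
        int forMaintenance = targetsForMaintenance(group, 6, 1);
        int forPolicy = targetsForStoragePolicy(group, "DISK");
        System.out.println("maintenance targets: " + forMaintenance);
        System.out.println("storage policy targets: " + forPolicy);
        System.out.println("total targets: " + (forMaintenance + forPolicy));
    }
}
```

Under these assumptions the tally gives one target for maintenance plus one for the storage policy, which is consistent with the two targets (host9 and host10) in the datanode log.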