[ 
https://issues.apache.org/jira/browse/HDFS-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433995#comment-15433995
 ] 

Manoj Govindassamy commented on HDFS-10780:
-------------------------------------------


*Summary:* 
-- I can see an issue in the current code whereby the NameNode can permanently 
miss replicating a block to a DataNode. 
-- If the write pipeline doesn't have enough DataNodes to match the expected 
replication factor, then NameNode schedules replication of the block moments 
after the block has been COMMITTED or COMPLETED.  But, if the block COMMIT at 
NameNode arrives/happens after all the {{BlockManager::addStoredBlock()}} calls 
(which happen when processing block reports from DataNodes in the current write 
pipeline), then NameNode entirely misses replicating the block because of the 
way it manages {{LowRedundancyBlocks}} and {{PendingReconstructionBlocks}}.

*Details:*
Here are the events and their rough timeline order I noticed in one of the 
failures.
1. Say the setup has 3 DataNodes ( DN(1..3) ) and the replication factor is 
configured to be 3
2. Client is writing a block (BLK_xyz_GS1) with the constructed pipeline DN1 —> 
DN2 —> DN3
3. Say DN1 encounters an issue (like a downstream communication problem, 
interrupts, or a storage volume issue) before the block is FINALIZED
4. Client detects the pipeline issue and gets a new write pipeline DN2 —> DN3 
5. Generation stamp for the current block is bumped up (BLK_xyz_GS2)
6. Client writes (BLK_xyz_GS2) to DN2 and DN3
7. NameNode (NN) is getting Incremental Block Reports (IBRs) from DN1, DN2 and 
DN3
8. NN sees a mix of (BLK_xyz_GS1) and (BLK_xyz_GS2) in the IBRs from DN1, DN2 
and DN3
9. NN marks (BLK_xyz_GS1) as corrupt and puts those DNs in the invalidation 
list to remove that particular block replica

10. After seeing the first BLK_xyz_GS2 from one of the DNs (say DN2)
— — NN adds BLK_xyz_GS2 to the respective {{DatanodeStorageInfo}} (Refer: 
{{BlockManager::addStoredBlock}})
— — Since BLK_xyz_GS2 is not COMMITTED by the Client yet, NN cannot move 
BLK_xyz_GS2 to COMPLETE state
— — NN does not update the {{LowRedundancyBlocks}} neededReconstruction, as 
that is done only after BLK_xyz_GS2 is COMMITTED by the Client
11. NN sees BLK_xyz_GS2 from the other DN (DN3)
— — NN adds BLK_xyz_GS2
— — At NN, BLK_xyz_GS2 is still in UNDER_CONSTRUCTION state as the Client has 
not COMMITTED yet. So, neededReconstruction is still not updated.


12. At this point, for BLK_xyz_GS2 the live replica count is 2 and the expected 
replica count is 3.
— — For the {{ReplicationMonitor}} to pick up replication work, the intention 
must be available in {{LowRedundancyBlocks}} neededReconstruction
— — Since no event has triggered the addition of that intention yet, no block 
replication work has been scheduled yet for the missing DN1
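
The gating behavior in steps 10–12 can be sketched as below. This is a 
simplified model with hypothetical names, not the actual {{BlockManager}} code: 
a stored replica of an uncommitted block never enqueues the block into 
neededReconstruction, so the {{ReplicationMonitor}} has nothing to pick up.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class Main {
  enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

  // Stand-in for LowRedundancyBlocks#neededReconstruction (simplified).
  static Queue<String> neededReconstruction = new ArrayDeque<>();

  // Models the relevant part of BlockManager#addStoredBlock: for a block
  // that is not yet COMMITTED/COMPLETE, the replica is recorded but the
  // block is NOT queued for reconstruction.
  static void addStoredBlock(String blk, BlockState state,
                             int live, int expected) {
    if (state == BlockState.UNDER_CONSTRUCTION) {
      return; // no neededReconstruction update for UC blocks
    }
    if (live < expected) {
      neededReconstruction.add(blk);
    }
  }

  public static void main(String[] args) {
    // IBRs for BLK_xyz_GS2 arrive from DN2 and DN3 while the client
    // has not yet committed the block.
    addStoredBlock("BLK_xyz_GS2", BlockState.UNDER_CONSTRUCTION, 1, 3);
    addStoredBlock("BLK_xyz_GS2", BlockState.UNDER_CONSTRUCTION, 2, 3);
    // ReplicationMonitor's queue is empty even though live(2) < expected(3).
    System.out.println("neededReconstruction empty: "
        + neededReconstruction.isEmpty());
  }
}
```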

13. Client closes the file
— — NN moves the block to COMMITTED state
— — Since there are already 2 live copies of the block, NN moves the block to 
COMPLETED state
— — But, before moving the block to COMPLETED state, NN does something like 
below 

{{BlockManager.java}}
{noformat}
public boolean commitOrCompleteLastBlock(BlockCollection bc,
    Block commitBlock) throws IOException {
  ...
  final boolean committed = commitBlock(lastBlock, commitBlock);
  if (committed && lastBlock.isStriped()) {
  ...
  if (hasMinStorage(lastBlock)) {
    if (committed) {
      // XXX: Is adding to the {{PendingReconstructionBlocks}} list without
      //      adding to the {{LowRedundancyBlocks}} list right?
      //      This fakes a block replication task in progress without even
      //      any {{BlockReconstructionWork}} scheduled.
      addExpectedReplicasToPending(lastBlock);
    }
    completeBlock(lastBlock, false);
  }
  ...
}
{noformat}

— — Since the block is added to the {{pendingReconstruction}} list without any 
actual reconstruction work, the follow-on {{checkRedundancy}} mistakenly 
believes that there is sufficient redundancy for the block and does not trigger 
any further block reconstruction work.

{noformat}
         839 2016-08-23 10:57:21,450 [IPC Server handler 4 on 57980] INFO  
blockmanagement.BlockManager (BlockManager.java:checkRedundancy(4046)) - 
CheckRedun Block: blk_1073741825_1002, exp: 3, replica: *LIVE=2*, READONLY=0, 
DECOMMISSIONING=0, DECOMMISSIONED=0, CORRUPT=0, EXCESS=0, STALESTORAGE=     0, 
REDUNDANT=0, *pending: 1*
        
         840 2016-08-23 10:57:21,451 [IPC Server handler 4 on 57980] INFO  
blockmanagement.BlockManager 
(BlockManager.java:hasEnoughEffectiveReplicas(1685)) - Blk: 
blk_1073741825_1002, numEffectiveReplicas: 3, required: 3, pendingReplicaNum: 
1, placementPolicy: true

{noformat}
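
The effective-replica arithmetic in the log above can be sketched as follows 
(a simplified model of the {{hasEnoughEffectiveReplicas}} check; the helper 
and variable names here are illustrative, not the actual code):

```java
public class Main {
  // Simplified model: effective replicas = live + pending, compared
  // against the required replication factor.
  static boolean hasEnoughEffectiveReplicas(int live, int pending,
                                            int required) {
    return live + pending >= required;
  }

  public static void main(String[] args) {
    int live = 2;     // LIVE=2 from the checkRedundancy log line
    int pending = 1;  // the phantom entry from addExpectedReplicasToPending
    int required = 3; // configured replication factor
    // 2 + 1 >= 3, so no reconstruction work is scheduled, even though
    // no DataNode will ever actually report the pending replica.
    System.out.println("numEffectiveReplicas: " + (live + pending)
        + ", enough: " + hasEnoughEffectiveReplicas(live, pending, required));
  }
}
```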

— — {{ReplicationMonitor}}, which runs once every 3 seconds, is also unable to 
help here, as the {{LowRedundancyBlocks}} neededReconstruction list is empty. 

— — So the test fails with the following signature 

{noformat}
Running org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 44.057 sec <<< 
FAILURE! - in org.apache.hadoop.hdfs.serv
testRemoveVolumeBeingWritten(org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes)
  Time elapsed: 43.927 se
java.util.concurrent.TimeoutException: Timed out waiting for /test to reach 3 
replicas
Results :
Tests in error: 
  
TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten:637->testRemoveVolumeBeingWrittenForDatanode:714
 » Timeout
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
[INFO] Apache Hadoop HDFS Client .......................... SUCCESS [  2.345 s]
[INFO] Apache Hadoop HDFS ................................. FAILURE [ 45.878 s]
[INFO] BUILD FAILURE
[INFO] Total time: 48.995 s
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on 
project hadoop
{noformat}



> Block replication not happening on removing a volume when data being written 
> to a datanode -- TestDataNodeHotSwapVolumes fails
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10780
>                 URL: https://issues.apache.org/jira/browse/HDFS-10780
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Manoj Govindassamy
>            Assignee: Manoj Govindassamy
>
> TestDataNodeHotSwapVolumes occasionally fails in the unit test 
> testRemoveVolumeBeingWrittenForDatanode.  Data write pipeline can have issues 
> as there could be timeouts, data node not reachable etc, and in this test 
> case it was more of induced one as one of the volumes in a datanode is 
> removed while block write is in progress. Digging further in the logs, when 
> the problem happens in the write pipeline, the error recovery is not 
> happening as expected leading to block replication never catching up.
> Running org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 44.495 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.serv
> testRemoveVolumeBeingWritten(org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes)
>   Time elapsed: 44.354 se
> java.util.concurrent.TimeoutException: Timed out waiting for /test to reach 3 
> replicas
> Results :
> Tests in error: 
>   
> TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten:637->testRemoveVolumeBeingWrittenForDatanode:714
>  » Timeout
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
> Following exceptions are not expected in this test run
> {noformat}
>  614 2016-08-10 12:30:11,269 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_-640082112_10 at /127.0.0.1:58805 [Receiving block 
> BP-1852988604-172.16.3.66-1470857409044:blk_1073741825_1001]] DEBUG 
> datanode.Da     taNode (DataXceiver.java:run(320)) - 127.0.0.1:58789:Number 
> of active connections is: 2
>  615 java.lang.IllegalMonitorStateException
>  616         at java.lang.Object.wait(Native Method)
>  617         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.waitVolumeRemoved(FsVolumeList.java:280)
>  618         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.removeVolumes(FsDatasetImpl.java:517)
>  619         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.removeVolumes(DataNode.java:832)
>  620         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.removeVolumes(DataNode.java:798)
> {noformat}
> {noformat}
>  720 2016-08-10 12:30:11,287 [DataNode: 
> [[[DISK]file:/Users/manoj/work/ups-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/,
>  [DISK]file:/Users/manoj/work/ups-hadoop/hadoop-hdfs-projec     
> t/hadoop-hdfs/target/test/data/dfs/data/data2/]]  heartbeating to 
> localhost/127.0.0.1:58788] ERROR datanode.DataNode 
> (BPServiceActor.java:run(768)) - Exception in BPOfferService for Block pool 
> BP-18529     88604-172.16.3.66-1470857409044 (Datanode Uuid 
> 711d58ad-919d-4350-af1e-99fa0b061244) service to localhost/127.0.0.1:58788
>  721 java.lang.NullPointerException
>  722         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1841)
>  723         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:336)
>  724         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:624)
>  725         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:766)
>  726         at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
