[jira] [Commented] (HDFS-10780) Block replication not happening on removing a volume when data being written to a datanode -- TestDataNodeHotSwapVolumes fails

Manoj Govindassamy (JIRA) Tue, 23 Aug 2016 00:37:49 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432306#comment-15432306
 ]


Manoj Govindassamy commented on HDFS-10780:
-------------------------------------------

[~shahrs87], I do see HDFS-9781 (an NPE during a Full Block Report, especially 
after a volume removal) quite frequently in my TestDataNodeHotSwapVolumes 
runs. For these tests, however, Incremental Block Reports from the DataNodes 
are sufficient, and they work as expected. Block Report generation happens 
inside a try/catch block that ignores any exceptions it encounters. Thanks for 
pointing me to the other JIRA; I will follow up on that as well.


> Block replication not happening on removing a volume when data being written 
> to a datanode -- TestDataNodeHotSwapVolumes fails
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10780
>                 URL: https://issues.apache.org/jira/browse/HDFS-10780
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Manoj Govindassamy
>            Assignee: Manoj Govindassamy
>
> TestDataNodeHotSwapVolumes occasionally fails in the unit test 
> testRemoveVolumeBeingWrittenForDatanode. The data write pipeline can run into 
> problems such as timeouts or an unreachable datanode; in this test case the 
> problem is induced deliberately, because one of the volumes in a datanode is 
> removed while a block write is in progress. Digging further into the logs, 
> when the problem occurs in the write pipeline, error recovery does not 
> happen as expected, so block replication never catches up.
> Running org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 44.495 sec <<< FAILURE! - in org.apache.hadoop.hdfs.serv
> testRemoveVolumeBeingWritten(org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes)  Time elapsed: 44.354 se
> java.util.concurrent.TimeoutException: Timed out waiting for /test to reach 3 replicas
> Results :
> Tests in error: 
>   TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten:637->testRemoveVolumeBeingWrittenForDatanode:714 » Timeout
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
> The following exceptions are not expected in this test run:
> {noformat}
> 2016-08-10 12:30:11,269 [DataXceiver for client DFSClient_NONMAPREDUCE_-640082112_10 at /127.0.0.1:58805 [Receiving block BP-1852988604-172.16.3.66-1470857409044:blk_1073741825_1001]] DEBUG datanode.DataNode (DataXceiver.java:run(320)) - 127.0.0.1:58789:Number of active connections is: 2
> java.lang.IllegalMonitorStateException
>         at java.lang.Object.wait(Native Method)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.waitVolumeRemoved(FsVolumeList.java:280)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.removeVolumes(FsDatasetImpl.java:517)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.removeVolumes(DataNode.java:832)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.removeVolumes(DataNode.java:798)
> {noformat}
> {noformat}
> 2016-08-10 12:30:11,287 [DataNode: [[[DISK]file:/Users/manoj/work/ups-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/, [DISK]file:/Users/manoj/work/ups-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data2/]]  heartbeating to localhost/127.0.0.1:58788] ERROR datanode.DataNode (BPServiceActor.java:run(768)) - Exception in BPOfferService for Block pool BP-1852988604-172.16.3.66-1470857409044 (Datanode Uuid 711d58ad-919d-4350-af1e-99fa0b061244) service to localhost/127.0.0.1:58788
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1841)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:336)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:624)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:766)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
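
For context on the first trace quoted above: in Java, Object.wait() throws 
IllegalMonitorStateException when the calling thread does not own the 
monitor of the object it waits on. The sketch below reproduces only that 
general JVM behavior. The class, field, and method names are hypothetical, 
and this is not the FsVolumeList code, whose locking is not shown in this 
report.

```java
public class MonitorWaitSketch {
    // Hypothetical lock object; it stands in for whatever monitor real code waits on.
    private static final Object MONITOR = new Object();

    /** Calls wait() without owning the monitor; returns true if IllegalMonitorStateException is thrown. */
    public static boolean waitWithoutLock() {
        try {
            MONITOR.wait(10);
            return false;
        } catch (IllegalMonitorStateException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Calls wait() while holding the monitor; the timed wait returns normally. */
    public static boolean waitWithLock() {
        synchronized (MONITOR) {
            try {
                MONITOR.wait(10);
                return false;
            } catch (IllegalMonitorStateException e) {
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("without lock threw: " + waitWithoutLock());
        System.out.println("with lock threw: " + waitWithLock());
    }
}
```

The usual remedy is to call wait() and notify() only inside a synchronized 
block on the same object. Whether that applies to the trace above depends on 
code that is not quoted here.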



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

