[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile
[ https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976233#comment-14976233 ]

Yi Liu commented on HDFS-9275:
------------------------------

Walter, sorry I forgot this JIRA :-)

For a contiguous block, if n replicas are missing (with 3 replicas in total, at most 2 can be missing, so n < 3), we check the number of replicas in PendingReplicationBlocks to decide whether we need to schedule new block replication.

For reconstruction of a striped block, ideally we should follow the same approach: any missing striped internal block needs to be reconstructed only once, so we should check whether there is already 1 entry for it in PendingReplicationBlocks. But currently we track the block group in the list, so we end up comparing the total number of missing striped internal blocks with the number in PendingReplicationBlocks. If there are more than two missing striped internal blocks and one is reconstructed first, there may be some unnecessary reconstruction.

I think we can make a simple improvement for striped blocks: if there is an entry in PendingReplicationBlocks, we don't schedule new reconstruction work, instead of comparing against the number of missing striped internal blocks.

> Fix TestRecoverStripedFile
> --------------------------
>
>                 Key: HDFS-9275
>                 URL: https://issues.apache.org/jira/browse/HDFS-9275
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Walter Su
>            Assignee: Walter Su
>         Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch, HDFS-9275.03.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
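The two checks compared in the comment above can be sketched roughly as follows. This is a minimal illustration only, not the actual BlockManager or PendingReplicationBlocks code: the class name, the method names, and the simple id-to-count map are all hypothetical. It models a block group with 3 missing internal blocks where a first task is already rebuilding 2 of them.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified, hypothetical model of the scheduling check discussed above.
 * It is not the real BlockManager or PendingReplicationBlocks code.
 */
class PendingReconstructionSketch {
  /** Block (or block group) id -> number of replicas/internal blocks with pending work. */
  private final Map<Long, Integer> pending = new HashMap<>();

  /** Records that a task with the given number of targets was scheduled. */
  void addPending(long id, int numTargets) {
    pending.merge(id, numTargets, Integer::sum);
  }

  int numPending(long id) {
    return pending.getOrDefault(id, 0);
  }

  /** Count-based check: schedule only if pending work does not cover what is missing. */
  boolean shouldScheduleByCount(long id, int missing) {
    return missing > numPending(id);
  }

  /** Proposed check for striped block groups: any pending entry suppresses new work. */
  boolean shouldScheduleStriped(long groupId, int missingInternalBlocks) {
    return missingInternalBlocks > 0 && numPending(groupId) == 0;
  }

  public static void main(String[] args) {
    PendingReconstructionSketch s = new PendingReconstructionSketch();
    long group = 1L;
    s.addPending(group, 2); // a first task is rebuilding 2 of the 3 missing internal blocks
    System.out.println("count-based check schedules again: " + s.shouldScheduleByCount(group, 3));
    System.out.println("proposed check schedules again:    " + s.shouldScheduleStriped(group, 3));
  }
}
```

Under these assumptions the count-based check would schedule a second task while the first is still in flight, whereas the proposed check would wait for the pending entry to clear. The trade-off is that the remaining internal block is only rebuilt in a later pass.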
[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile
[ https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976289#comment-14976289 ]

Kai Zheng commented on HDFS-9275:
---------------------------------

Thanks Walter for filing the JIRA and working on this. Could you add a description of the exact issue, or its cause, that this JIRA is trying to address? It would help others understand the issue quickly. Thanks.
[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile
[ https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968820#comment-14968820 ]

Walter Su commented on HDFS-9275:
---------------------------------

Thanks [~hitliuyi] for the review.

bq. please add some description in the JIRA description part about what issue in the Test you want to fix? It's better to have Jenkins failure link if there is.

https://builds.apache.org/job/PreCommit-HDFS-Build/13110/testReport/org.apache.hadoop.hdfs/TestRecoverStripedFile/testRecoverAnyBlocks1/

Unfortunately, the logs are of no use. We know that some block didn't get recovered, but the logs don't tell us why because they are truncated. It took me a while to reproduce the failure locally. I can upload the logs (488 KB) if you want.

BTW, I'd appreciate it if someone could tell me where I can download the full log of the failed Jenkins run.
[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile
[ https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968797#comment-14968797 ]

Yi Liu commented on HDFS-9275:
------------------------------

Walter, please add a description to the JIRA of the issue in the test that you want to fix. It would be better to include a Jenkins failure link if there is one.

I see you replaced {{setDataNodesDead}} and {{readReplica}} with existing methods from other test utilities; that's good.

In the tests, I originally restarted the datanodes at the end of each test because I intended to share the same mini cluster to make the tests faster, but I forgot to use {{@BeforeClass}} instead of {{@Before}}, so the setup runs again for each test. I am OK with just removing the restart of the shut-down DNs at the end.
[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile
[ https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970093#comment-14970093 ]

Yi Liu commented on HDFS-9275:
------------------------------

Thanks Walter for the update. I'm OOO today, so I will review it tomorrow if nobody else has reviewed it by then.
[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile
[ https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969341#comment-14969341 ]

Walter Su commented on HDFS-9275:
---------------------------------

I kept digging, and now I understand the whole sequence:
# While the client is writing blockGroup_0, DN1 sends a heartbeat with xceiverCount=3.
# The client finishes writing blockGroup_0 and blockGroup_1.
# DN8~10 are shut down, so idx_6~8 of blockGroup_1 are missing.
# ReplicationMonitor schedules the 1st recovery for blockGroup_1. Because DN1 is busy (see previous comments), BlockPlacementPolicy chooses DN0 and DN11 as targets.
# ErasureCodingWorker recovers idx_6 at DN0 and idx_7 at DN11. (See getTargetIndices() and you'll know why.)
# Before idx_6 and idx_7 are reported, ReplicationMonitor schedules the 2nd recovery for blockGroup_1. It chooses DN0 as the target.
# ErasureCodingWorker tries to recover idx_6 at DN0 and fails because DN0 complains that the replica already exists.

A delayed heartbeat is the direct cause of the failed tests. The deeper cause is not in the test code; it lies in defects in handling 2 concurrent EC recovery tasks:
# Defect in ReplicationMonitor: it shouldn't choose the same DataNode as a target twice for the same block.
# Defect in ErasureCodingWorker: it doesn't know which internal blocks are being recovered or have already been recovered. It purely guesses from the live nodes.
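One way to picture the first defect is a target chooser that remembers which DataNodes already have an in-flight recovery for a block group and skips them on the next round. The sketch below is purely illustrative and assumes nothing about the actual 03 patch: the class, its methods, and the string DataNode names are hypothetical, and real target selection goes through BlockPlacementPolicy rather than a simple list scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: never hand out the same DataNode twice for one block group. */
class RecoveryTargetSketch {
  /** Block group id -> DataNodes that are targets of a recovery not yet reported. */
  private final Map<Long, Set<String>> inFlightTargets = new HashMap<>();

  /** Picks up to 'needed' candidates, skipping DataNodes with an in-flight recovery. */
  List<String> chooseTargets(long groupId, List<String> candidates, int needed) {
    Set<String> inFlight = inFlightTargets.computeIfAbsent(groupId, k -> new HashSet<>());
    List<String> chosen = new ArrayList<>();
    for (String dn : candidates) {
      if (chosen.size() == needed) {
        break;
      }
      if (!inFlight.contains(dn) && !chosen.contains(dn)) {
        chosen.add(dn);
      }
    }
    inFlight.addAll(chosen);
    return chosen;
  }

  /** Called once the recovered internal block on 'dn' has been reported. */
  void blockReported(long groupId, String dn) {
    Set<String> inFlight = inFlightTargets.get(groupId);
    if (inFlight != null) {
      inFlight.remove(dn);
    }
  }

  public static void main(String[] args) {
    RecoveryTargetSketch s = new RecoveryTargetSketch();
    // 1st recovery: DN1 is considered busy, so only DN0 and DN11 are candidates.
    List<String> first = s.chooseTargets(1L, Arrays.asList("DN0", "DN11"), 3);
    // 2nd recovery, before idx_6 and idx_7 are reported: DN1 is available again.
    List<String> second = s.chooseTargets(1L, Arrays.asList("DN0", "DN1", "DN11"), 1);
    System.out.println("1st targets: " + first + ", 2nd targets: " + second);
  }
}
```

With this bookkeeping the second round would skip DN0 and DN11 while their recoveries are unreported. It does not address the second defect: the worker would still need to know which internal block indices are already being rebuilt.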
[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile
[ https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969353#comment-14969353 ]

Walter Su commented on HDFS-9275:
---------------------------------

I cancelled the 02 patch and uploaded the 03 patch, which should fix this. It is just an initial patch.
[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile
[ https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968588#comment-14968588 ]

Walter Su commented on HDFS-9275:
---------------------------------

||DN0||DN1||DN2||DN3||DN4||DN5||DN6||DN7||DN8||DN9||DN10||DN11|| ||
|*|*|*|*|*|*|*|*|*| | | | <-- BlockGroup_0 |
| | |*|*|*|*|*|*|*|*|*| | <-- BlockGroup_1 |

The test case only tests the last block group. Suppose DN8~10 are shut down. ReplicationMonitor will schedule a recovery. It first needs to call BlockPlacementPolicy to choose targets. DN2~DN10 are excluded because they already have internal blocks on them. To recover 3 blocks, it must choose DN0, DN1 and DN11. But DN1 has a block belonging to BlockGroup_0. The last time DN1 sent a heartbeat, it reported an {{xceiverCount}} of 3. {{xceiverCount}} equals the number of active threads in DataNode.threadGroup, as shown below.
{noformat}
DatanodeRegistration(127.0.0.1:47705, datanodeUuid=43e5be32-2066-4057-9b25-8544d2d542bc, infoPort=43445, infoSecurePort=0, ipcPort=34036, storageInfo=lv=-56;cid=testClusterID;nsid=23260287;c=1445489667626)
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]
    Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@6aa03871,5,dataXceiverServer]
    Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-1867405584_1 at /127.0.0.1:56717 [Receiving block BP-1612020377-9.96.1.34-1445489667626:blk_-9223372036854775791_1001],5,dataXceiverServer]
    Thread[PacketResponder: BP-1612020377-9.96.1.34-1445489667626:blk_-9223372036854775791_1001, type=LAST_IN_PIPELINE, downstreams=0:[],5,dataXceiverServer]
{noformat}
An {{xceiverCount}} of 3 is larger than the average, so DN1 is excluded by {{chooseRandom()}}. BlockGroup_1 can then recover only 2 blocks.

As discussed [here|https://issues.apache.org/jira/browse/HDFS-8220?focusedCommentId=14518931=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14518931], for now PlacementPolicy doesn't support returning two identical storages, i.e. no 2 replicas (internal blocks) on the same storage.

We could simply add more DNs to fix the test. Or we can set {{DFS_NAMENODE_REPLICATION_CONSIDERLOAD_KEY}} to false in the test case.

The 02 patch includes some cleanup. Kindly review. Thanks.
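The effect of the load check on this layout can be reproduced with a small model. It is a sketch under stated assumptions rather than the real BlockPlacementPolicy code: the class and method names are hypothetical, targets are scanned in order instead of chosen at random, an idle DataNode is assumed to report an {{xceiverCount}} of 1, and a node is assumed to count as too busy when its count exceeds twice the cluster average.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model of load-aware target selection. */
class ConsiderLoadSketch {
  /** Assumption of this sketch: a node is too busy above 2x the average load. */
  static final double LOAD_FACTOR = 2.0;

  /** Returns up to 'needed' DataNode indices that are neither excluded nor too busy. */
  static List<Integer> chooseTargets(int[] xceiverCounts, Set<Integer> excluded,
      boolean considerLoad, int needed) {
    double total = 0;
    for (int count : xceiverCounts) {
      total += count;
    }
    double maxLoad = LOAD_FACTOR * total / xceiverCounts.length;
    List<Integer> chosen = new ArrayList<>();
    for (int dn = 0; dn < xceiverCounts.length && chosen.size() < needed; dn++) {
      if (excluded.contains(dn)) {
        continue; // already holds an internal block of this block group
      }
      if (considerLoad && xceiverCounts[dn] > maxLoad) {
        continue; // reported as too busy in its last heartbeat
      }
      chosen.add(dn);
    }
    return chosen;
  }

  /** The 12-DataNode layout from the comment: DN2~DN10 excluded, DN1 reports 3. */
  static List<Integer> scenario(boolean considerLoad) {
    int[] load = new int[12];
    Arrays.fill(load, 1); // assumed idle xceiverCount
    load[1] = 3;          // DN1's stale heartbeat value
    Set<Integer> excluded = new HashSet<>();
    for (int dn = 2; dn <= 10; dn++) {
      excluded.add(dn);
    }
    return chooseTargets(load, excluded, considerLoad, 3);
  }

  public static void main(String[] args) {
    System.out.println("considerLoad=true:  " + scenario(true));
    System.out.println("considerLoad=false: " + scenario(false));
  }
}
```

Under these assumptions the average load is 14/12, so the threshold is about 2.33 and DN1 is skipped, leaving only DN0 and DN11. With the load check disabled, all three of DN0, DN1 and DN11 are eligible, which is the effect the test would get by setting {{DFS_NAMENODE_REPLICATION_CONSIDERLOAD_KEY}} to false in its configuration.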
[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile
[ https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968587#comment-14968587 ]

Hadoop QA commented on HDFS-9275:
---------------------------------

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 8m 5s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. |
| {color:green}+1{color} | javac | 8m 3s | There were no new javac warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 1m 24s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace | 0m 1s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 32s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native | 1m 6s | Pre-build of native portion |
| {color:green}+1{color} | hdfs tests | 50m 37s | Tests passed in hadoop-hdfs. |
| | | 74m 19s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12767952/HDFS-9275.02.patch |
| Optional Tests | javac unit findbugs checkstyle |
| git revision | trunk / 381610d |
| Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13121/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html |
| whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/13121/artifact/patchprocess/whitespace.txt |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13121/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13121/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf900.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13121/console |

This message was automatically generated.