[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131507#comment-16131507 ]

SammiChen commented on HDFS-9822:

Hi [~andrew.wang], I haven't been able to reproduce the issue so far. I will see whether it can meet the beta1 timeline.

> Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
>
> Key: HDFS-9822
> URL: https://issues.apache.org/jira/browse/HDFS-9822
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: erasure-coding
> Reporter: Tsz Wo Nicholas Sze
> Assignee: SammiChen
> Labels: hdfs-ec-3.0-nice-to-have
> Attachments: HDFS-9822-001.patch, HDFS-9822-002.patch
>
> Found the following AssertionError in
> https://builds.apache.org/job/PreCommit-HDFS-Build/14501/testReport/org.apache.hadoop.hdfs.server.namenode/TestReconstructStripedBlocks/testMissingStripedBlockWithBusyNode2/
> {code}
> AssertionError: Should wait the previous reconstruction to finish
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.validateReconstructionWork(BlockManager.java:1680)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1536)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1472)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4229)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4100)
>   at java.lang.Thread.run(Thread.java:745)
>   at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126)
>   at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4119)
>   at java.lang.Thread.run(Thread.java:745)
> {code}

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130881#comment-16130881 ]

Hadoop QA commented on HDFS-9822:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 6s{color} | {color:red} HDFS-9822 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-9822 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791223/HDFS-9822-002.patch |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/20742/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130860#comment-16130860 ]

Andrew Wang commented on HDFS-9822:

Hey Sammi, are you still planning to work on this for beta1?
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950495#comment-15950495 ]

Rakesh R commented on HDFS-9822:

[~Sammi], sure, please feel free to take this. I will help with reviews and discussions. You could probably dig further into the code. Also, please refer to the test case in the attached patch; it may give some hints.
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950385#comment-15950385 ]

Hadoop QA commented on HDFS-9822:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 8s{color} | {color:red} HDFS-9822 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-9822 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791223/HDFS-9822-002.patch |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/18924/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950376#comment-15950376 ]

SammiChen commented on HDFS-9822:

Hi [~rakeshr], do you plan to continue working on this? If you don't have time, maybe I can take it.
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817023#comment-15817023 ]

Hadoop QA commented on HDFS-9822:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s{color} | {color:red} HDFS-9822 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-9822 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791223/HDFS-9822-002.patch |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/18146/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816843#comment-15816843 ]

Andrew Wang commented on HDFS-9822:

This one looks like a good bug. I'm adding it to "nice-to-have", and it might get upgraded to "must-do". [~rakeshr], do you still plan to work on it?
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189704#comment-15189704 ]

Rakesh R commented on HDFS-9822:

Thanks [~zhz] for the advice. I will consider this in the next patch.
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189690#comment-15189690 ]

Rakesh R commented on HDFS-9822:

Thanks Walter, Li Bo. If I remember correctly, I observed this once in my environment during block corruption testing. Let me analyse it further and get back to you.
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186974#comment-15186974 ]

Walter Su commented on HDFS-9822:

bq. I am still a little confused how this error happens.
Me too. I don't think we have found the real cause.

bq. But if there are same block group entry exists in different queue..
No two queues can hold the same BG; the update(..) logic is correct. Nor can a single queue hold the same item twice, because each queue is a HashSet.

My pure guess is that it's caused by a race condition. We have a guard at
{code}
// BlockManager#scheduleReconstruction(..)
if (block.isStriped()) {
  if (pendingNum > 0) {
    // Wait the previous reconstruction to finish.
    return null;
  }
{code}
which runs inside the namesystem lock. But the {{ReplicationMonitor}} thread releases the lock before it reaches {{validateReconstructionWork(..)}}, so the junit thread can acquire the lock in between. If both threads pass the guard, one of them will eventually fail the assert.
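The hypothesis in this comment is a classic check-then-act race: the guard and the later validation run under separate acquisitions of the lock. The following is a minimal, deterministic sketch of that pattern and is not the actual BlockManager code. The class and method names (`GuardRaceSketch`, `passesGuard`, `markPending`, `scheduleAtomically`) are hypothetical, and `pendingNum` only stands in for the pending-reconstruction count mentioned in the guard. The interleaving is replayed on one thread so the outcome does not depend on timing.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names exist in Hadoop.
public class GuardRaceSketch {
    private final ReentrantLock namesystemLock = new ReentrantLock();
    private int pendingNum = 0; // pending reconstructions for one striped block group

    /** The guard: passes when nothing is pending. The lock is released on return. */
    public boolean passesGuard() {
        namesystemLock.lock();
        try {
            return pendingNum == 0;
        } finally {
            namesystemLock.unlock();
        }
    }

    /** The later validation step, run under a fresh lock acquisition. */
    public void markPending() {
        namesystemLock.lock();
        try {
            if (pendingNum > 0) {
                // Stands in for the assert that fired in the reported stack trace.
                throw new IllegalStateException(
                    "Should wait the previous reconstruction to finish");
            }
            pendingNum++;
        } finally {
            namesystemLock.unlock();
        }
    }

    /** Guard and bookkeeping in one critical section: only one caller can win. */
    public boolean scheduleAtomically() {
        namesystemLock.lock();
        try {
            if (pendingNum > 0) {
                return false; // someone else is already reconstructing
            }
            pendingNum++;
            return true;
        } finally {
            namesystemLock.unlock();
        }
    }

    public static void main(String[] args) {
        GuardRaceSketch racy = new GuardRaceSketch();
        // Both "threads" check the guard before either one records pending work.
        boolean monitorPassed = racy.passesGuard();
        boolean testThreadPassed = racy.passesGuard();
        System.out.println("both passed guard: " + (monitorPassed && testThreadPassed));

        racy.markPending(); // first scheduler succeeds
        try {
            racy.markPending(); // second scheduler trips the check
        } catch (IllegalStateException e) {
            System.out.println("second scheduler failed: " + e.getMessage());
        }

        GuardRaceSketch safe = new GuardRaceSketch();
        System.out.println("first atomic attempt: " + safe.scheduleAtomically());
        System.out.println("second atomic attempt: " + safe.scheduleAtomically());
    }
}
```

Folding the check and the update into one critical section is a conventional remedy for this kind of race; whether it fits the real code path, where the lock is dropped on purpose, is a separate question that the attached patches would have to answer.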
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186510#comment-15186510 ]

Li Bo commented on HDFS-9822:

Hi Rakesh,
After reading the code of {{UnderReplicationBlocks}} I am still a little confused how this error happens. Since the situation is difficult to reproduce, how about creating a unit test case that simulates the error situation?
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186150#comment-15186150 ]

Kai Zheng commented on HDFS-9822:

BIN ZHOU => [~zhz].
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186145#comment-15186145 ]

Kai Zheng commented on HDFS-9822:

Thanks [~zhzhoubin].

bq. I don't think we should schedule all EC tasks in the same queue.

Yeah, I agree. My earlier wording was confusing, so let me clarify. What I meant is that striped blocks can be tracked in a separate set of queues dedicated to striped files, in units of block groups rather than internal blocks within groups. So if a group loses 3 internal blocks, only one entry instead of 3 is maintained in the queue(s).

bq. If a block group has lost 3 internal blocks, we should treat it with higher priority than one that has lost 1.

That's right. So when a new internal block is reported bad, the existing entry for the block group absorbs it and no new entry is created, right? I think this way we can basically avoid generating multiple reconstruction tasks.

> Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
>
> Key: HDFS-9822
> URL: https://issues.apache.org/jira/browse/HDFS-9822
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: erasure-coding
> Reporter: Tsz Wo Nicholas Sze
> Assignee: Rakesh R
> Attachments: HDFS-9822-001.patch, HDFS-9822-002.patch
>
> Found the following AssertionError in
> https://builds.apache.org/job/PreCommit-HDFS-Build/14501/testReport/org.apache.hadoop.hdfs.server.namenode/TestReconstructStripedBlocks/testMissingStripedBlockWithBusyNode2/
> {code}
> AssertionError: Should wait the previous reconstruction to finish
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.validateReconstructionWork(BlockManager.java:1680)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1536)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1472)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4229)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4100)
> at java.lang.Thread.run(Thread.java:745)
> at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126)
> at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4119)
> at java.lang.Thread.run(Thread.java:745)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
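Kai Zheng's idea above (one entry per block group, with later bad-block reports merged into that entry) might be sketched roughly as follows. This is only an illustration under stated assumptions: `BlockGroupTracker`, `reportBad`, `lostCount` and `pendingGroups` are hypothetical names invented here, not HDFS classes or methods, and the real bookkeeping in the NameNode is considerably more involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names exist in HDFS.
class BlockGroupTracker {
    // Block group ID -> indices of internal blocks currently reported bad.
    private final Map<Long, Set<Integer>> badByGroup = new HashMap<>();

    /**
     * Records a bad internal block. Returns true only if this report created
     * the entry for the group; later reports merge into the existing entry.
     */
    boolean reportBad(long blockGroupId, int internalBlockIndex) {
        Set<Integer> bad = badByGroup.get(blockGroupId);
        boolean created = (bad == null);
        if (created) {
            bad = new TreeSet<>();
            badByGroup.put(blockGroupId, bad);
        }
        bad.add(internalBlockIndex);
        return created;
    }

    /** Number of internal blocks lost by the group (more losses, more urgent). */
    int lostCount(long blockGroupId) {
        Set<Integer> bad = badByGroup.get(blockGroupId);
        return bad == null ? 0 : bad.size();
    }

    /** Number of block groups that currently have an entry. */
    int pendingGroups() {
        return badByGroup.size();
    }

    public static void main(String[] args) {
        BlockGroupTracker tracker = new BlockGroupTracker();
        long group = 42L; // made-up block group ID
        tracker.reportBad(group, 0);
        tracker.reportBad(group, 3);
        System.out.println(tracker.pendingGroups() + " group(s), "
            + tracker.lostCount(group) + " internal block(s) lost");
    }
}
```

Because the entry is keyed by block group, a second or third report can only raise `lostCount`, which a scheduler could use as the priority signal discussed in the comment, and `pendingGroups` gives the count of affected groups directly.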
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186017#comment-15186017 ]

Zhe Zhang commented on HDFS-9822:

Interesting case here. I think it happens when 1) an under-replicated striped block is put into a queue; 2) the block group loses another internal block; 3) we then put another entry for the same block group into a higher-priority queue.

To answer Kai's question: I don't think we should schedule all EC tasks in the same queue. If a block group has lost 3 internal blocks, we should treat it with higher priority than one that has lost 1.

Thanks for the fix [~rakeshr]. The logic below only checks the {{oldPri}} queue. I guess the block could be in other queues as well?
{code}
else if (block.isStriped() && !priorityQueues.get(oldPri).contains(block)) {
{code}
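The concern that the block could sit in a queue other than {{oldPri}} can be illustrated with a small stand-alone sketch that clears every other level before adding the entry at the new one. This is not the patch under review: `AllQueuesUpdateSketch`, its methods, the use of plain strings as block groups and the number of levels are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for a list of per-priority queues.
class AllQueuesUpdateSketch {
    static final int LEVELS = 5; // assumed number of priority levels

    private final List<Set<String>> priorityQueues = new ArrayList<>();

    AllQueuesUpdateSketch() {
        for (int i = 0; i < LEVELS; i++) {
            priorityQueues.add(new LinkedHashSet<>());
        }
    }

    /** Places the block group at curPri and removes it from every other level. */
    void update(String blockGroup, int curPri) {
        for (int level = 0; level < LEVELS; level++) {
            if (level != curPri) {
                priorityQueues.get(level).remove(blockGroup);
            }
        }
        priorityQueues.get(curPri).add(blockGroup);
    }

    /** Counts how many queues currently hold the block group. */
    int occurrences(String blockGroup) {
        int count = 0;
        for (Set<String> queue : priorityQueues) {
            if (queue.contains(blockGroup)) {
                count++;
            }
        }
        return count;
    }

    /** Returns the level holding the block group, or -1 if it is not queued. */
    int levelOf(String blockGroup) {
        for (int level = 0; level < LEVELS; level++) {
            if (priorityQueues.get(level).contains(blockGroup)) {
                return level;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        AllQueuesUpdateSketch queues = new AllQueuesUpdateSketch();
        queues.update("bg-1", 2); // first loss
        queues.update("bg-1", 1); // second loss, whatever oldPri was believed to be
        System.out.println(queues.occurrences("bg-1") + " entry at level " + queues.levelOf("bg-1"));
    }
}
```

Scanning all levels costs one lookup per level on every update, which is cheap for a handful of levels but is a trade-off the real fix would have to weigh.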
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178689#comment-15178689 ]

Hadoop QA commented on HDFS-9822:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 7m 49s | trunk passed |
| +1 | compile | 0m 50s | trunk passed with JDK v1.8.0_72 |
| +1 | compile | 0m 46s | trunk passed with JDK v1.7.0_95 |
| +1 | checkstyle | 0m 23s | trunk passed |
| +1 | mvnsite | 1m 3s | trunk passed |
| +1 | mvneclipse | 0m 14s | trunk passed |
| +1 | findbugs | 2m 12s | trunk passed |
| +1 | javadoc | 1m 17s | trunk passed with JDK v1.8.0_72 |
| +1 | javadoc | 2m 0s | trunk passed with JDK v1.7.0_95 |
| +1 | mvninstall | 0m 51s | the patch passed |
| +1 | compile | 0m 46s | the patch passed with JDK v1.8.0_72 |
| +1 | javac | 0m 46s | the patch passed |
| +1 | compile | 0m 44s | the patch passed with JDK v1.7.0_95 |
| +1 | javac | 0m 44s | the patch passed |
| +1 | checkstyle | 0m 21s | the patch passed |
| +1 | mvnsite | 1m 0s | the patch passed |
| +1 | mvneclipse | 0m 12s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 2m 27s | the patch passed |
| +1 | javadoc | 1m 13s | the patch passed with JDK v1.8.0_72 |
| +1 | javadoc | 2m 0s | the patch passed with JDK v1.7.0_95 |
| -1 | unit | 71m 41s | hadoop-hdfs in the patch failed with JDK v1.8.0_72. |
| -1 | unit | 62m 8s | hadoop-hdfs in the patch failed with JDK v1.7.0_95. |
| +1 | asflicense | 0m 26s | Patch does not generate ASF License warnings. |
| | | 162m 54s | |

|| Reason || Tests ||
| JDK v1.8.0_72 Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner |
| | hadoop.hdfs.TestEncryptionZones |
| JDK v1.7.0_95 Failed junit tests | hadoop.hdfs.server.namenode.TestEditLog |
| | hadoop.tracing.TestTracing |
| | hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791223/HDFS-9822-002.patch |
| JIRA Issue | HDFS-9822 |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux eed7f176f6ca 3.13.0-36-lowlatency #63-Ubuntu SMP
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178302#comment-15178302 ]

Rakesh R commented on HDFS-9822:

Thanks a lot [~drankye] for your interest and useful comments.

bq. 1. Why are multiple reconstruction tasks for the same striped block or block group figured out and put into queues?

I came across this situation while testing corrupted striped blocks. I think it is not a straightforward scenario, and unfortunately it occurred only once in my env. Please see the logs below, where the same block group {{9223372036854775792_1001}} is added to two different priority queues. Initially the block {{9223372036854775792_1001}} is added to the neededReplications {{priority queue 2}}. The second time, while the addStoredBlock request is being reported, the same block group {{9223372036854775792_1001}} is added to the neededReplications {{priority queue 1}}.
{code}
2016-03-03 11:42:42,544 DEBUG BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_-9223372036854775792 added as corrupt on 127.0.0.1:7517 by null because TEST
2016-03-03 11:42:42,545 DEBUG org.apache.hadoop.hdfs.StateChange: UnderReplicationBlocks.update blk_-9223372036854775792_1001 curReplicas 8 curExpectedReplicas 9 oldReplicas 9 oldExpectedReplicas 9 curPri 2 oldPri 3
2016-03-03 11:42:42,545 DEBUG BlockStateChange: BLOCK* NameSystem.UnderReplicationBlock.update: blk_-9223372036854775792_1001 has only 8 replicas and needs 9 replicas so is added to neededReplications at priority level 2
{code}
{code}
2016-03-03 11:42:42,920 WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock request received for blk_-9223372036854775792_1001 on node 127.0.0.1:7517 size 786432
2016-03-03 11:42:42,921 DEBUG org.apache.hadoop.hdfs.StateChange: UnderReplicationBlocks.update blk_-9223372036854775792_1001 curReplicas 7 curExpectedReplicas 9 oldReplicas 7 oldExpectedReplicas 9 curPri 1 oldPri 1
2016-03-03 11:42:42,921 DEBUG BlockStateChange: BLOCK* NameSystem.UnderReplicationBlock.update: blk_-9223372036854775792_1001 has only 7 replicas and needs 9 replicas so is added to neededReplications at priority level 1
{code}

bq. 2. Is it possible to maintain a separate queue for striped block groups, where a block group is ensured to be put in exactly once?

As we know, both contiguous and striped under-replicated blocks can exist in the system at the same time. Currently, when under-replicated blocks are chosen for reconstruction, contiguous and striped blocks share a natural ordering. Providing a separate queue is an interesting idea. Just a quick thought: with a separate queue for striped blocks, I wonder how efficiently we could maintain the ordering between under-replicated contiguous and striped blocks.
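The two log excerpts above can be replayed in a small stand-alone model to show how one block group ends up in two queues. The model assumes, purely for illustration, that the update step removes the block only from the {{oldPri}} queue (and only when {{oldPri}} differs from {{curPri}}) before adding it at {{curPri}}; `StaleOldPriorityDemo` and its methods are hypothetical and are not copied from the HDFS source.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of per-priority queues; not the HDFS implementation.
class StaleOldPriorityDemo {
    private final List<Set<String>> queues = new ArrayList<>();

    StaleOldPriorityDemo(int levels) {
        for (int i = 0; i < levels; i++) {
            queues.add(new LinkedHashSet<>());
        }
    }

    /** Assumed update step: only the oldPri queue is ever cleaned. */
    void update(String block, int curPri, int oldPri) {
        if (oldPri != curPri) {
            queues.get(oldPri).remove(block);
        }
        queues.get(curPri).add(block);
    }

    /** Lists every priority level whose queue contains the block. */
    List<Integer> levelsHolding(String block) {
        List<Integer> levels = new ArrayList<>();
        for (int level = 0; level < queues.size(); level++) {
            if (queues.get(level).contains(block)) {
                levels.add(level);
            }
        }
        return levels;
    }

    public static void main(String[] args) {
        StaleOldPriorityDemo demo = new StaleOldPriorityDemo(5);
        String blockGroup = "blk_-9223372036854775792_1001";
        demo.update(blockGroup, 2, 3); // first excerpt: curPri 2, oldPri 3
        demo.update(blockGroup, 1, 1); // second excerpt: curPri 1, oldPri 1
        System.out.println(demo.levelsHolding(blockGroup)); // prints [1, 2]
    }
}
```

In the second call the computed {{oldPri}} equals {{curPri}}, so nothing is removed even though the entry actually lives at level 2, and the block group is then queued at both levels.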
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177980#comment-15177980 ]

Kai Zheng commented on HDFS-9822:

Thanks [~rakeshr] for working on this. I have some questions.

1. Why are multiple reconstruction tasks for the same striped block or block group figured out and put into queues?

2. Is it possible to maintain a separate queue for striped block groups, where a block group is ensured to be put in exactly once? Whenever a striped block in a block group is reported missing/corrupt, the block group is identified and the queue is checked for an existing task for it; a task is added only the first time. This way we can avoid this issue completely. An extra benefit is that if an admin wants to know how many striped block groups are affected, it can easily be figured out.

Ping [~zhz] for the discussion.
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177752#comment-15177752 ]

Hadoop QA commented on HDFS-9822:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 13s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 7m 7s | trunk passed |
| +1 | compile | 0m 54s | trunk passed with JDK v1.8.0_72 |
| +1 | compile | 0m 43s | trunk passed with JDK v1.7.0_95 |
| +1 | checkstyle | 0m 19s | trunk passed |
| +1 | mvnsite | 0m 51s | trunk passed |
| +1 | mvneclipse | 0m 12s | trunk passed |
| +1 | findbugs | 1m 55s | trunk passed |
| +1 | javadoc | 1m 15s | trunk passed with JDK v1.8.0_72 |
| +1 | javadoc | 1m 50s | trunk passed with JDK v1.7.0_95 |
| +1 | mvninstall | 0m 50s | the patch passed |
| +1 | compile | 0m 53s | the patch passed with JDK v1.8.0_72 |
| +1 | javac | 0m 53s | the patch passed |
| +1 | compile | 0m 41s | the patch passed with JDK v1.7.0_95 |
| +1 | javac | 0m 41s | the patch passed |
| +1 | checkstyle | 0m 19s | the patch passed |
| +1 | mvnsite | 0m 54s | the patch passed |
| +1 | mvneclipse | 0m 11s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 2m 14s | the patch passed |
| +1 | javadoc | 1m 13s | the patch passed with JDK v1.8.0_72 |
| +1 | javadoc | 1m 54s | the patch passed with JDK v1.7.0_95 |
| -1 | unit | 66m 56s | hadoop-hdfs in the patch failed with JDK v1.8.0_72. |
| -1 | unit | 60m 34s | hadoop-hdfs in the patch failed with JDK v1.7.0_95. |
| +1 | asflicense | 0m 20s | Patch does not generate ASF License warnings. |
| | | 154m 32s | |

|| Reason || Tests ||
| JDK v1.8.0_72 Failed junit tests | hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency |
| | hadoop.hdfs.server.mover.TestStorageMover |
| | hadoop.hdfs.server.namenode.TestEditLog |
| | hadoop.hdfs.security.TestDelegationTokenForProxyUser |
| | hadoop.hdfs.TestFileAppend |
| JDK v1.7.0_95 Failed junit tests | hadoop.metrics2.sink.TestRollingFileSystemSinkWithSecureHdfs |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791135/HDFS-9822-001.patch |
| JIRA Issue | HDFS-9822 |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite
[jira] [Commented] (HDFS-9822) Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177577#comment-15177577 ]

Rakesh R commented on HDFS-9822:

{{BlockManager#computeReconstructionWorkForBlocks()}} gets all the blocks to be reconstructed for each priority and schedules the reconstruction tasks together. IIUC, a striped block won't have multiple reconstruction tasks scheduled at the same time. But if entries for the {{same block group}} exist in different queues, that can cause the error described in the jira. I've attempted a fix for this situation.

[~szetszwo], could you please review the analysis and the proposed fix? Thanks!