[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-12043:
-------------------------------
    Fix Version/s:     (was: 3.0.0-alpha4)
                       3.0.0-beta1

> Add counters for block re-replication
> -------------------------------------
>
>                 Key: HDFS-12043
>                 URL: https://issues.apache.org/jira/browse/HDFS-12043
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Chen Liang
>            Assignee: Chen Liang
>             Fix For: 2.9.0, 3.0.0-beta1
>
>         Attachments: HDFS-12043.001.patch, HDFS-12043.002.patch, HDFS-12043.003.patch, HDFS-12043.004.patch, HDFS-12043-branch-2.005.patch
>
>
> We occasionally see that the under-replicated block count is not going down quickly enough. We've made at least one fix to speed up block replications (HDFS-9205) but we need better insight into the current state and activity of the block re-replication logic. For example, we need to understand whether it is because re-replication is not making forward progress at all, or because new under-replicated blocks are being added faster.
> We should include additional metrics:
> # Cumulative number of blocks that were successfully replicated.
> # Cumulative number of re-replications that timed out.
> # Cumulative number of blocks that were dequeued for re-replication but not scheduled, e.g. because they were invalid, under construction, or replication was postponed.
>
> The growth rate of the above metrics will make it clear whether block replication is making forward progress and, if not, provide potential clues about why it is stalled.
> Thanks [~arpitagarwal] for the offline discussions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
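The three cumulative counters proposed in the HDFS-12043 description, and the "growth rate" reading of them, can be illustrated with a small sketch. This is not the code from the attached patches: the class name `ReReplicationCounters`, its method names, and the `growthPerSecond` helper are all hypothetical, and only JDK classes are used. It assumes an observer takes two snapshots some seconds apart and compares them.

```java
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical holder for the three cumulative counters described in the
 * issue. Names are illustrative and are not taken from the patch.
 */
final class ReReplicationCounters {
  private final LongAdder successful = new LongAdder();   // blocks replicated successfully
  private final LongAdder timedOut = new LongAdder();     // re-replications that timed out
  private final LongAdder notScheduled = new LongAdder(); // dequeued but not scheduled

  void incrSuccessful() { successful.increment(); }
  void incrTimedOut() { timedOut.increment(); }
  void incrNotScheduled() { notScheduled.increment(); }

  /** Returns {successful, timedOut, notScheduled} at the time of the call. */
  long[] snapshot() {
    return new long[] {successful.sum(), timedOut.sum(), notScheduled.sum()};
  }

  /**
   * Per-second growth of each counter between two snapshots. Because the
   * counters only ever increase, each rate is non-negative.
   */
  static double[] growthPerSecond(long[] earlier, long[] later, double elapsedSeconds) {
    if (elapsedSeconds <= 0) {
      throw new IllegalArgumentException("elapsedSeconds must be positive");
    }
    double[] rates = new double[earlier.length];
    for (int i = 0; i < earlier.length; i++) {
      rates[i] = (later[i] - earlier[i]) / elapsedSeconds;
    }
    return rates;
  }
}
```

Read this way, a near-zero rate for the first counter alongside a rising second or third counter would suggest that re-replication itself is stalled, whereas a healthy first rate with a still-growing under-replicated count would point to new under-replicated blocks arriving faster than they are handled.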
[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal updated HDFS-12043:
---------------------------------
    Fix Version/s: 2.9.0

Committed to branch-2. Thanks for the backport [~vagarychen].

> Add counters for block re-replication
> -------------------------------------
>
>             Fix For: 2.9.0, 3.0.0-alpha4
>         Attachments: HDFS-12043.001.patch, HDFS-12043.002.patch, HDFS-12043.003.patch, HDFS-12043.004.patch, HDFS-12043-branch-2.005.patch
[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-12043:
------------------------------
    Attachment: HDFS-12043-branch-2.005.patch

Post v005 patch for branch-2.

> Add counters for block re-replication
> -------------------------------------
>
>             Fix For: 3.0.0-alpha4
>         Attachments: HDFS-12043.001.patch, HDFS-12043.002.patch, HDFS-12043.003.patch, HDFS-12043.004.patch, HDFS-12043-branch-2.005.patch
[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal updated HDFS-12043:
---------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 3.0.0-alpha4
           Status: Resolved  (was: Patch Available)

Committed to trunk, thanks for the contribution [~vagarychen]!

> Add counters for block re-replication
> -------------------------------------
>
>             Fix For: 3.0.0-alpha4
>         Attachments: HDFS-12043.001.patch, HDFS-12043.002.patch, HDFS-12043.003.patch, HDFS-12043.004.patch
[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-12043:
------------------------------
    Attachment: HDFS-12043.004.patch

Thanks [~arpitagarwal], who pointed out offline that the use of {{Thread.sleep()}} in the test can be unreliable. Post v004 patch to use {{GenericTestUtils.waitFor()}} instead.

> Add counters for block re-replication
> -------------------------------------
>
>         Attachments: HDFS-12043.001.patch, HDFS-12043.002.patch, HDFS-12043.003.patch, HDFS-12043.004.patch
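To illustrate why polling for a condition is more dependable in a test than a fixed sleep, here is a minimal, JDK-only sketch of a wait-until-true helper. It is a hypothetical stand-in written for this note, not Hadoop's {{GenericTestUtils.waitFor()}} implementation; the class name `PollingWait` and the parameter names are assumptions.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; illustrative only, not Hadoop code. */
final class PollingWait {
  private PollingWait() {}

  /**
   * Re-evaluates {@code check} every {@code checkEveryMillis} until it returns
   * true, and throws TimeoutException once {@code waitForMillis} has elapsed.
   * A fixed sleep either wastes time or is too short on a slow machine; the
   * loop below returns as soon as the condition holds.
   */
  static void waitFor(BooleanSupplier check, long checkEveryMillis, long waitForMillis)
      throws TimeoutException, InterruptedException {
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitForMillis);
    while (!check.getAsBoolean()) {
      if (System.nanoTime() - deadlineNanos >= 0) {
        throw new TimeoutException("Condition not met within " + waitForMillis + " ms");
      }
      Thread.sleep(checkEveryMillis);
    }
  }
}
```

A test would then wait on the metric itself, for example `PollingWait.waitFor(() -> counter.get() == expected, 100, 10000)`, where `counter` and `expected` are whatever the test is observing.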
[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-12043:
------------------------------
    Attachment: HDFS-12043.003.patch

Thanks [~arpitagarwal] for the comments! Post v003 patch to rename the metrics and add a counter increment to the {{if (pendingNum > 0)}} check.

> Add counters for block re-replication
> -------------------------------------
>
>         Attachments: HDFS-12043.001.patch, HDFS-12043.002.patch, HDFS-12043.003.patch
[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-12043:
------------------------------
    Attachment: HDFS-12043.002.patch

Thanks [~arpitagarwal] for the comments! Post v002 patch. The other comments are addressed.

Regarding the {{if (pendingNum > 0)}} branch, my understanding is that this means there are unfinished replications going on, not necessarily failed re-replications. A pending replication could still finish successfully, or it may time out and be counted by the other timeout counter. What do you think?

Also, the v002 patch moves the increment of the timed-out re-replication counter to the place where the timeout gets detected, in {{PendingReconstructionBlocks}}'s thread. The v001 patch actually delayed the increment by making the call in {{BlockManager}}'s thread.

> Add counters for block re-replication
> -------------------------------------
>
>         Attachments: HDFS-12043.001.patch, HDFS-12043.002.patch
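The point about counting a timeout where it is detected, rather than later when another thread picks up the timed-out blocks, can be sketched as follows. This is a simplified, single-class illustration under stated assumptions, not the logic of {{PendingReconstructionBlocks}} or {{BlockManager}}: the class `PendingTimeoutSketch`, its methods, and the use of plain block ids and caller-supplied timestamps are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: count a timeout at the moment the monitor detects it. */
final class PendingTimeoutSketch {
  private final Map<Long, Long> pendingSince = new ConcurrentHashMap<>();
  private final List<Long> timedOutBlocks = new ArrayList<>();
  private final AtomicLong numTimedOut = new AtomicLong();
  private final long timeoutMillis;

  PendingTimeoutSketch(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Records that re-replication of blockId was scheduled at nowMillis. */
  void addPending(long blockId, long nowMillis) {
    pendingSince.put(blockId, nowMillis);
  }

  /** One pass of the monitor: the counter is bumped here, at detection time. */
  synchronized void checkTimeouts(long nowMillis) {
    Iterator<Map.Entry<Long, Long>> it = pendingSince.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<Long, Long> entry = it.next();
      long blockId = entry.getKey();
      if (nowMillis - entry.getValue() > timeoutMillis) {
        it.remove();
        timedOutBlocks.add(blockId);
        numTimedOut.incrementAndGet();
      }
    }
  }

  /**
   * Called later by a consumer to re-queue timed-out blocks. Incrementing the
   * counter here instead would delay the metric until the consumer runs.
   */
  synchronized List<Long> drainTimedOut() {
    List<Long> drained = new ArrayList<>(timedOutBlocks);
    timedOutBlocks.clear();
    return drained;
  }

  long getNumTimedOut() {
    return numTimedOut.get();
  }
}
```

With this arrangement the metric already reflects a timeout after `checkTimeouts` runs, even if no consumer has called `drainTimedOut` yet.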
[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal updated HDFS-12043:
---------------------------------
    Status: Open  (was: Patch Available)

> Add counters for block re-replication
> -------------------------------------
>
>         Attachments: HDFS-12043.001.patch
[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal updated HDFS-12043:
---------------------------------
    Status: Patch Available  (was: Open)

> Add counters for block re-replication
> -------------------------------------
>
>         Attachments: HDFS-12043.001.patch
[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-12043:
------------------------------
    Status: Patch Available  (was: Open)

> Add counters for block re-replication
> -------------------------------------
>
>         Attachments: HDFS-12043.001.patch
[jira] [Updated] (HDFS-12043) Add counters for block re-replication
[ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-12043:
------------------------------
    Attachment: HDFS-12043.001.patch

Post v001 patch.

> Add counters for block re-replication
> -------------------------------------
>
>         Attachments: HDFS-12043.001.patch