[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
[ https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-9826:
------------------------------
    Resolution: Not A Problem
        Status: Resolved  (was: Patch Available)

Resolving per above, feel free to reopen if you'd like to resume this work.

> Erasure Coding: Postpone the recovery work for a configurable time period
> -------------------------------------------------------------------------
>
>                 Key: HDFS-9826
>                 URL: https://issues.apache.org/jira/browse/HDFS-9826
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Li Bo
>            Assignee: Li Bo
>         Attachments: HDFS-9826-001.patch, HDFS-9826-002.patch
>
>
> Currently the NameNode prepares recovery work as soon as it finds an
> under-replicated block group. This is inefficient and takes resources away
> from other operations. It would be better to postpone the recovery work for
> a period of time if only one internal block is corrupted, considering points
> shown by papers such as [1][2]:
> 1. Transient errors in which no data are lost account for more than 90% of
> data center failures, owing to network partitions, software problems, or
> non-disk hardware faults.
> 2. Although erasure codes tolerate multiple simultaneous failures, single
> failures represent 99.75% of recoveries.
> Clusters differ, so we should allow users to configure how long recoveries
> are postponed. A properly chosen value will avoid a large proportion of
> unnecessary recoveries. When finding multiple internal blocks corrupted in a
> block group, we prepare the recovery work immediately because this case is
> very rare and we don’t want to increase the risk of losing data.
>
> [1] Availability in globally distributed storage systems
> http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf
> [2] Rethinking erasure codes for cloud file systems: minimizing I/O for
> recovery and degraded reads
> http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
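
The rule proposed in the description (wait when a single internal block is affected, act immediately when several are) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the attached patches: the class name RecoveryPostponePolicy, the method shouldRecoverNow, and the idea of passing timestamps in explicitly are all hypothetical and do not correspond to any existing HDFS class or configuration key.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of the postponement rule described in HDFS-9826.
 * None of these names come from HDFS or from the attached patches.
 */
final class RecoveryPostponePolicy {
    /** Configurable delay applied to single-block failures, in milliseconds. */
    private final long postponeMillis;

    RecoveryPostponePolicy(long postponeDuration, TimeUnit unit) {
        if (postponeDuration < 0) {
            throw new IllegalArgumentException("postpone duration must be >= 0");
        }
        this.postponeMillis = unit.toMillis(postponeDuration);
    }

    /**
     * Decides whether recovery of a block group should be prepared now.
     *
     * @param affectedInternalBlocks internal blocks of the group currently
     *                               missing or corrupted
     * @param firstDetectedMillis    time the problem was first observed
     * @param nowMillis              current time, same clock as above
     */
    boolean shouldRecoverNow(int affectedInternalBlocks,
                             long firstDetectedMillis,
                             long nowMillis) {
        if (affectedInternalBlocks <= 0) {
            return false; // nothing to recover
        }
        if (affectedInternalBlocks > 1) {
            return true;  // rare multi-block case: do not add risk by waiting
        }
        // Single affected block: likely transient, so wait out the delay.
        return nowMillis - firstDetectedMillis >= postponeMillis;
    }
}
```

With a delay of 10 minutes, for example, a group with one affected internal block is left alone until 600,000 ms have passed since detection, while a group with two affected blocks is recovered at once. Setting the delay to zero reproduces the current behaviour of recovering immediately.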
[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
[ https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-9826:
------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
[ https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-9826:
------------------------
    Attachment: HDFS-9826-002.patch
[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
[ https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-9826:
------------------------
    Attachment: HDFS-9826-001.patch

Uploaded an initial patch without tests.
[jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
[ https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-9826:
------------------------
    Description:
[...] When finding multiple internal blocks corrupted in a block group, we prepare the recovery work immediately because it’s very rare and we don’t want to increase the risk of losing data. [...]

  was:
[...] When finding multiple internal blocks corrupted in a block group, we do the recovery work immediately because it’s very rare and we don’t want to increase the risk of losing data. [...]