[jira] [Updated] (HDFS-12136) BlockSender performance regression due to volume scanner edge case
[ https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-12136:
-----------------------------------
    Resolution: Won't Fix
        Status: Resolved  (was: Patch Available)

Resolved by HDFS-11187

> BlockSender performance regression due to volume scanner edge case
> ------------------------------------------------------------------
>
>                 Key: HDFS-12136
>                 URL: https://issues.apache.org/jira/browse/HDFS-12136
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan
> by reading the last checksum of finalized blocks within the {{BlockSender}}
> ctor.  Unfortunately, it holds the exclusive dataset lock while opening and
> reading the metafile multiple times, so block sender instantiation becomes
> serialized.  Performance completely collapses under heavy disk i/o
> utilization or high xceiver activity, e.g. lost node replication, balancing,
> or decommissioning.  The xceiver threads congest creating block senders and
> impair the heartbeat processing that is contending for the same lock.
> Combined with other lock contention issues, pipelines break and nodes
> sporadically go dead.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12136) BlockSender performance regression due to volume scanner edge case
Junping Du updated HDFS-12136:
------------------------------
    Target Version/s: 2.8.6  (was: 2.8.5)
[jira] [Updated] (HDFS-12136) BlockSender performance regression due to volume scanner edge case
Junping Du updated HDFS-12136:
------------------------------
    Target Version/s: 2.8.5  (was: 2.8.4)
[jira] [Updated] (HDFS-12136) BlockSender performance regression due to volume scanner edge case
Junping Du updated HDFS-12136:
------------------------------
    Target Version/s: 2.8.4  (was: 2.8.3)
[jira] [Updated] (HDFS-12136) BlockSender performance regression due to volume scanner edge case
Brahma Reddy Battula updated HDFS-12136:
----------------------------------------
    Target Version/s: 2.8.3  (was: 2.8.2)
[jira] [Updated] (HDFS-12136) BlockSender performance regression due to volume scanner edge case
Daryn Sharp updated HDFS-12136:
-------------------------------
    Attachment: HDFS-12136.trunk.patch
                HDFS-12136.branch-2.patch

Normal serving has been greatly impaired.  Appending to a block while it's being scanned is exceedingly rare compared to the normal block sending rate, yet the fix impacted all serving.  There's a better way, accomplished via:
* Entirely remove (revert) fetching of checksums for finalized blocks in the {{BlockSender}} ctor.  This reduces lock hold time by eliminating i/o under the dataset lock.
* If a checksum exception occurs during the scan, and the genstamp changed, mark the block as suspect for rescan.  This is the edge case.
* The recent-suspect check now considers genstamps.  Suspect blocks with a newer genstamp than the last one recorded are not skipped.
* Recent suspects expire 10 min after being added to the cache.  Prior behavior was 10 min after last access, which could lead to indefinite postponement.

No test changes needed.  {{TestBlockScanner#testAppendWhileScanning}} proves this approach continues to work.

The only difference between trunk and branch-2 is context and a few log lines in code copied into a {{getStoredBlock}} method.
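For readers following the recent-suspect changes in the comment above, here is a minimal sketch of the two rules it describes: a newer genstamp overrides the "recently suspected" skip, and entries expire a fixed 10 minutes after insertion rather than after last access. This is not the code from the attached patches. The class name {{SuspectBlockCache}}, the method {{markSuspect}}, and the explicit {{nowMs}} clock parameter are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only; the names here are hypothetical and this is
 * not the implementation in the HDFS-12136 patches.
 */
public class SuspectBlockCache {
  /** Entries expire this long after insertion (expire-after-write). */
  private static final long TTL_MS = 10 * 60 * 1000L;

  /** Genstamp and insertion time recorded for one suspect block. */
  private static final class Entry {
    final long genStamp;
    final long addedAtMs;

    Entry(long genStamp, long addedAtMs) {
      this.genStamp = genStamp;
      this.addedAtMs = addedAtMs;
    }
  }

  private final Map<Long, Entry> recent = new HashMap<>();

  /**
   * Returns true if the block should be queued for rescan, or false if it
   * is skipped because the same (or a newer) genstamp was recorded within
   * the last 10 minutes.
   */
  public synchronized boolean markSuspect(long blockId, long genStamp,
      long nowMs) {
    Entry e = recent.get(blockId);
    // Expiry is measured from insertion time, so repeated lookups cannot
    // keep an entry alive and postpone the rescan indefinitely.
    if (e != null && nowMs - e.addedAtMs >= TTL_MS) {
      recent.remove(blockId);
      e = null;
    }
    // Same or older genstamp within the window: skip. The lookup does not
    // refresh addedAtMs.
    if (e != null && genStamp <= e.genStamp) {
      return false;
    }
    // First sighting, expired entry, or newer genstamp: record and rescan.
    recent.put(blockId, new Entry(genStamp, nowMs));
    return true;
  }
}
```

With expire-after-access semantics, a block that keeps getting reported would refresh its own entry on every lookup and might never be rescanned. Measuring the window from insertion bounds the delay at 10 minutes.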
[jira] [Updated] (HDFS-12136) BlockSender performance regression due to volume scanner edge case
Daryn Sharp updated HDFS-12136:
-------------------------------
    Status: Patch Available  (was: Open)