[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-09-08 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607968#comment-16607968
 ] 

Hadoop QA commented on HDFS-12136:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  5s{color} 
| {color:red} HDFS-12136 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-12136 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12877142/HDFS-12136.trunk.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/25007/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-05-07 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466776#comment-16466776
 ] 

Junping Du commented on HDFS-12136:
---

looks like HDFS-11187 may not cover all fix here. [~jojochuang], do you have 
further comments here?
Move it to 2.8.5 as we need more discussion and 2.8.4 is in RC stage.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-04-30 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459058#comment-16459058
 ] 

Daryn Sharp commented on HDFS-12136:


Did a cursory scan of HDFS-11187 and I'm not sure if it supersedes this one.  
We're still running with my patch internally.  In addition to eliminating i/o 
in the lock, this patch tweaked the volume scanner to intelligently handle 
checksum errors due to genstamp updates.  I didn't look that hard into the 
other patch.  Is that scenario an impossible condition now?  

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-04-30 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459023#comment-16459023
 ] 

Daryn Sharp commented on HDFS-12136:


I'll study other Jira to see if it's a dup.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-04-28 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457443#comment-16457443
 ] 

genericqa commented on HDFS-12136:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  6s{color} 
| {color:red} HDFS-12136 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-12136 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12877142/HDFS-12136.trunk.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/24104/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-04-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457439#comment-16457439
 ] 

Junping Du commented on HDFS-12136:
---

Thanks [~jojochuang] for comments. I agree with you that previous lock issue 
get resolved in HDFS-11187. [~daryn], I will go ahead to resolve this as 
duplicated of HDFS-11187 if you also agree.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-04-13 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437193#comment-16437193
 ] 

Wei-Chiu Chuang commented on HDFS-12136:


I believe this issue is no more after HDFS-11187 (fixed in 2.8.4)

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-04-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437109#comment-16437109
 ] 

Junping Du commented on HDFS-12136:
---

Hi [~daryn], I saw this issue has been quiet for a while. Can we move to next 
release given 2.8.4 is on releasing track?

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-02-06 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354973#comment-16354973
 ] 

genericqa commented on HDFS-12136:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  5s{color} 
| {color:red} HDFS-12136 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-12136 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12877142/HDFS-12136.trunk.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/22971/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-08-29 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145433#comment-16145433
 ] 

Brahma Reddy Battula commented on HDFS-12136:
-

bq.so we could target 2.8.3 for the fix.
updated as target version.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-08-25 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142118#comment-16142118
 ] 

Kihwal Lee commented on HDFS-12136:
---

I think the performance impact is less severe after HDFS-12157, so we could 
target 2.8.3 for the fix. 

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-07-22 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097190#comment-16097190
 ] 

Wei-Chiu Chuang commented on HDFS-12136:


HDFS-6804 is another facet of the same race condition.

[~brahmareddy] made a unit test that reliably reproduces the bug if HDFS-11160 
is reverted.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-07-21 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096405#comment-16096405
 ] 

Daryn Sharp commented on HDFS-12136:


[~jojochuang], do you have a unit test that can reliably expose the issue?  We 
run lots of Spark apps and didn't see the problem in 2.7, or maybe it happens 
so infrequently that nobody notices.

The basic problem is that doing IO in the lock is completely unacceptable.  
Basic things like a failing drive, a heavily utilized drive, high replication 
from decommissioning or a lost node, will jam up the DN.  Just one slow drive 
will ruin the DN.

Here's how that becomes catastrophic:  Under sufficiently high load, DNs 
congest with BlockSenders serially computing checksums.  Start decomming, 
heartbeat thread receives replication commands.  Instantiating BlockSenders 
contend with the backlogged xceiver threads.  Heartbeats are now delayed.  
Meanwhile, clients timeout and pipelines collapse after 45s.  Clients 
reconstruct pipeline but the prior xceivers may are blocked creating the 
BlockSender – not knowing the client disconnected.  Blocked threads go up, 
eventually hitting the xceiver thread limit (4k for us).  Surprisingly the 
heartbeat thread can eventually receive such little share of the lock that 
10mins elapse and the node goes "dead".  Now the replication monitor issues 
even more replications, causing even more congestion in other nodes.  In a real 
incident, we had ~6 random nodes flapping for hours causing sporadic missing 
blocks and jobs were taking forever.

Kihwal and I think we might know a few approaches to further fix the original 
issue.  However at this time I'd argue that undoing the cluster damaging 
performance is more important than completely fixing the reader/append race.



> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-07-18 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091766#comment-16091766
 ] 

Wei-Chiu Chuang commented on HDFS-12136:


Hi [~daryn] sorry come to this late. Thanks for the patch and thanks [~kihwal] 
for the stack trace. No doubt this is a serious performance regression.

I want to emphasis that the initial (false positive) corruption is due to a 
race condition between concurrent readers and writers. While HDFS-11160 made it 
seems like it only happen to VolumeScanner, the thing is that this can happen 
to any readers. When reader thinks it gets a checksum corruption, it reports to 
NN, which removes the block replica. This happens very frequent for a Spark 
Streaming application, and data is being read in real-time while data is being 
ingested.

If you want to go this route, please add the same check at dfsclient reader 
side. For example, when it receives a checksum error, read again to weed out 
false false positive caused by race condition.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-07-14 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088124#comment-16088124
 ] 

Kihwal Lee commented on HDFS-12136:
---

[~jojochuang], we started seeing significant performance regression after 
increased I/O activities. Jstacking has revealed that DataXceiver threads are 
all waiting for the dataset impl lock. When the I/O load is reasonable, this 
might not be visible.

{noformat}
"org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer@61a9d939" #351184
 daemon prio=5 os_prio=0 tid=0x7f94ddf0a000 nid=0xafef waiting on condition 
[0x7f94c1d4f000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xd55efd28> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at 
org.apache.hadoop.hdfs.InstrumentedLock.lock(InstrumentedLock.java:102)
at 
org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.acquireDatasetLock(FsDatasetImpl.java:3274)
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:252)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2348)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- None
{noformat}
{noformat}
"DataXceiver for client DFSClient_xxx [Sending block xxx]" #351183
 daemon prio=5 os_prio=0 tid=0x0409b000 nid=0xafee waiting on condition 
[0x7f94c9f49000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xd55efd28> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at 
org.apache.hadoop.hdfs.InstrumentedLock.lock(InstrumentedLock.java:102)
at 
org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.acquireDatasetLock(FsDatasetImpl.java:3274)
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:252)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:580)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- None
{noformat}

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest 

[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087778#comment-16087778
 ] 

Hadoop QA commented on HDFS-12136:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
39s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant 
Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 46s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
20s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 93m 15s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure010 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12136 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12877142/HDFS-12136.trunk.patch
 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 2376c0df5944 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 
12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 75c0220 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| findbugs | 
https://builds.apache.org/job/PreCommit-HDFS-Build/20277/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html
 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/20277/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/20277/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/20277/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> BlockSender performance regression due to volume scanner 

[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-07-14 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087608#comment-16087608
 ] 

Daryn Sharp commented on HDFS-12136:


The failed tests either aren't failing for me, or are failing with or w/o this 
patch.  The volume failure tests are very flaky.  Apparently races let volume 
references leak so the cluster can't shutdown and timeouts occur.  Will kick 
the build again to cross-compare test failures.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-07-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086435#comment-16086435
 ] 

Hadoop QA commented on HDFS-12136:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
39s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant 
Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 82m 31s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}108m 39s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency |
|   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
|   | hadoop.hdfs.TestDFSStripedInputStreamWithRandomECPolicy |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailureWithRandomECPolicy |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
| Timed out junit tests | 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockStatsMXBean |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12136 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12877142/HDFS-12136.trunk.patch
 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux f92003b5745c 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 
14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 945c095 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| findbugs | 
https://builds.apache.org/job/PreCommit-HDFS-Build/20263/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html
 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/20263/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 

[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-07-13 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086236#comment-16086236
 ] 

Wei-Chiu Chuang commented on HDFS-12136:


For more context, there is a Spark Streaming use case that would see very 
frequent HDFS file corruption without HDFS-11160 (several corrupt files in a 
day). So I would hope you are not considering reverting HDFS-11160.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2017-07-13 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086233#comment-16086233
 ] 

Wei-Chiu Chuang commented on HDFS-12136:


Hi [~daryn] thanks for reporting this.

I am aware that the fix HDFS-11160 could potential have the problem that you 
described. But choosing between correctness and performance, I would choose the 
former. That said,  HDFS-11187 is a possible optimization, perhaps you can 
consider it? I did not push HDFS-11187 forward, because HDFS-11160 has been 
running at our end internally in the past few months,  has been distributed in 
a number of CDH releases, and we've not yet received reports regarding 
performance anomaly.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org