[ 
https://issues.apache.org/jira/browse/HDFS-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720360#comment-13720360
 ] 

Andrew Wang commented on HDFS-5016:
-----------------------------------

bq. Can you explain why it is not a good idea?

I think it'll be confusing when this gets printed in the log. First the 
writer's stack gets logged at WARN. Then, when the exception is caught and 
printed, we'll see the writer's stack again, since it's part of the exception 
message, followed by the waiter's stack. Kinda spew-y, and it makes it look 
like the exception was thrown from the writer, since the writer's stack comes 
first when the exception is printed.
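
To illustrate the ordering problem, here's a minimal standalone sketch. It is 
not taken from the patch: the class and method names (StackInMessageDemo, 
writerWork, waiterWork) and the message text are made up for the example. The 
waiter embeds the writer's stack in the exception message, so when the 
exception is printed the writer's frames show up before the waiter's own "at" 
frames.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.concurrent.CountDownLatch;

public class StackInMessageDemo {

    // Hypothetical stand-in for the thread holding the resource (the "writer").
    static void writerWork(CountDownLatch parked, CountDownLatch release) {
        parked.countDown();              // signal that this thread is inside writerWork
        try {
            release.await();             // stay parked until the demo is done
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Hypothetical stand-in for the thread that gives up waiting (the "waiter").
    // It copies the writer's stack into the exception message.
    static void waiterWork(Thread writer) throws Exception {
        StringBuilder msg =
            new StringBuilder("Timed out waiting for writer. Writer stack:\n");
        for (StackTraceElement frame : writer.getStackTrace()) {
            msg.append("    ").append(frame).append('\n');
        }
        throw new Exception(msg.toString());
    }

    // Runs both roles and returns what printStackTrace() would write to the log.
    static String demo() throws InterruptedException {
        CountDownLatch parked = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread writer = new Thread(() -> writerWork(parked, release), "writer");
        writer.start();
        parked.await();                  // writer is now inside writerWork

        StringWriter out = new StringWriter();
        try {
            waiterWork(writer);
        } catch (Exception e) {
            // The message (with the writer's frames) is printed first,
            // then the "at ..." frames of the thread that actually threw.
            e.printStackTrace(new PrintWriter(out, true));
        } finally {
            release.countDown();
            writer.join();
        }
        return out.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.print(demo());
    }
}
```

In the printed output the writerWork frame appears ahead of the waiterWork 
frame, even though waiterWork is where the exception was thrown. That's the 
part I think will mislead people reading the log.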

FWIW, we saw this triggering daily on a customer cluster. Not that common, but 
not that rare either.

bq. you want some thing like this?

Sure, that works.
                
> Heartbeating thread blocks under some failure conditions leading to loss of 
> datanodes
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-5016
>                 URL: https://issues.apache.org/jira/browse/HDFS-5016
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Suresh Srinivas
>            Priority: Blocker
>             Fix For: 2.1.0-beta
>
>         Attachments: HDFS-5016.1.patch, HDFS-5016.2.patch, HDFS-5016.patch, 
> jstack1.txt
>
>
> In the testing of some failure scenarios for HBase MTTR, we have been 
> simulating node failures via firewalling of nodes (where all communication 
> ports would be firewalled except ssh's port). We have noticed that when a 
> (data)node is firewalled, we lose certain other datanodes - those that were 
> involved in some communication with the firewalled node before the latter was 
> firewalled. Will attach jstack output from one of the lost datanodes. The 
> heartbeating thread seems to be locked up.
> This jira is to track a fix for the problem.
