[jira] [Commented] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive

2016-09-27 Thread Daryn Sharp (JIRA)

[ https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526222#comment-15526222 ]

Daryn Sharp commented on HADOOP-13657:
--------------------------------------

The patch is posted on the linked JIRA.  I think it's dangerous to attempt to 
handle unexpected runtime exceptions because the thread may be left in an 
inconsistent state.  I chose to make them fatal, per the suggestion in the 
description.
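
To illustrate the "make it fatal" approach, here is a minimal sketch, not the 
actual patch. The class FatalReaderSketch, its nested Reader, and the onFatal 
callback are hypothetical names introduced only for this example; the real 
{{Server#Listener#Reader}} differs. The idea is that any unexpected Throwable 
escaping the read loop is escalated rather than swallowed or retried.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

public class FatalReaderSketch {

    /** Hypothetical stand-in for an IPC reader loop; not Hadoop's actual code. */
    static final class Reader extends Thread {
        private final BlockingQueue<Runnable> pending;
        private final Consumer<Throwable> onFatal;

        Reader(BlockingQueue<Runnable> pending, Consumer<Throwable> onFatal) {
            super("sketch-reader");
            this.pending = pending;
            this.onFatal = onFatal;
        }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    pending.take().run();   // process one pending item
                }
            } catch (InterruptedException e) {
                // Expected shutdown path: restore the flag and exit quietly.
                Thread.currentThread().interrupt();
            } catch (Throwable t) {
                // Unexpected failure: the thread's state may be inconsistent,
                // so escalate instead of trying to keep running.
                onFatal.accept(t);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Runnable> pending = new ArrayBlockingQueue<>(100);
        Reader reader = new Reader(pending, t -> {
            System.err.println("FATAL: reader thread died: " + t);
            // Assumption: a real server would terminate the process here so
            // that failover and alerting can take over.
        });
        reader.start();
        pending.put(() -> { throw new NullPointerException("simulated"); });
        reader.join();
    }
}
```

In this sketch the callback only logs; terminating the process is what would 
actually let a standby take over.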

> IPC Reader thread could silently die and leave NameNode unresponsive
> --------------------------------------------------------------------
>
>                 Key: HADOOP-13657
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13657
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: Zhe Zhang
>            Priority: Critical
>
> For each listening port, IPC {{Server#Listener#Reader}} is a single thread in 
> charge of moving {{Connection}} items from {{pendingConnections}} (capacity 
> 100) to the {{callQueue}}.
> We experienced an incident where the {{Reader}} thread for the HDFS NameNode 
> died from a runtime exception. The {{pendingConnections}} queue then became 
> full and the NameNode port became inaccessible.
> In our particular case, what killed {{Reader}} was an NPE caused by 
> https://bugs.openjdk.java.net/browse/JDK-8024883. In general, though, other 
> types of runtime exceptions could cause this issue as well.
> We should add logic to either make the {{Reader}} more robust against 
> runtime exceptions, or at least treat them as FATAL so that the 
> NameNode can fail over to standby and admins are alerted to the real issue.
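
The failure mode described above can be reproduced in miniature. The sketch 
below is an illustration under stated assumptions, not Hadoop code: the class 
SilentReaderDeathSketch and its method are hypothetical, and a plain 
ArrayBlockingQueue of Runnable stands in for {{pendingConnections}}. A single 
consumer thread dies from an unchecked exception, after which producers can 
fill the queue to its capacity of 100 and then make no further progress.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class SilentReaderDeathSketch {
    static final int CAPACITY = 100;   // capacity quoted in the description

    /** Returns how many items were accepted after the reader thread died. */
    static int acceptedAfterReaderDeath() throws InterruptedException {
        BlockingQueue<Runnable> pending = new ArrayBlockingQueue<>(CAPACITY);

        Thread reader = new Thread(() -> {
            try {
                while (true) {
                    // An unchecked exception thrown by run() escapes the loop
                    // and ends the thread; nothing restarts it.
                    pending.take().run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sketch-reader");
        reader.start();

        // Simulate the runtime exception that kills the single reader.
        pending.put(() -> { throw new NullPointerException("simulated"); });
        reader.join();

        // With no consumer left, offers succeed only until the queue is full.
        int accepted = 0;
        while (pending.offer(() -> { }, 10, TimeUnit.MILLISECONDS)) {
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("accepted after reader death: "
            + acceptedAfterReaderDeath());
    }
}
```

The only trace of the death here is the default uncaught-exception stack trace 
on stderr; the process itself keeps running, which is why the port appears up 
but unresponsive.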



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive

2016-09-26 Thread Zhe Zhang (JIRA)

[ https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524152#comment-15524152 ]

Zhe Zhang commented on HADOOP-13657:


Thanks [~kihwal]. Linking the issue for now. I think {{Reader}} died for 
different reasons in these two issues, but the solution may be similar. I 
don't have a patch either.




[jira] [Commented] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive

2016-09-26 Thread Kihwal Lee (JIRA)

[ https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524147#comment-15524147 ]

Kihwal Lee commented on HADOOP-13657:
-------------------------------------

We reported a similar issue earlier in HADOOP-11780.  It looks like the 
patch hasn't been posted yet. [~daryn] says he will post it soon.
