[jira] [Commented] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive
[ https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526222#comment-15526222 ]

Daryn Sharp commented on HADOOP-13657:
--------------------------------------

The patch is posted on the linked JIRA. I think it's dangerous to attempt to handle unexpected runtime exceptions, because the thread may be left in an inconsistent state. I chose to make it fatal, per the suggestion in the description.

> IPC Reader thread could silently die and leave NameNode unresponsive
> --------------------------------------------------------------------
>
>                 Key: HADOOP-13657
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13657
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: Zhe Zhang
>            Priority: Critical
>
> For each listening port, IPC {{Server#Listener#Reader}} is a single thread in
> charge of moving {{Connection}} items from {{pendingConnections}} (capacity
> 100) to the {{callQueue}}.
> We have experienced an incident where the {{Reader}} thread for the HDFS NameNode
> died from a runtime exception. The {{pendingConnections}} queue then became
> full and the NameNode port became inaccessible.
> In our particular case, what killed the {{Reader}} was an NPE caused by
> https://bugs.openjdk.java.net/browse/JDK-8024883. In general, though, other types
> of runtime exceptions could cause this issue as well.
> We should add logic to either make the {{Reader}} more robust against
> runtime exceptions, or at least treat such an exception as FATAL so that
> the NameNode can fail over to the standby and admins get alerted to the real issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
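The "make it fatal" option discussed in this message can be illustrated with a small, self-contained Java sketch. This is not the posted patch and not Hadoop's {{Server}} code: the class {{FatalReaderSketch}}, its nested {{Reader}}, the {{onFatal}} hook, and {{runWithFailingItem}} are hypothetical names chosen for illustration. The sketch assumes a single thread draining a bounded queue of capacity 100, and it only shows the idea that an unexpected {{Throwable}} is reported to a fatal hook rather than letting the thread die silently. In a real server the hook would presumably log the error and terminate the process so that failover can happen; here it just records the exception so the behaviour can be observed.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

public class FatalReaderSketch {

  /** Hypothetical stand-in for a reader thread that drains a bounded queue. */
  static final class Reader extends Thread {
    private final BlockingQueue<Runnable> pending;
    private final Consumer<Throwable> onFatal; // assumption: injected fatal hook

    Reader(BlockingQueue<Runnable> pending, Consumer<Throwable> onFatal) {
      super("reader-sketch");
      this.pending = pending;
      this.onFatal = onFatal;
      setDaemon(true);
    }

    @Override
    public void run() {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          // Blocks until an item is available, then processes it.
          pending.take().run();
        }
      } catch (InterruptedException e) {
        // Treated as a normal shutdown request; restore the interrupt flag.
        Thread.currentThread().interrupt();
      } catch (Throwable t) {
        // No recovery attempt: the thread's state may be inconsistent,
        // so hand the error to the fatal hook instead of dying silently.
        onFatal.accept(t);
      }
    }
  }

  /** Feeds one failing item to a Reader and returns what the fatal hook saw. */
  public static Throwable runWithFailingItem() throws InterruptedException {
    BlockingQueue<Runnable> pending = new ArrayBlockingQueue<>(100);
    AtomicReference<Throwable> seen = new AtomicReference<>();
    CountDownLatch done = new CountDownLatch(1);

    Reader reader = new Reader(pending, t -> {
      seen.set(t);
      done.countDown();
    });
    reader.start();

    // Simulate the kind of unexpected runtime exception described in the issue.
    pending.put(() -> {
      throw new NullPointerException("simulated");
    });

    done.await(5, TimeUnit.SECONDS);
    reader.join(5000);
    return seen.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("fatal hook saw: " + runWithFailingItem());
  }
}
```

Injecting the hook as a {{Consumer<Throwable>}} is a choice made only to keep the sketch testable; without the catch-all branch, the same exception would end the thread while the queue kept filling, which is the failure mode the issue describes.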
[jira] [Commented] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive
[ https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524152#comment-15524152 ]

Zhe Zhang commented on HADOOP-13657:
------------------------------------

Thanks [~kihwal]. Linking the issue for now. I think in these 2 issues {{Reader}} died for different reasons, but maybe the solution is similar. I don't have a patch either.
[jira] [Commented] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive
[ https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524147#comment-15524147 ]

Kihwal Lee commented on HADOOP-13657:
-------------------------------------

We had reported a similar issue before in HADOOP-11780. It looks like the patch hasn't been posted. [~daryn] says he will post it soon.