[ 
https://issues.apache.org/jira/browse/HADOOP-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308929#comment-15308929
 ] 

Chris Nauroth commented on HADOOP-13219:
----------------------------------------

Do you happen to know what kind of exception it was that caused the threads to 
crash?

Catching {{Throwable}} can be problematic.  Let's assume it was an 
{{OutOfMemoryError}}.  If there was a failure to allocate memory, and we catch 
the error and proceed, how do we understand what state the process is in 
currently?  What if we made partial updates to in-memory state?  Since 
{{OutOfMemoryError}} can be thrown by nearly anything, we effectively have no 
idea what state we're in at this point.  For the NameNode, the inode tree might 
be in an unusual state that is not reflected back to the persistent store in 
the fsimage or edit log transactions.
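
To make the distinction concrete, here is a minimal sketch.  It is not the 
Hadoop code: {{ReaderLoopSketch}} and {{processAll}} are hypothetical names 
used only for illustration.  The loop survives an ordinary {{Exception}} 
raised while handling a single item, but deliberately does not catch 
{{Throwable}}, so an {{OutOfMemoryError}} still propagates to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only; this is not the Hadoop RPC server code.
final class ReaderLoopSketch {

    /**
     * Runs the handler on every item and returns how many items failed
     * with an ordinary Exception. Errors are intentionally not caught.
     */
    static <T> int processAll(List<T> items, Consumer<T> handler) {
        int failures = 0;
        for (T item : items) {
            try {
                handler.accept(item);
            } catch (Exception e) {
                // The failure is scoped to one item, so it is reasonable
                // to record it and keep the loop alive.
                failures++;
            }
            // There is no catch (Throwable) here: after an Error such as
            // OutOfMemoryError we cannot know what state was half-updated,
            // so we let it propagate instead of continuing.
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> handled = new ArrayList<>();
        int failures = processAll(Arrays.asList("a", "bad", "c"), s -> {
            if (s.equals("bad")) {
                throw new IllegalStateException("simulated per-item failure");
            }
            handled.add(s);
        });
        System.out.println("handled=" + handled + " failures=" + failures);
    }
}
```

Whether a given exception is really scoped to one item is a judgment the 
calling code has to make; the sketch only shows where the line is drawn.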

There is already a catch of {{OutOfMemoryError}} at another layer in the RPC 
client.  It's a bit of code I disagree with.  Some of us choose to run the 
NameNode JVM with {{-XX:OnOutOfMemoryError}} set to a command to 
self-terminate.  That's a choice that favors correctness over robustness.
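
For comparison, the same fail-fast preference can be sketched inside the 
process.  This is a different mechanism from the JVM option above, and the 
class {{FailFastOnOom}} is a hypothetical name, not something in Hadoop.  It 
is an uncaught-exception handler that terminates on {{OutOfMemoryError}}; the 
terminating action is injected so the demo can run without actually halting 
the JVM.

```java
import java.util.function.IntConsumer;

// Hypothetical illustration only; an in-process analogue of fail-fast on OOM.
final class FailFastOnOom implements Thread.UncaughtExceptionHandler {
    private final IntConsumer terminator;

    FailFastOnOom(IntConsumer terminator) {
        this.terminator = terminator;
    }

    /** Variant that really stops the JVM, skipping shutdown hooks. */
    static FailFastOnOom halting() {
        return new FailFastOnOom(status -> Runtime.getRuntime().halt(status));
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        if (e instanceof OutOfMemoryError) {
            // Process state is unknown after a failed allocation, so stop
            // rather than keep serving requests.
            terminator.accept(1);
        } else {
            System.err.println("Thread " + t.getName() + " died: " + e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] status = {-1};
        Thread worker = new Thread(() -> {
            throw new OutOfMemoryError("simulated");
        }, "reader-0");
        // Record the status instead of halting, so the demo can print it.
        worker.setUncaughtExceptionHandler(new FailFastOnOom(s -> status[0] = s));
        worker.start();
        worker.join();
        System.out.println("terminator called with status " + status[0]);
    }
}
```

A handler like this only sees errors that escape a thread uncaught, so it 
does nothing for an {{OutOfMemoryError}} that some layer has already caught 
and swallowed.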

> NameNode Rpc Reader Thread crash, and cluster hang.
> ---------------------------------------------------
>
>                 Key: HADOOP-13219
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13219
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: rpc-server
>    Affects Versions: 2.5.0, 2.6.0, 2.8.0, 2.7.2, 2.6.2, 2.6.4
>            Reporter: ChenFolin
>              Labels: patch
>         Attachments: HADOOP-13219-3.patch, HDFS-10472-2.patch, 
> HDFS-10472.patch
>
>
> My cluster hung yesterday because the RPC server Reader threads crashed. As a 
> result, all RPC requests timed out, including DataNode heartbeats.
> As we can see, the method doRunLoop catches only InterruptedException and 
> IOException:
> while (running) {
>           SelectionKey key = null;
>           try {
>             // consume as many connections as currently queued to avoid
>             // unbridled acceptance of connections that starves the select
>             int size = pendingConnections.size();
>             for (int i=size; i>0; i--) {
>               Connection conn = pendingConnections.take();
>               conn.channel.register(readSelector, SelectionKey.OP_READ, conn);
>             }
>             readSelector.select();
>             Iterator<SelectionKey> iter = readSelector.selectedKeys().iterator();
>             while (iter.hasNext()) {
>               key = iter.next();
>               iter.remove();
>               if (key.isValid()) {
>                 if (key.isReadable()) {
>                   doRead(key);
>                 }
>               }
>               key = null;
>             }
>           } catch (InterruptedException e) {
>             if (running) {                      // unexpected -- log it
>               LOG.info(Thread.currentThread().getName() + " unexpectedly interrupted", e);
>             }
>           } catch (IOException ex) {
>             LOG.error("Error in Reader", ex);
>           } 
>         }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

