[
https://issues.apache.org/jira/browse/HADOOP-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308929#comment-15308929
]
Chris Nauroth commented on HADOOP-13219:
----------------------------------------
Do you happen to know what kind of exception it was that caused the threads to
crash?
Catching {{Throwable}} can be problematic. Let's assume it was an
{{OutOfMemoryError}}. If there was a failure to allocate memory, and we catch
the error and proceed, how do we understand what state the process is in
currently? What if we made partial updates to in-memory state? Since
{{OutOfMemoryError}} can be thrown by nearly anything, we effectively have no
idea what state we're in at this point. For the NameNode, the inode tree might
be in an inconsistent state that is not reflected in the persistent store,
that is, in the fsimage or edit log transactions.
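To make the partial-update concern concrete, here is a minimal sketch. It is not Hadoop code: the class {{PartialUpdateDemo}}, its fields, and the hand-thrown {{OutOfMemoryError}} are all hypothetical stand-ins for in-memory state and for an allocation failure that happens between two steps of one logical update.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for in-memory state; not Hadoop code.
class PartialUpdateDemo {
    // Invariant the rest of the program relies on: names.size() == count
    static final List<String> names = new ArrayList<>();
    static int count = 0;

    // A two-step update that can fail between its steps.
    static void addEntry(String name, boolean failMidway) {
        names.add(name);                 // step 1 succeeds
        if (failMidway) {
            // Thrown by hand to simulate an allocation failure at this point.
            throw new OutOfMemoryError("simulated");
        }
        count++;                         // step 2 never runs on failure
    }

    static boolean invariantHolds() {
        return names.size() == count;
    }

    public static void main(String[] args) {
        try {
            addEntry("a", false);
            addEntry("b", true);
        } catch (Throwable t) {
            // Swallowing the error keeps the thread alive,
            // but the state is now inconsistent.
            System.out.println("caught: " + t);
        }
        System.out.println("invariant holds: " + invariantHolds());
    }
}
```

After the catch block the thread keeps running, but {{invariantHolds()}} returns false: the list has two entries and the counter says one. Nothing in the handler can tell which step was interrupted.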
There is already a catch of {{OutOfMemoryError}} at another layer in the RPC
client. It's a bit of code I disagree with. Some of us choose to run the
NameNode JVM with {{-XX:OnOutOfMemoryError}} set to a command to
self-terminate. That's a choice that favors correctness over robustness.
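As an illustration of that trade-off, and not a description of any attached patch, the sketch below shows a loop that recovers from ordinary exceptions but lets any {{Error}} propagate, so that a JVM-level failure still ends the thread. The class {{FailFastLoopDemo}} and its {{runAll}} method are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the names here do not come from Hadoop.
class FailFastLoopDemo {
    // Runs each task in order. A RuntimeException is recorded and the loop
    // continues; an Error is deliberately not caught and reaches the caller.
    static List<String> runAll(List<Runnable> tasks) {
        List<String> log = new ArrayList<>();
        for (Runnable task : tasks) {
            try {
                task.run();
                log.add("ok");
            } catch (RuntimeException e) {
                log.add("recovered: " + e.getMessage());
            }
        }
        return log;
    }
}
```

With this shape, a malformed request that raises a {{RuntimeException}} does not stop the loop, while an {{OutOfMemoryError}} still terminates it instead of being absorbed.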
> NameNode Rpc Reader Thread crash, and cluster hang.
> ---------------------------------------------------
>
> Key: HADOOP-13219
> URL: https://issues.apache.org/jira/browse/HADOOP-13219
> Project: Hadoop Common
> Issue Type: Bug
> Components: rpc-server
> Affects Versions: 2.5.0, 2.6.0, 2.8.0, 2.7.2, 2.6.2, 2.6.4
> Reporter: ChenFolin
> Labels: patch
> Attachments: HADOOP-13219-3.patch, HDFS-10472-2.patch,
> HDFS-10472.patch
>
>
> My cluster hung yesterday because the RPC server Reader threads crashed.
> As a result, all RPC requests timed out, including DataNode heartbeats.
> We can see that the method doRunLoop catches only InterruptedException and
> IOException:
> while (running) {
>   SelectionKey key = null;
>   try {
>     // consume as many connections as currently queued to avoid
>     // unbridled acceptance of connections that starves the select
>     int size = pendingConnections.size();
>     for (int i=size; i>0; i--) {
>       Connection conn = pendingConnections.take();
>       conn.channel.register(readSelector, SelectionKey.OP_READ, conn);
>     }
>     readSelector.select();
>     Iterator<SelectionKey> iter =
>         readSelector.selectedKeys().iterator();
>     while (iter.hasNext()) {
>       key = iter.next();
>       iter.remove();
>       if (key.isValid()) {
>         if (key.isReadable()) {
>           doRead(key);
>         }
>       }
>       key = null;
>     }
>   } catch (InterruptedException e) {
>     if (running) { // unexpected -- log it
>       LOG.info(Thread.currentThread().getName()
>           + " unexpectedly interrupted", e);
>     }
>   } catch (IOException ex) {
>     LOG.error("Error in Reader", ex);
>   }
> }