[ https://issues.apache.org/jira/browse/HDFS-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang resolved HDFS-15719.
------------------------------------
    Fix Version/s: 3.2.3
                   3.1.5
                   3.4.0
                   3.3.1
       Resolution: Fixed

> [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-15719
>                 URL: https://issues.apache.org/jira/browse/HDFS-15719
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In Hadoop 3, we migrated from Jetty 6 to Jetty 9. This was implemented in
> HADOOP-10075.
> However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too
> low.
> We replaced SelectChannelConnector.setLowResourceMaxIdleTime() with
> ServerConnector.setIdleTimeout(), but the two are not equivalent.
> Essentially, before the migration HttpServer2's idle timeout was the default
> set by Jetty 6, which is 200 seconds. Since Hadoop 3, the idle timeout is set
> to 10 seconds, which is unreasonable for the JN. If NameNodes try to download
> a big edit log from the JournalNodes (say a few hundred MB), the download is
> likely to exceed 10 seconds. When that happens, both NNs crash, and there is
> no workaround unless you apply the patch in HADOOP-15696, which adds a config
> switch for the idle timeout. Fortunately, it doesn't happen a lot.
> Proposal: bump the default idle timeout to 200 seconds to match the behavior
> in Jetty 6. (Jetty 9 reduces the default idle timeout to 30 seconds, which is
> not suitable for the JN.)
> Other things to consider:
> 1. fsck servlet? (somehow I suspect this is related to the socket timeout
> reported in HDFS-7175)
> 2. webhdfs, httpfs? --> we've also received reports that webhdfs can time
> out, so having a longer timeout makes sense here.
> 3. kms? Will the longer timeout cause more lingering sockets?
> Thanks [~zhenshan.wen] for the discussion.
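
For readers on a release that carries the HADOOP-15696 config switch but not this fix, the override might look roughly like the fragment below in core-site.xml. This is only a sketch. The property name hadoop.http.idle_timeout.ms is an assumption and should be checked against the core-default.xml shipped with the release in use. The value of 200000 ms simply mirrors the 200-second default proposed in the issue description.

```xml
<!-- Sketch only: the property name is assumed to be the switch added by
     HADOOP-15696; confirm it in core-default.xml before relying on it. -->
<property>
  <name>hadoop.http.idle_timeout.ms</name>
  <!-- 200 seconds, matching the Jetty 6 era default cited in the issue -->
  <value>200000</value>
</property>
```

A longer idle timeout keeps stalled connections open for longer, which is the lingering-socket concern the issue raises for kms, so a separate, shorter value may be preferable there.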