[
https://issues.apache.org/jira/browse/HDFS-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang resolved HDFS-15719.
------------------------------------
Fix Version/s: 3.2.3, 3.1.5, 3.4.0, 3.3.1
Resolution: Fixed
> [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket
> timeout
> -------------------------------------------------------------------------------------
>
> Key: HDFS-15719
> URL: https://issues.apache.org/jira/browse/HDFS-15719
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> In Hadoop 3, we migrated from Jetty 6 to Jetty 9. This was implemented in
> HADOOP-10075.
> However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too
> low.
> We replaced SelectChannelConnector.setLowResourceMaxIdleTime() with
> ServerConnector.setIdleTimeout() but they aren't the same.
> Essentially, the HttpServer2 idle timeout used to be the Jetty 6 default,
> which is 200 seconds. Since Hadoop 3, the idle timeout is set to 10
> seconds, which is unreasonable for JN. If NameNodes try to download a big
> edit log from JournalNodes (say a few hundred MB), the transfer is likely to
> exceed 10 seconds. When that happens, both NNs crash, and there is no
> workaround unless you apply the patch in HADOOP-15696, which adds a config
> switch for the idle timeout. Fortunately, it doesn't happen often.
> Proposal: bump the default idle timeout to 200 seconds to match the behavior
> in Jetty 6. (Jetty 9 reduces the default idle timeout to 30 seconds, which is
> not suitable for JN.)
> Other things to consider:
> 1. fsck servlet? (somehow I suspect this is related to the socket timeout
> reported in HDFS-7175)
> 2. webhdfs, httpfs? --> we've also received reports that webhdfs can time out,
> so having a longer timeout makes sense here.
> 3. kms? Will the longer timeout cause more lingering sockets?
> Thanks [~zhenshan.wen] for the discussion.
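
The config switch mentioned in the description (added by HADOOP-15696) could serve as a stopgap on releases that carry that patch. The fragment below is a minimal sketch, assuming the property is named hadoop.http.idle_timeout.ms and takes a value in milliseconds. The property name is not stated in this issue, so check the core-default.xml shipped with your release before relying on it.

```xml
<!-- core-site.xml (sketch). The property name is an assumption, not taken
     from this issue; verify it against your release's core-default.xml. -->
<property>
  <name>hadoop.http.idle_timeout.ms</name>
  <!-- 200000 ms = 200 seconds, the value proposed in the description -->
  <value>200000</value>
</property>
```

The value mirrors the 200-second proposal above; the default that actually ships in a fixed release may differ.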