[
https://issues.apache.org/jira/browse/HDFS-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250099#comment-17250099
]
Akira Ajisaka commented on HDFS-15719:
--------------------------------------
FYI: In our environment, we set a 60-second idle timeout for HttpFS.
In Hadoop 2.x, HttpFS ran on Tomcat 6:
https://www.slideshare.net/techblogyahoo/hdfs-migration-from-27-to-33-and-enabling-router-based-federation-rbf-in-production-acah2020/31
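As a minimal sketch, here is one way to express that 60-second timeout against hadoop-common's Configuration. The hadoop.http.idle_timeout.ms key is the switch added by HADOOP-15696 (treat the exact key name as an assumption); setting it programmatically is just for illustration, in practice the property goes into httpfs-site.xml:
{code:java}
import org.apache.hadoop.conf.Configuration;

// Minimal sketch: a 60-second idle timeout for HttpFS, using the config
// switch added by HADOOP-15696 (the key name is an assumption here).
public class HttpFsIdleTimeout {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("hadoop.http.idle_timeout.ms", 60_000L); // 60 seconds

    // HttpServer2 would read the value back when wiring its connector;
    // 10_000L stands in for the 10-second default described below.
    long timeoutMs = conf.getLong("hadoop.http.idle_timeout.ms", 10_000L);
    System.out.println("idle timeout (ms): " + timeoutMs);
  }
}
{code}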
> [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket
> timeout
> -------------------------------------------------------------------------------------
>
> Key: HDFS-15719
> URL: https://issues.apache.org/jira/browse/HDFS-15719
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Wei-Chiu Chuang
> Priority: Critical
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> In Hadoop 3 we migrated from Jetty 6 to Jetty 9; the migration was done in
> HADOOP-10075.
> However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too
> low: it replaced SelectChannelConnector.setLowResourceMaxIdleTime() with
> ServerConnector.setIdleTimeout(), but the two methods are not equivalent.
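> A minimal side-by-side sketch of the two APIs (ports and values are
> illustrative, not the actual HttpServer2 wiring):
> {code:java}
> import org.eclipse.jetty.server.Server;
> import org.eclipse.jetty.server.ServerConnector;
>
> public class ConnectorTimeoutSketch {
>   public static void main(String[] args) {
>     // Jetty 6 (Hadoop 2.x): setLowResourceMaxIdleTime() only tightened the
>     // idle timeout when the server ran low on resources; under normal load
>     // connections kept Jetty 6's default idle timeout of 200 seconds.
>     //   SelectChannelConnector c6 = new SelectChannelConnector();
>     //   c6.setLowResourceMaxIdleTime(10_000);
>
>     // Jetty 9 (Hadoop 3.x): setIdleTimeout() applies to every connection
>     // all the time, so the same number now closes any socket that stays
>     // idle that long.
>     Server server = new Server();
>     ServerConnector c9 = new ServerConnector(server);
>     c9.setIdleTimeout(10_000L); // the 10-second value Hadoop 3 ended up with
>     server.addConnector(c9);
>     System.out.println("idle timeout (ms): " + c9.getIdleTimeout());
>   }
> }
> {code}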
> Essentially, HttpServer2's idle timeout used to be Jetty 6's default, which
> is 200 seconds. Since Hadoop 3, the idle timeout is 10 seconds, which is
> unreasonable for JournalNodes: if the NameNodes download a large edit log
> (say a few hundred MB) from the JournalNodes, the transfer is likely to take
> more than 10 seconds. When that happens, both NameNodes crash, and there is
> no workaround unless you apply the patch from HADOOP-15696, which adds a
> config switch for the idle timeout. Fortunately, this does not happen often.
> Proposal: bump the default idle timeout to 200 seconds to match the Jetty 6
> behavior. (Jetty 9 lowers its own default idle timeout to 30 seconds, which
> is still not suitable for JournalNodes.)
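> A sketch of what the proposed default could look like in the connector
> setup; the config key mirrors the HADOOP-15696 switch, and both the key
> name and the wiring are assumptions, not the actual patch:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.eclipse.jetty.server.Server;
> import org.eclipse.jetty.server.ServerConnector;
>
> public class ProposedIdleTimeoutDefault {
>   // 200 seconds, matching the effective Jetty 6 behavior; the key name
>   // follows the HADOOP-15696 switch and is an assumption here.
>   static final String IDLE_TIMEOUT_KEY = "hadoop.http.idle_timeout.ms";
>   static final long PROPOSED_DEFAULT_MS = 200_000L;
>
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     Server server = new Server();
>     ServerConnector connector = new ServerConnector(server);
>     connector.setIdleTimeout(conf.getLong(IDLE_TIMEOUT_KEY, PROPOSED_DEFAULT_MS));
>     server.addConnector(connector);
>     System.out.println("idle timeout (ms): " + connector.getIdleTimeout());
>   }
> }
> {code}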
> Other things to consider:
> 1. The fsck servlet? (I suspect this is related to the socket timeout
> reported in HDFS-7175.)
> 2. WebHDFS and HttpFS? We have also received reports that WebHDFS can time
> out, so a longer timeout makes sense there as well.
> 3. KMS? Will the longer timeout cause more lingering sockets?
> Thanks [~zhenshan.wen] for the discussion.