[jira] [Work started] (HDFS-15719) [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket timeout

Wei-Chiu Chuang (Jira) Mon, 04 Jan 2021 21:28:06 -0800


     [ 
https://issues.apache.org/jira/browse/HDFS-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Work on HDFS-15719 started by Wei-Chiu Chuang.
----------------------------------------------
> [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket 
> timeout
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-15719
>                 URL: https://issues.apache.org/jira/browse/HDFS-15719
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In Hadoop 3, we migrated from Jetty 6 to Jetty 9. The migration was 
> implemented in HADOOP-10075.
> However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too 
> low.
> We replaced SelectChannelConnector.setLowResourceMaxIdleTime() with 
> ServerConnector.setIdleTimeout(), but the two are not equivalent.
> Essentially, before the migration the HttpServer2 idle timeout was the 
> Jetty 6 default, which is 200 seconds. In Hadoop 3, the idle timeout is set 
> to 10 seconds, which is unreasonable for the JournalNode (JN). If NameNodes 
> try to download a big edit log from JournalNodes (say a few hundred MB), the 
> download is likely to exceed 10 seconds. When that happens, both NameNodes 
> crash, and there is no way to work around it unless you apply the patch in 
> HADOOP-15696, which adds a config switch for the idle timeout. Fortunately, 
> it doesn't happen often.
> Proposal: bump the default idle timeout to 200 seconds to match the behavior 
> in Jetty 6. (Jetty 9 reduces the default idle timeout to 30 seconds, which is 
> not suitable for JN.)
> Other things to consider:
> 1. fsck servlet? (I suspect this is related to the socket timeout 
> reported in HDFS-7175.)
> 2. webhdfs, httpfs? --> We've also received reports that webhdfs can time 
> out, so having a longer timeout makes sense here.
> 3. kms? Will the longer timeout cause more lingering sockets?
> Thanks [~zhenshan.wen] for the discussion.
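
As a rough illustration of the arithmetic in the description, the sketch below estimates how long an edit log download would take and compares it with the three timeouts mentioned (10, 30, and 200 seconds). The class name, method names, the 300 MB size, and the 20 MB/s throughput are all hypothetical, chosen only for illustration. It follows the description's framing, in which a download that runs longer than the idle timeout is at risk; it does not model how Jetty actually detects an idle connection.

```java
public class IdleTimeoutEstimate {
    // Timeouts named in the issue description, in seconds.
    public static final long HADOOP3_IDLE_TIMEOUT_S = 10;
    public static final long JETTY9_DEFAULT_IDLE_TIMEOUT_S = 30;
    public static final long JETTY6_DEFAULT_IDLE_TIMEOUT_S = 200;

    /** Seconds needed to move sizeMb megabytes at throughputMbPerSec. */
    public static double transferSeconds(double sizeMb, double throughputMbPerSec) {
        if (throughputMbPerSec <= 0) {
            throw new IllegalArgumentException("throughput must be positive");
        }
        return sizeMb / throughputMbPerSec;
    }

    /** True if the estimated transfer time is longer than the timeout. */
    public static boolean exceeds(double sizeMb, double throughputMbPerSec,
                                  long timeoutSeconds) {
        return transferSeconds(sizeMb, throughputMbPerSec) > timeoutSeconds;
    }

    public static void main(String[] args) {
        double sizeMb = 300.0;            // "a few hundred MB" (illustrative value)
        double throughputMbPerSec = 20.0; // hypothetical effective throughput

        long[] timeouts = {
            HADOOP3_IDLE_TIMEOUT_S,
            JETTY9_DEFAULT_IDLE_TIMEOUT_S,
            JETTY6_DEFAULT_IDLE_TIMEOUT_S
        };
        System.out.printf("estimated transfer: %.1f s%n",
                transferSeconds(sizeMb, throughputMbPerSec));
        for (long t : timeouts) {
            System.out.printf("timeout %d s exceeded: %b%n",
                    t, exceeds(sizeMb, throughputMbPerSec, t));
        }
    }
}
```

With these assumed numbers the estimated transfer takes longer than the 10-second timeout but stays under both the 30-second and 200-second values; a slower link or a larger edit log would shift that balance.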



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
