[ 
https://issues.apache.org/jira/browse/HDFS-15719?focusedWorklogId=525539&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-525539
 ]

ASF GitHub Bot logged work on HDFS-15719:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Dec/20 12:59
            Start Date: 17/Dec/20 12:59
    Worklog Time Spent: 10m 
      Work Description: sodonnel commented on pull request #2533:
URL: https://github.com/apache/hadoop/pull/2533#issuecomment-747424463


   This change LGTM - +1


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 525539)
    Time Spent: 1h  (was: 50m)

> [Hadoop 3] Both NameNodes can crash simultaneously due to the short JN socket 
> timeout
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-15719
>                 URL: https://issues.apache.org/jira/browse/HDFS-15719
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> After Hadoop 3, we migrated Jetty 6 to Jetty 9. It was implemented in 
> HADOOP-10075.
> However, HADOOP-10075 erroneously set the HttpServer2 socket idle timeout too 
> low.
> We replaced SelectChannelConnector.setLowResourceMaxIdleTime() with 
> ServerConnector.setIdleTimeout() but they aren't the same.
> Essentially, the HttpServer2's idle timeout was the default timeout set by 
> Jetty 6, which is 200 seconds. After Hadoop 3, the idle timeout is set to 10 
> seconds, which is unreasonable for JN. If NameNodes try to download a big 
> edit log from JournalNodes (say a few hundred MB), it is likely to exceed 10 
> seconds. When it happens, both NN crashes and there's no way to workaround 
> unless you apply the patch in HADOOP-15696 to add a config switch for the 
> idle timeout. Fortunately, it doesn't happen a lot.
> Propose: bump the idle timeout default to 200 seconds to match the behavior 
> in Jetty 6. (Jetty 9 reduces the default idle timeout to 30 seconds, which is 
> not suitable for JN)
> Other things to consider:
> 1. fsck serverlet? (somehow I suspect this is related to the socket timeout 
> reported in HDFS-7175)
> 2. webhdfs, httpfs? --> we've also received reports that webhdfs can timeout. 
> so having a longer timeout makes sense here.
> 2. kms? will the longer timeout cause more lingering sockets?
> Thanks [~zhenshan.wen] for the discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to