[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230601#comment-14230601 ]
Andrew Wang commented on HADOOP-11252:
--------------------------------------

Hi Wilfred, thanks for working on this. I want to start by making sure I understand the patch correctly. We're changing the default rpc timeout to be 5min rather than 0. This means that, rather than sending a ping after a read blocks for 60s, we throw an exception after a read blocks for 5 mins. This does not involve write timeouts in the SO_SNDTIMEO sense, so it seems misleading to call it a "write timeout". If we get blocked on the socket write, we will still get stuck until the tcp stack gives up (the tcp_retries2 you've mentioned elsewhere).

As [~daryn] points out above, and as [~atm] noted on HDFS-4858, we've historically been reluctant to change defaults like this because of potential side-effects. I'm not comfortable changing the defaults here either without sign-off from e.g. [~daryn], who knows the RPC stuff better.

So, a few review comments:

* Let's rename the config param as Ming recommends above, which seems more accurate. Including Ming's unit test would also be great.
* Let's keep the default value of this at 0 to preserve current behavior, unless [~daryn] OKs the change.
* Since getPingInterval is now package-protected, we should also change setPingInterval to package-protected for parity. It's only used in a test.
* We also need to add the new config key to core-default.xml, with a description.

> RPC client write does not time out by default
> ---------------------------------------------
>
>                 Key: HADOOP-11252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11252
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.5.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>         Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. This means that the network connection it creates will not time out when used to write data. The issue has shown up in YARN-2578 and HDFS-4858. Writes then fall back to the TCP-level retry mechanism (configured via tcp_retries2) and time out only after roughly 15-30 minutes, which is far too long for a default behaviour.
> Using 0 as the default timeout is incorrect. We should use a sane default, and the "ipc.ping.interval" configuration value is a logical choice for it. The default behaviour should change from 0 to the ping interval read from the Configuration.
> Fixing it in Common makes more sense than finding and changing every other point in the code that does not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350
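To make the proposal in the description concrete, here is a minimal sketch of the suggested fallback, assuming the standard "ipc.ping.interval" key and its 60-second default. The class and the helper resolveRpcTimeout are hypothetical and only stand in for the logic at the RPC.java lines cited above; this is not the attached patch.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RpcTimeoutDefaults {
  // Key and default for the IPC ping interval (60 seconds, in milliseconds).
  static final String IPC_PING_INTERVAL_KEY = "ipc.ping.interval";
  static final int IPC_PING_INTERVAL_DEFAULT = 60000;

  /**
   * Hypothetical helper: when the caller did not supply an rpcTimeout
   * (i.e. passed 0), fall back to the configured ping interval instead
   * of leaving the connection with no timeout at all.
   */
  static int resolveRpcTimeout(Configuration conf, int requestedTimeout) {
    if (requestedTimeout > 0) {
      return requestedTimeout; // caller chose an explicit timeout
    }
    return conf.getInt(IPC_PING_INTERVAL_KEY, IPC_PING_INTERVAL_DEFAULT);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // With no explicit timeout configured, this prints 60000.
    System.out.println(resolveRpcTimeout(conf, 0));
  }
}
{code}

Putting the fallback in one place in Common, as sketched here, matches the description's argument that it is better to fix the default centrally than to chase every call site that currently passes 0.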