[ 
https://issues.apache.org/jira/browse/HDFS-13735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555981#comment-16555981
 ] 

Chao Sun commented on HDFS-13735:
---------------------------------

Thanks for taking a look [~shv]! Please see my reply inline.

bq. Can we reuse an existing parameter for this purpose?

I looked around and the only existing configs are 
{{dfs.webhdfs.socket.[connect|read]-timeout}}, which do not seem like a good 
fit for this use case.

bq. If we cannot use existing, should we make the new ones public, keep 
undocumented, or use a reasonable hard-coded constant?

I'm more in favor of having config parameters for these. There are already 
several timeout-related configurations in {{QuorumJournalManager}}, all of 
which are exposed through {{hdfs-default.xml}}. Should we do the same for these 
two for consistency?

bq. If we introduce a new parameter, we should give it a reasonable default 
value. What is the reasonable timeout here? You set it to the old default.

I think it's probably fine to keep the old default *without 
ObserverNameNode*. With ObserverNameNode, though, the timeout value should be 
decreased for the reasons I listed above. Internally, our 5min P99 latency for 
this is roughly below 8 seconds (this also includes the time to apply the edit 
logs), so 10 seconds seems like a good value to me, though obviously it depends 
on the environment.
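
To make the defaults discussion concrete, here is a minimal sketch of how such a timeout could be resolved and applied. It is not the patch itself: the key name {{example.qjm.http.timeout.ms}}, the class {{QjmTimeoutSketch}}, and the use of {{java.util.Properties}} in place of the Hadoop configuration are all assumptions for illustration. Only the 60s and 10s values come from this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Properties;

/** Illustrative only: resolves an HTTP timeout from config and applies it. */
final class QjmTimeoutSketch {
    // Hypothetical key name, not a real HDFS configuration key.
    static final String TIMEOUT_KEY = "example.qjm.http.timeout.ms";
    // The old hard-coded default described in the issue (60s).
    static final int LEGACY_DEFAULT_MS = 60_000;
    // The lower value suggested above for ObserverNameNode (10s).
    static final int OBSERVER_DEFAULT_MS = 10_000;

    /** Returns the configured timeout, or a role-dependent default if unset. */
    static int resolveTimeoutMs(Properties conf, boolean observer) {
        int fallback = observer ? OBSERVER_DEFAULT_MS : LEGACY_DEFAULT_MS;
        String raw = conf.getProperty(TIMEOUT_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return fallback;
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException("timeout must be >= 0: " + value);
        }
        return value;
    }

    /** Creates (but does not yet connect) a connection with both timeouts set. */
    static HttpURLConnection open(String url, int timeoutMs) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(timeoutMs); // 0 would mean "wait forever"
            conn.setReadTimeout(timeoutMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With an empty config, {{resolveTimeoutMs(conf, true)}} falls back to the 10-second observer default, while an explicit value in the config always wins, so each environment can still tune it.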

bq. The best solution would be to take the http call (readOp()) out of the 
global lock. Can it be done?

I'm not completely sure. One approach might be to add an {{init}} method to 
{{EditLogInputStream}} and call it outside the global lock. This would call 
{{EditLogFileInputStream#init()}} (see 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java#L134])
 and create the HTTP connection. Another, more hacky way is to call 
{{EditLogInputStream#getVersion(false)}}.

However, one issue is that we currently select input streams inside the lock 
(see 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java#L298]),
 so we would have to move that selection outside the lock to make the above 
possible.
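
A rough sketch of that restructuring, under stated assumptions: {{TailerSketch}} and its nested {{EditStream}} interface are hypothetical stand-ins rather than the Hadoop classes, and a plain {{ReentrantReadWriteLock}} stands in for the global namespace lock. The slow {{init()}} step (where the HTTP connection would be opened) runs before the lock is taken, and only the apply step runs under it.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative only: initialize streams outside the lock, apply inside it. */
final class TailerSketch {
    /** Stand-in for an edit log input stream; not the Hadoop class. */
    interface EditStream {
        void init();  // slow step: would open the HTTP connection
        long apply(); // applies edits, returns how many were applied
    }

    // Stand-in for the global namespace lock.
    private final ReentrantReadWriteLock namespaceLock =
        new ReentrantReadWriteLock();
    private long appliedEdits;

    /** True if the calling thread currently holds the write lock. */
    boolean holdsLock() {
        return namespaceLock.isWriteLockedByCurrentThread();
    }

    /** Tails the given streams and returns the total number of applied edits. */
    long tail(List<? extends EditStream> streams) {
        // Phase 1: potentially slow network setup, done without the lock so
        // a connect timeout does not block other namespace operations.
        for (EditStream stream : streams) {
            stream.init();
        }
        // Phase 2: only the in-memory application of edits holds the lock.
        namespaceLock.writeLock().lock();
        try {
            for (EditStream stream : streams) {
                appliedEdits += stream.apply();
            }
            return appliedEdits;
        } finally {
            namespaceLock.writeLock().unlock();
        }
    }
}
```

In this shape a slow or unreachable JN delays only the tailing thread during phase 1. It does not shorten how long the tailer itself waits, which is why the timeout value still matters.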

Even if we can move the HTTP call outside the global lock, point 2) that I 
mentioned above:

bq. The namespace freshness of ObserverNameNode w.r.t Active NN will be as 
stale as 60s because of the long timeout. This may not be acceptable for some 
scenarios.

is still not resolved.
 

> Make QJM HTTP URL connection timeout configurable
> -------------------------------------------------
>
>                 Key: HDFS-13735
>                 URL: https://issues.apache.org/jira/browse/HDFS-13735
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: qjm
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Minor
>         Attachments: HDFS-13735.000.patch, HDFS-13735.001.patch
>
>
> We've seen "connect timed out" happen internally when QJM tries to open HTTP 
> connections to JNs. The code currently uses {{newDefaultURLConnectionFactory}}, 
> which applies the default timeout of 60s and is not configurable.
> It would be better for this to be configurable, especially for 
> ObserverNameNode (HDFS-12943), where latency is important, and 60s may not be 
> a good value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
