[ 
https://issues.apache.org/jira/browse/HDFS-15024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986478#comment-16986478
 ] 

Chen Liang commented on HDFS-15024:
-----------------------------------

Looks like the v001 patch changes the sleep time to 0 when there are multiple 
NNs. I guess the downside is that if it is actually a network exception, the 
client may exhaust all its retries too soon, because there is a 0 interval 
between retries. I don't think this really makes a big difference though. 
Following [~xkrogen] and [~csun]'s comments, I think the general point here is 
that maybe we should reconsider what "retry" really means in the context of 
SbN read. Specifically, under this Jira:
1. If reading from a standby/observer fails, is it worth sleeping and then 
retrying (e.g. on a network exception)? Or do we just hope the next NN solves 
the problem, and so fail over to the next NN immediately (which is what the 
current v001 patch does)?
2. But if accessing the Active fails due to a network exception, then maybe we 
keep the current behavior of sleep and retry, hoping the sleeps get us past a 
temporary network issue.
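
To make the two cases above concrete, here is a rough, self-contained sketch 
of the decision. It is not Hadoop code and not what any patch on this Jira 
does: the class, the enum, the method names, and the 500 ms / 15 s values are 
all made up for illustration, and the backoff leaves out any jitter.

```java
// Hypothetical sketch, not Hadoop code: every name below is invented.
final class FailoverSleepSketch {
    enum NnState { ACTIVE, STANDBY, OBSERVER }

    /** Capped exponential backoff without jitter: base * 2^attempts, at most cap. */
    static long backoffMillis(long baseMillis, int attempts, long capMillis) {
        if (attempts <= 0) {
            return 0L; // no failover has happened yet, so do not wait
        }
        // Bound the shift so the product stays far from overflow for realistic bases.
        int shift = Math.min(attempts, 20);
        return Math.min(capMillis, baseMillis * (1L << shift));
    }

    /**
     * Case 1: a failed standby/observer read moves on to the next NN immediately.
     * Case 2: a failure against the active backs off before the next attempt.
     */
    static long sleepBeforeNextAttempt(NnState failedNode, int failovers,
                                       long baseMillis, long capMillis) {
        if (failedNode != NnState.ACTIVE) {
            return 0L;
        }
        return backoffMillis(baseMillis, failovers, capMillis);
    }

    public static void main(String[] args) {
        // Observer failed after two failovers: no sleep.
        System.out.println(sleepBeforeNextAttempt(NnState.OBSERVER, 2, 500L, 15_000L));
        // Active failed after two failovers: 500 ms * 2^2.
        System.out.println(sleepBeforeNextAttempt(NnState.ACTIVE, 2, 500L, 15_000L));
    }
}
```

The only point of the sketch is that the sleep depends on which kind of NN 
failed rather than only on the failover count. Whether the client can reliably 
know the state of the NN it just failed against is a separate question.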

> [SBN read] In FailoverOnNetworkExceptionRetry , Number of NameNodes as a 
> condition of calculation of sleep time
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15024
>                 URL: https://issues.apache.org/jira/browse/HDFS-15024
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.10.0, 3.3.0, 3.2.1
>            Reporter: huhaiyang
>            Priority: Major
>         Attachments: HDFS-15024.001.patch, client_error.log
>
>
> When we enable the ONN (Observer NameNode), the client configuration 
> contains three NN nodes, for example:
> <property>
>     <name>dfs.ha.namenodes.ns1</name>
>     <value>nn2,nn3,nn1</value>
> </property>
> Currently:
> nn2 is in standby state
> nn3 is in observer state
> nn1 is in active state
> When the user performs an HDFS operation such as
> ./bin/hadoop --loglevel debug fs 
> -Ddfs.client.failover.proxy.provider.ns1=org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider
>  -mkdir /user/haiyang1/test8
> the msync call needs to reach nn1 (the active).
> The client actually connects to nn2 first, and a failover is required.
> It then connects to nn3, which does not meet the requirements either, so 
> another failover is needed, but this time the client sleeps for a while 
> before failing over.
> Only after that sleep does the request finally connect to nn1 and succeed.
> In FailoverOnNetworkExceptionRetry#getFailoverOrRetrySleepTime, the current 
> default implementation calculates a sleep time once more than one failover 
> has been performed.
> I think it is more reasonable to use the number of NameNodes as the 
> condition for calculating the sleep time.
> That is, in the current test, the failover after connecting to nn3 would not 
> need to sleep, and the client would connect directly to the next NN node.
> See client_error.log for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

