[ https://issues.apache.org/jira/browse/HDFS-14963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969795#comment-16969795 ]
Xudong Cao edited comment on HDFS-14963 at 11/8/19 3:27 AM:
------------------------------------------------------------
cc [~shv] [~elgoiri] [~vagarychen] [~weichiu] Thank you all for your attention. For the convenience of reading, I have uploaded an additional patch besides the github PR (they are exactly the same patch). Based on this patch:
# The cache directory is configurable via a newly introduced item "dfs.client.failover.cache-active.dir"; its default value is ${java.io.tmpdir}, which is /tmp on the Linux platform.
# Writing/reading a cache file is done under file lock protection, and we use trylock() instead of lock(), so in a high-concurrency scenario reading/writing the cache file will not become a bottleneck. If trylock() fails while reading, we just fall back to what we have today: simply return an index of 0. If trylock() fails while writing, the write simply returns and the client continues. In fact, I think both situations should be very rare.
# All cache files' modes are manually set to "666", meaning every process can read/write them.
# This cache mechanism is robust: regardless of whether the cache file was accidentally deleted or its content was maliciously modified, readActiveCache() always returns a legal index, and writeActiveCache() will automatically rebuild the cache file on the next failover. (An illustrative sketch of these two methods follows right after this list.)
# We do already have dfs.client.failover.random.order; I actually used it in the unit test. Zkfc does know which NN is active right now, but it does not provide an rpc interface allowing us to query it, and I think an rpc call is much more expensive than reading/writing local files.
# cc [~xkrogen], I will tackle the logging issue discussed in (2) in a separate JIRA.
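For illustration, here is a minimal sketch of the readActiveCache()/writeActiveCache() flow described in points 2-4 above. The class name, method signatures, and bodies below are illustrative assumptions, not the actual patch code; only the trylock() usage, the "666" mode, and the fall-back-to-index-0 behavior follow the description above.
{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Illustrative sketch only: not the patch code.
public class ActiveIndexCache {
  private final File cacheFile;

  public ActiveIndexCache(String cacheDir, String nsId) {
    // One cache file per cluster, named by its uri, e.g. /tmp/ns1
    this.cacheFile = new File(cacheDir, nsId);
  }

  /** Returns the cached active NN index, or 0 whenever the lock, file or content is unusable. */
  public int readActiveCache(int nnCount) {
    try (RandomAccessFile raf = new RandomAccessFile(cacheFile, "r");
         FileChannel channel = raf.getChannel();
         FileLock lock = channel.tryLock(0L, Long.MAX_VALUE, true)) { // shared, non-blocking
      if (lock == null) {
        return 0;                                   // lock busy: fall back to index 0
      }
      String line = raf.readLine();
      int index = (line == null) ? 0 : Integer.parseInt(line.trim());
      return (index >= 0 && index < nnCount) ? index : 0;  // always return a legal index
    } catch (IOException | NumberFormatException e) {
      return 0;                                     // missing or corrupted cache file: fall back to 0
    }
  }

  /** Best-effort write of the latest active NN index; skips silently if the lock is busy. */
  public void writeActiveCache(int activeIndex) {
    try (RandomAccessFile raf = new RandomAccessFile(cacheFile, "rw");
         FileChannel channel = raf.getChannel();
         FileLock lock = channel.tryLock()) {       // exclusive, non-blocking
      if (lock == null) {
        return;                                     // another process is writing: just continue
      }
      raf.setLength(0L);
      raf.writeBytes(Integer.toString(activeIndex));
      // Approximate mode "666" so every local process can read/write the cache file.
      cacheFile.setReadable(true, false);
      cacheFile.setWritable(true, false);
    } catch (IOException e) {
      // Ignore: the cache will simply be rebuilt on the next failover.
    }
  }
}
{code}
Using a shared lock for reads and an exclusive lock for writes keeps concurrent readers from blocking each other, and tryLock() guarantees the cache never stalls an rpc call.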
> Add HDFS Client machine caching active namenode index mechanism.
> -----------------------------------------------------------------
>
>                 Key: HDFS-14963
>                 URL: https://issues.apache.org/jira/browse/HDFS-14963
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 3.1.3
>            Reporter: Xudong Cao
>            Assignee: Xudong Cao
>            Priority: Minor
>         Attachments: HDFS-14963.000.patch
>
>
> In a multi-NameNode scenario, a new hdfs client always begins its rpc calls from the 1st namenode, simply polls, and finally determines the current Active namenode.
> This brings at least two problems:
> # Extra failover cost, especially in the case of frequent creation of clients.
> # Unnecessary log printing. Suppose there are 3 NNs and the 3rd is the ANN, and a client starts its rpc with the 1st NN: the failover from the 1st NN to the 2nd NN is silent, but the failover from the 2nd NN to the 3rd NN prints some unnecessary logs, and in some scenarios these logs can be very numerous:
> {code:java}
> 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459)
>         ...{code}
> We can introduce a solution for this problem: on the client machine, for every hdfs cluster, cache its current Active NameNode index in a separate cache file named by its uri. *Note these cache files are shared by all hdfs client processes on this machine*.
> For example, suppose there are hdfs://ns1 and hdfs://ns2, and the client machine cache file directory is /tmp, then:
> # the ns1 cluster related cache file is /tmp/ns1
> # the ns2 cluster related cache file is /tmp/ns2
> And then:
> # When a client starts, it reads the current Active NameNode index from the corresponding cache file based on the target hdfs uri, and then directly makes an rpc call toward the right ANN.
> # After each failover, the client needs to write the latest Active NameNode index to the corresponding cache file based on the target hdfs uri.
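For illustration, here is a minimal sketch of how a client might wire this cache into its failover handling, based only on the two steps above. The class, fields, and method names are assumptions made for the sketch (only the config key dfs.client.failover.cache-active.dir and the cache-file naming come from this issue), and it reuses the ActiveIndexCache sketch shown earlier in this message.
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch only: a hypothetical helper, not the patch code.
public class CachedActiveIndexHelper {
  private final ActiveIndexCache cache;  // the sketch class shown earlier in this message
  private final int nnCount;
  private int currentProxyIndex;

  public CachedActiveIndexHelper(Configuration conf, URI nameNodeUri, int nnCount) {
    // Default cache directory is ${java.io.tmpdir}, i.e. /tmp on Linux.
    String cacheDir = conf.get("dfs.client.failover.cache-active.dir",
        System.getProperty("java.io.tmpdir"));
    // Cache file is named by the cluster uri, e.g. /tmp/ns1 for hdfs://ns1.
    this.cache = new ActiveIndexCache(cacheDir, nameNodeUri.getHost());
    this.nnCount = nnCount;
    // 1. On client start, begin from the cached active NN index instead of always from 0.
    this.currentProxyIndex = cache.readActiveCache(nnCount);
  }

  public int getCurrentProxyIndex() {
    return currentProxyIndex;
  }

  // 2. After each failover, persist the latest active NN index for other client processes.
  public void performFailover() {
    currentProxyIndex = (currentProxyIndex + 1) % nnCount;
    cache.writeActiveCache(currentProxyIndex);
  }
}
{code}
The only difference from today's behavior is the starting point: a fresh client begins at the cached index instead of 0, and every failover refreshes the cache for the next client process on the same machine.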