[ https://issues.apache.org/jira/browse/HDFS-14963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969795#comment-16969795 ]
Xudong Cao edited comment on HDFS-14963 at 11/8/19 3:27 AM:
------------------------------------------------------------
cc [~shv] [~elgoiri] [~vagarychen] [~weichiu] Thank you all for your attention. For the convenience of reading, I have uploaded an additional patch besides the github PR (they are exactly the same patch). Based on this patch:
# The cache directory is configurable via a newly introduced item "dfs.client.failover.cache-active.dir"; its default value is ${java.io.tmpdir}, which is /tmp on the Linux platform.
# Writing/reading a cache file is done under file lock protection, and we use trylock() instead of lock(), so in a high-concurrency scenario reading/writing the cache file will not become a bottleneck. If trylock() fails while reading, we just fall back to what we have today: simply return an index of 0. If trylock() fails while writing, the write simply returns and the client continues. In fact, I think both situations should be very rare.
# All cache files' modes are manually set to "666", meaning every process can read/write them.
# This cache mechanism is robust: regardless of whether the cache file was accidentally deleted or its content was maliciously modified, readActiveCache() always returns a legal index, and writeActiveCache() will automatically rebuild the cache file on the next failover. (An illustrative sketch of these two methods follows right after this list.)
# We do already have dfs.client.failover.random.order; I actually used it in the unit test. Zkfc does know which NN is active right now, but it does not provide an rpc interface allowing us to query it, and I think an rpc call is much more expensive than reading/writing local files.
# cc [~xkrogen], I will tackle the logging issue discussed in (2) in a separate JIRA.
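For illustration, here is a minimal sketch of the readActiveCache()/writeActiveCache() flow described in points 2-4 above. The class name, method signatures, and bodies below are illustrative assumptions, not the actual patch code; only the trylock() usage, the "666" mode, and the fall-back-to-index-0 behavior follow the description above.
{code:java}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Illustrative sketch only: not the patch code.
public class ActiveIndexCache {
  private final File cacheFile;

  public ActiveIndexCache(String cacheDir, String nsId) {
    // One cache file per cluster, named by its uri, e.g. /tmp/ns1
    this.cacheFile = new File(cacheDir, nsId);
  }

  /** Returns the cached active NN index, or 0 whenever the lock, file or content is unusable. */
  public int readActiveCache(int nnCount) {
    try (RandomAccessFile raf = new RandomAccessFile(cacheFile, "r");
         FileChannel channel = raf.getChannel();
         FileLock lock = channel.tryLock(0L, Long.MAX_VALUE, true)) { // shared, non-blocking
      if (lock == null) {
        return 0;                                   // lock busy: fall back to index 0
      }
      String line = raf.readLine();
      int index = (line == null) ? 0 : Integer.parseInt(line.trim());
      return (index >= 0 && index < nnCount) ? index : 0;  // always return a legal index
    } catch (IOException | NumberFormatException e) {
      return 0;                                     // missing or corrupted cache file: fall back to 0
    }
  }

  /** Best-effort write of the latest active NN index; skips silently if the lock is busy. */
  public void writeActiveCache(int activeIndex) {
    try (RandomAccessFile raf = new RandomAccessFile(cacheFile, "rw");
         FileChannel channel = raf.getChannel();
         FileLock lock = channel.tryLock()) {       // exclusive, non-blocking
      if (lock == null) {
        return;                                     // another process is writing: just continue
      }
      raf.setLength(0L);
      raf.writeBytes(Integer.toString(activeIndex));
      // Approximate mode "666" so every local process can read/write the cache file.
      cacheFile.setReadable(true, false);
      cacheFile.setWritable(true, false);
    } catch (IOException e) {
      // Ignore: the cache will simply be rebuilt on the next failover.
    }
  }
}
{code}
Using a shared lock for reads and an exclusive lock for writes keeps concurrent readers from blocking each other, and tryLock() guarantees the cache never stalls an rpc call.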
> Add HDFS Client machine caching active namenode index mechanism.
> -----------------------------------------------------------------
>
>                 Key: HDFS-14963
>                 URL: https://issues.apache.org/jira/browse/HDFS-14963
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 3.1.3
>            Reporter: Xudong Cao
>            Assignee: Xudong Cao
>            Priority: Minor
>         Attachments: HDFS-14963.000.patch
>
>
> In a multi-NameNode scenario, a new hdfs client always begins its rpc calls from the 1st namenode, simply polls, and finally determines the current Active namenode.
> This brings at least two problems:
> # Extra failover cost, especially in the case of frequent creation of clients.
> # Unnecessary log printing. Suppose there are 3 NNs and the 3rd is the ANN, and a client starts its rpc with the 1st NN: the failover from the 1st NN to the 2nd NN is silent, but the failover from the 2nd NN to the 3rd NN prints some unnecessary logs, and in some scenarios these logs can be very numerous:
> {code:java}
> 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459)
>         ...{code}
> We can introduce a solution for this problem: on the client machine, for every hdfs cluster, cache its current Active NameNode index in a separate cache file named by its uri. *Note these cache files are shared by all hdfs client processes on this machine*.
> For example, suppose there are hdfs://ns1 and hdfs://ns2, and the client machine cache file directory is /tmp, then:
> # the ns1 cluster related cache file is /tmp/ns1
> # the ns2 cluster related cache file is /tmp/ns2
> And then:
> # When a client starts, it reads the current Active NameNode index from the corresponding cache file based on the target hdfs uri, and then directly makes an rpc call toward the right ANN.
> # After each failover, the client needs to write the latest Active NameNode index to the corresponding cache file based on the target hdfs uri.
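For illustration, here is a minimal sketch of how a client might wire this cache into its failover handling, based only on the two steps above. The class, fields, and method names are assumptions made for the sketch (only the config key dfs.client.failover.cache-active.dir and the cache-file naming come from this issue), and it reuses the ActiveIndexCache sketch shown earlier in this message.
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch only: a hypothetical helper, not the patch code.
public class CachedActiveIndexHelper {
  private final ActiveIndexCache cache;  // the sketch class shown earlier in this message
  private final int nnCount;
  private int currentProxyIndex;

  public CachedActiveIndexHelper(Configuration conf, URI nameNodeUri, int nnCount) {
    // Default cache directory is ${java.io.tmpdir}, i.e. /tmp on Linux.
    String cacheDir = conf.get("dfs.client.failover.cache-active.dir",
        System.getProperty("java.io.tmpdir"));
    // Cache file is named by the cluster uri, e.g. /tmp/ns1 for hdfs://ns1.
    this.cache = new ActiveIndexCache(cacheDir, nameNodeUri.getHost());
    this.nnCount = nnCount;
    // 1. On client start, begin from the cached active NN index instead of always from 0.
    this.currentProxyIndex = cache.readActiveCache(nnCount);
  }

  public int getCurrentProxyIndex() {
    return currentProxyIndex;
  }

  // 2. After each failover, persist the latest active NN index for other client processes.
  public void performFailover() {
    currentProxyIndex = (currentProxyIndex + 1) % nnCount;
    cache.writeActiveCache(currentProxyIndex);
  }
}
{code}
The only difference from today's behavior is the starting point: a fresh client begins at the cached index instead of 0, and every failover refreshes the cache for the next client process on the same machine.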