[
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun Suresh updated HDFS-7858:
------------------------------
Attachment: HDFS-7858.10.patch
Thanks again for the review [~jingzhao],
Uploading a patch addressing your suggestions.
bq. do we need the latch in RequestHedgingInvocationHandler#invoke ?
Not necessarily. I just wanted to ensure all requests are started at almost
the same time. But yeah, since the size of the thread pool is equal to the
number of proxies, they should technically start simultaneously. I've removed it.
W.r.t. the requestTimeout:
Hmm, agreed, it's not really necessary. (But I think we have to document that
if this is refactored as a general Handler, where we are not sure of the
underlying Client/Server protocol and assumptions, a bounding timeout would be
good/necessary.)
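To illustrate what a bounding timeout could look like in a general-purpose handler, here is a minimal sketch (not code from the patch). It uses the standard {{ExecutorService#invokeAny}} overload that takes a timeout. The class name {{BoundedHedgeSketch}}, the method {{firstResultWithin}}, and the choice to surface a timeout as an {{IllegalStateException}} are assumptions made only for this example.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: all names here are illustrative, not from the patch.
final class BoundedHedgeSketch {

  /**
   * Runs every call concurrently and returns the first successful result.
   * Gives up after timeoutMillis instead of waiting indefinitely.
   * Assumes calls is non-empty.
   */
  static <T> T firstResultWithin(List<Callable<T>> calls, long timeoutMillis) {
    // One thread per call, so every request can start right away.
    ExecutorService pool = Executors.newFixedThreadPool(calls.size());
    try {
      return pool.invokeAny(calls, timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      throw new IllegalStateException(
          "no call answered within " + timeoutMillis + " ms", e);
    } catch (ExecutionException e) {
      // Every call failed; the cause is one of the underlying exceptions.
      throw new IllegalStateException("all calls failed", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } finally {
      // Interrupt any calls that are still running.
      pool.shutdownNow();
    }
  }
}
```

With one slow call and one fast call, the fast result is returned and the slow call is interrupted. If no call finishes in time, the caller gets an exception rather than blocking forever.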
bq. We can use the ExecutionException thrown by callResultFuture.get() to get
the exception thrown by the invocation.
So, if you notice, I have a {{CallResult}} object, which is what is actually
returned by callResultFuture.get(). I need this to get the name of the proxy
that was successful (so I can key into the targetProxies map). CallResult
catches the exception and sets it as the result.
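A rough sketch of that idea, for readers who have not opened the patch: each task catches its own exception and returns a result object that also carries the proxy name, so the caller always knows which proxy a result belongs to. This is not the patch's code. The class {{HedgedCallSketch}}, the fields of {{CallResult}}, and the use of plain {{Callable}}s in place of real proxies are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: names and structure are illustrative, not from the patch.
final class HedgedCallSketch {

  /** Outcome of one call: the proxy that ran it, plus a value or the caught error. */
  static final class CallResult {
    final String proxyName;
    final Object value;
    final Throwable error;

    CallResult(String proxyName, Object value, Throwable error) {
      this.proxyName = proxyName;
      this.value = value;
      this.error = error;
    }

    boolean succeeded() {
      return error == null;
    }
  }

  /**
   * Invokes every proxy concurrently and returns the first successful result.
   * If all fail, returns the last failure. Assumes targetProxies is non-empty.
   */
  static CallResult invokeAll(Map<String, Callable<Object>> targetProxies) {
    ExecutorService pool = Executors.newFixedThreadPool(targetProxies.size());
    CompletionService<CallResult> completed = new ExecutorCompletionService<>(pool);
    try {
      for (Map.Entry<String, Callable<Object>> entry : targetProxies.entrySet()) {
        final String name = entry.getKey();
        final Callable<Object> call = entry.getValue();
        completed.submit(() -> {
          try {
            return new CallResult(name, call.call(), null);
          } catch (Exception ex) {
            // Keep the failure as data so the proxy name is not lost.
            return new CallResult(name, null, ex);
          }
        });
      }
      CallResult last = null;
      for (int i = 0; i < targetProxies.size(); i++) {
        last = completed.take().get();
        if (last.succeeded()) {
          return last;
        }
      }
      return last;
    } catch (InterruptedException ex) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", ex);
    } catch (ExecutionException ex) {
      throw new IllegalStateException("unexpected task failure", ex);
    } finally {
      pool.shutdownNow();
    }
  }
}
```

The returned {{proxyName}} can then be used as the key into the map of proxies, which an {{ExecutionException}} on its own would not provide.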
> Improve HA Namenode Failover detection on the client
> ----------------------------------------------------
>
> Key: HDFS-7858
> URL: https://issues.apache.org/jira/browse/HDFS-7858
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Arun Suresh
> Assignee: Arun Suresh
> Labels: BB2015-05-TBR
> Attachments: HDFS-7858.1.patch, HDFS-7858.10.patch,
> HDFS-7858.2.patch, HDFS-7858.2.patch, HDFS-7858.3.patch, HDFS-7858.4.patch,
> HDFS-7858.5.patch, HDFS-7858.6.patch, HDFS-7858.7.patch, HDFS-7858.8.patch,
> HDFS-7858.9.patch
>
>
> In an HA deployment, Clients are configured with the hostnames of both the
> Active and Standby Namenodes. Clients will first try one of the NNs
> (non-deterministically) and if it's a standby NN, then it will respond to the
> client to retry the request on the other Namenode.
> If the client happens to talk to the Standby first, and the standby is
> undergoing some GC / is busy, then those clients might not get a response
> soon enough to try the other NN.
> Proposed Approach to solve this :
> 1) Since Zookeeper is already used as the failover controller, the clients
> could talk to ZK and find out which is the active namenode before contacting
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when
> there is a failover so they do not have to query ZK every time to find out the
> active NN
> 3) Clients can also cache the last active NN in the user's home directory
> (~/.lastNN) so that short-lived clients can try that Namenode first before
> querying ZK
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)