[
https://issues.apache.org/jira/browse/HDFS-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359685#comment-15359685
]
James Clampffer commented on HDFS-10441:
----------------------------------------
Thanks for the very detailed feedback, Bob. You caught a lot of issues that
slipped my mind, plus a few I hadn't really thought about.
As far as the bigger issues go (the other points are also important to me; I
just want to comment on these before the next patch):
bq. Do we pass information in the data pointer for the NN failover event? If
so, document it in events.h
I think I accidentally left this out. It needs to be fixed; otherwise I'll
take it out of the header and file another jira for it.
bq. rpc_connection.cc: HandleRpcResponse should push req back to the head of
the queue; alternately, don't dequeue it if we got a standby exception.
I need to take a closer look at how best to do this. Will fix.
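A minimal sketch of the head-of-queue requeue idea, using a std::deque and a hypothetical Request type (the names here are illustrative, not the actual libhdfs++ API):

```cpp
#include <deque>
#include <memory>
#include <string>

// Hypothetical stand-ins for the real libhdfs++ types.
struct Request {
  int call_id;
  std::string payload;
};

enum class RpcStatus { kOk, kStandbyException };

// Pending requests awaiting dispatch; front() is the next to go out.
using RequestQueue = std::deque<std::shared_ptr<Request>>;

// On a standby exception, put the request back at the head of the
// queue so it is the first thing retried after failover, preserving
// the original call ordering relative to later requests.
RpcStatus HandleRpcResponse(RpcStatus status,
                            std::shared_ptr<Request> req,
                            RequestQueue *pending) {
  if (status == RpcStatus::kStandbyException) {
    pending->push_front(req);  // retry first; keep ordering
  }
  return status;
}
```

The alternative mentioned in the review (not dequeuing at all on a standby exception) avoids the push_front entirely, at the cost of tracking whether the front request is in flight.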
bq. If HandleRpcResponse gets a kStandbyException, will CommsError be called
twice (once in HandleRpcResponse and again in OnRecvComplete)?
Yes, and I haven't figured out why yet; I should be able to track it down
quickly now that I'm more familiar with the RPC retry code. It doesn't seem to
break things, but it sure makes reading the logs more confusing than they need
to be.
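One way to keep CommsError from firing twice when both HandleRpcResponse and OnRecvComplete see the same failure is to latch a per-connection flag so only the first caller actually reports. A sketch under that assumption (the class and member names are illustrative, not the real libhdfs++ code):

```cpp
#include <atomic>
#include <cstdio>
#include <string>

// Illustrative connection wrapper that reports a comms error at
// most once, no matter how many code paths observe the failure.
class Connection {
 public:
  void CommsError(const std::string &why) {
    // exchange() returns the previous value; only the first caller
    // sees false and proceeds to report.
    if (error_reported_.exchange(true))
      return;
    ++errors_reported_;
    std::fprintf(stderr, "comms error: %s\n", why.c_str());
  }

  int errors_reported() const { return errors_reported_; }

 private:
  std::atomic<bool> error_reported_{false};
  int errors_reported_ = 0;
};
```

This masks the duplicate report rather than removing the duplicate call path, so it would be a stopgap until the retry code is untangled.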
bq. rpc_engine.cc: let's use both namenodes if servers.size() >= 2 rather than
just bailing out.
I'm torn on this one. As far as I know, you'd need to do a bit of hacking to
get more than two nodes working in HA, so it seems best to cover the common
case here. How about allowing more than two nodes but issuing a warning that
the cluster probably isn't going to behave as you'd expect based on the config
file, and that we discarded the extra nodes?
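The warn-and-truncate idea could look something like this (a sketch only; the function name and warning text are made up, not the actual libhdfs++ config code):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Keep the first two namenodes (the common HA pair) and warn that
// any extras from the config file are being discarded.
std::vector<std::string> SelectNamenodes(
    const std::vector<std::string> &configured) {
  std::vector<std::string> chosen = configured;
  if (chosen.size() > 2) {
    std::fprintf(stderr,
                 "warning: %zu namenodes configured; HA normally uses 2. "
                 "Extra namenodes will be ignored.\n",
                 chosen.size());
    chosen.resize(2);  // drop everything past the first pair
  }
  return chosen;
}
```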
bq. rpc_engine.h: IsCurrentActive/IsCurrentStandby are dangerous as designed:
they're asking for race conditions as we acquire the lock, check, release the
lock, then take action. Just before we take action, someone else could change
the value
You're totally right here. I didn't want to implement a mechanism that lets
other things lock the state of that structure for an arbitrary amount of time
(that seemed like asking for a deadlock). I think the best fix here is to make
them private to the HANamenodeTracker.
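With the checks private, the tracker can expose a single operation that holds the lock across both the check and the state change, so no caller can act on a stale answer. A minimal sketch of that check-and-act-under-one-lock shape (class, method, and endpoint names are hypothetical, not the actual tracker API):

```cpp
#include <mutex>
#include <string>
#include <utility>

// Illustrative tracker: the active/standby check is never exposed on
// its own, so there is no window between "check" and "act".
class NamenodeTracker {
 public:
  // Fail over only if the reported endpoint is still the active one;
  // a second caller reporting the same stale failure is a no-op.
  bool FailoverIfActive(const std::string &failed_endpoint) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (active_ != failed_endpoint)  // someone already failed over
      return false;
    std::swap(active_, standby_);    // check and swap under one lock
    return true;
  }

  std::string GetActive() {
    std::lock_guard<std::mutex> lock(mutex_);
    return active_;
  }

 private:
  std::mutex mutex_;
  std::string active_ = "nn1:8020";
  std::string standby_ = "nn2:8020";
};
```

Callers get a snapshot (GetActive) or an atomic transition (FailoverIfActive), never a bare boolean they might act on after the lock is released.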
bq. rpc_engine.cc: Remove RpcEngine::Start instead of deprecating it
Yep, I left that in to see if someone could think of a use for a hook that
gets called once the IoService is known to be running. If you can't think of
one, I'll be happy to get rid of the dead code.
bq. Don't forget to file bugs to handle more than two namenodes.
Sure. I was thinking of holding off until that becomes widely supported, but
it won't hurt to plan ahead.
As a general warning, I don't plan to have much internet access for the next
week. Once I do, I'll address the remaining comments and get a new patch up
that incorporates the rest of your feedback.
> libhdfs++: HA namenode support
> ------------------------------
>
> Key: HDFS-10441
> URL: https://issues.apache.org/jira/browse/HDFS-10441
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Reporter: James Clampffer
> Assignee: James Clampffer
> Attachments: HDFS-10441.HDFS-8707.000.patch,
> HDFS-10441.HDFS-8707.002.patch, HDFS-10441.HDFS-8707.003.patch,
> HDFS-10441.HDFS-8707.004.patch, HDFS-10441.HDFS-8707.005.patch,
> HDFS-10441.HDFS-8707.006.patch, HDFS-10441.HDFS-8707.007.patch,
> HDFS-10441.HDFS-8707.008.patch, HDFS-10441.HDFS-8707.009.patch,
> HDFS-8707.HDFS-10441.001.patch
>
>
> If a cluster is HA enabled then do proper failover.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)