[ https://issues.apache.org/jira/browse/HDFS-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600669#comment-13600669 ]

Todd Lipcon commented on HDFS-4591:
-----------------------------------

Looks pretty good. Just a few small notes:

1) Were you able to verify this on a real cluster?
2) In your test, I think you could use the DelayAnswer utility instead of the 
manual latches, etc. If not, no big deal, but it might make the test more 
obvious.
3) Can you add to the javadoc of checkOperation that it's recommended that all 
operations be checked once outside the lock (to prevent this bug) and then 
again inside the lock (to make sure that it's race-free)?
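
To illustrate the pattern suggested in note 3, here is a rough sketch. It is 
not the patch and not the real FSNamesystem code: the class 
CheckOperationSketch, its getFileInfo method, the simplified no-argument 
checkOperation, and the local StandbyException are all hypothetical stand-ins, 
and a plain ReentrantReadWriteLock is assumed in place of the FSNS fsLock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the namesystem; names here are assumptions.
public class CheckOperationSketch {

    /** Local stand-in for Hadoop's StandbyException (assumption). */
    static class StandbyException extends Exception {
        StandbyException(String msg) { super(msg); }
    }

    enum HAState { ACTIVE, STANDBY }

    // Assumed to play the role of the FSNS fsLock.
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    // volatile so the unlocked check sees state changes from other threads.
    private volatile HAState state;

    CheckOperationSketch(HAState initial) { this.state = initial; }

    void setState(HAState newState) { this.state = newState; }

    /** Simplified check: reject every operation while in standby. */
    void checkOperation() throws StandbyException {
        if (state == HAState.STANDBY) {
            throw new StandbyException("Operation not supported in standby state");
        }
    }

    /** Hypothetical read operation showing the check-lock-recheck order. */
    String getFileInfo(String path) throws StandbyException {
        // First check, outside the lock: fails fast even if a checkpoint
        // is holding the write lock, so the client can fail over.
        checkOperation();
        fsLock.readLock().lock();
        try {
            // Second check, inside the lock: covers a state transition that
            // happened between the first check and acquiring the lock.
            checkOperation();
            return "info:" + path;
        } finally {
            fsLock.readLock().unlock();
        }
    }

    // Simulate a long checkpoint holding the write lock.
    void lockForCheckpoint() { fsLock.writeLock().lock(); }
    void unlockAfterCheckpoint() { fsLock.writeLock().unlock(); }
}
```

With only the inner check, a caller on another thread would block on the read 
lock for as long as lockForCheckpoint() is in effect. With the outer check it 
gets the StandbyException right away.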
                
> HA clients can fail to fail over while Standby NN is performing long 
> checkpoint
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-4591
>                 URL: https://issues.apache.org/jira/browse/HDFS-4591
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, namenode
>    Affects Versions: 2.0.4-alpha
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>         Attachments: HDFS-4591.patch, HDFS-4591.patch
>
>
> Clients know to fail over to talk to the Active NN when they perform an RPC 
> to the Standby NN and it throws a StandbyException. However, most places in 
> the code that check if the NN is in the standby state do so inside the 
> FSNamesystem (FSNS) fsLock. Since this lock is held for the duration of the 
> saveNamespace during a checkpoint, StandbyExceptions will not be thrown 
> during this time, and clients block instead of failing over.

