[ https://issues.apache.org/jira/browse/HDFS-14961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984338#comment-16984338 ]
Vinayakumar B commented on HDFS-14961:
--------------------------------------
Thanks [~ayushtkn] for the analysis and the fix.
Fix looks good to me. +1.
There is already a check in the HealthMonitor thread that calls quitElection
when the NameNode state is found to be OBSERVER.
{code:java}
if (changedState == HAServiceState.OBSERVER) {
  elector.quitElection(true);
  serviceState = HAServiceState.OBSERVER;
  return;
}
{code}
But this monitoring is asynchronous and runs every 1 second. In case of a manual
transition, the state can change directly in the NameNode, so ZKFC only syncs to
the new state during its next monitoring pass and then quits the election.
As [~ferhui] suggested, checking the state before joining the election doesn't
hurt either. It can be added as a separate Improvement Jira, as [~ayushtkn]
already said.
{code:java}
if (serviceState != HAServiceState.OBSERVER) {
  elector.joinElection(targetToData(localTarget));
}
{code}
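To illustrate how the two checks would work together, here is a rough, self-contained sketch. It is a toy model only, not the actual ZKFC code: ObserverGuardSketch, FakeElector, FakeController, onServiceStateChanged and maybeJoinElection are hypothetical names made up for this example, and the fake elector just records calls instead of talking to ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the two checks discussed above. All names are hypothetical, not Hadoop code. */
class ObserverGuardSketch {
    enum HAServiceState { ACTIVE, STANDBY, OBSERVER }

    /** Stand-in for the elector: it only records which calls were made. */
    static class FakeElector {
        final List<String> calls = new ArrayList<>();
        boolean inElection = false;

        void joinElection() {
            inElection = true;
            calls.add("join");
        }

        void quitElection(boolean needFence) {
            inElection = false;
            calls.add("quit");
        }
    }

    /** Stand-in for the failover controller, holding its last known service state. */
    static class FakeController {
        final FakeElector elector = new FakeElector();
        HAServiceState serviceState = HAServiceState.STANDBY;

        /** Models the existing monitoring check: quit the election once OBSERVER is seen. */
        void onServiceStateChanged(HAServiceState changedState) {
            if (changedState == HAServiceState.OBSERVER) {
                elector.quitElection(true);
                serviceState = HAServiceState.OBSERVER;
                return;
            }
            serviceState = changedState;
        }

        /** Models the suggested guard: do not join while the known state is OBSERVER. */
        void maybeJoinElection() {
            if (serviceState != HAServiceState.OBSERVER) {
                elector.joinElection();
            }
        }
    }

    public static void main(String[] args) {
        FakeController zkfc = new FakeController();
        zkfc.maybeJoinElection();                             // Standby: joins the election
        zkfc.onServiceStateChanged(HAServiceState.OBSERVER);  // monitor sees the manual transition
        zkfc.maybeJoinElection();                             // guard skips the join
        System.out.println(zkfc.elector.calls);               // prints [join, quit]
    }
}
```

In this model the guard only helps once the controller's own serviceState has caught up with the NameNode, so it complements the monitoring check rather than replacing it.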
> [SBN read] Prevent ZKFC changing Observer Namenode state
> --------------------------------------------------------
>
> Key: HDFS-14961
> URL: https://issues.apache.org/jira/browse/HDFS-14961
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Íñigo Goiri
> Assignee: Ayush Saxena
> Priority: Major
> Attachments: HDFS-14961-01.patch, HDFS-14961-02.patch,
> HDFS-14961-03.patch, HDFS-14961-04.patch, ZKFC-TEST-14961.patch
>
>
> HDFS-14130 made ZKFC aware of the Observer Namenode and hence allows ZKFC
> to run along with the Observer node.
> The Observer Namenode isn't supposed to be part of the ZKFC election process.
> But if the Namenode was part of the election before being turned into an
> Observer by the transitionToObserver command, the ZKFC still sends
> instructions to the Namenode as a result of its previous participation and
> sometimes tends to change the state of the Observer to Standby.
> This is also the reason for the failure in TestDFSZKFailoverController.
> TestDFSZKFailoverController has been consistently failing with a timeout
> while waiting in testManualFailoverWithDFSHAAdmin(), in particular at
> {{waitForHAState(1, HAServiceState.OBSERVER);}}.