[ 
https://issues.apache.org/jira/browse/HDFS-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209221#comment-13209221
 ] 

Aaron T. Myers commented on HDFS-2955:
--------------------------------------

bq. What is the behavior by standby, with this patch, if it has completely read 
the last segment and is waiting for the new segment to be completed? I believe 
in that case it would anyway return zero.

Not quite. With this patch, if the standby NN has never been in the active 
state, the metric will always output 18179, probably because of some oddity 
with the way metrics output negative values (since curSegmentTxId is initially 
set to HdfsConstants.INVALID_TXID, which is -12345.) This is obviously 
incorrect. If the standby NN has previously been in the active state, this 
metric will always output 2, which is also incorrect.

bq. We will end up reading from in_progress log for automatic failover to 
reduce the failover times.

Maybe. I strongly suspect that the time for automatic failover will be 
dominated by the time it takes to detect failure of the active and fence it, 
not by the time it takes to read the most recent edit log segment once we've 
decided to fail over. If so, the optimization of reading in-progress edit 
logs will provide little benefit.

Regardless, this isn't how it's implemented now.

bq. This would be one less place to change when standby starts reading from 
in_progress.

Except that we should write a test verifying that this metric outputs the 
correct values, in which case this code might change anyway. We don't yet know 
how reading in-progress edit logs will be implemented.

bq. Regarding testing, any HA test will run into it. I have a 100% hit rate on 
the actual cluster

Sure, but none of the tests will _fail_ because of this error, will they? 
You'll see an error in the NN log, but only if you go looking for it. And even if tests 
were failing without this patch, there's still no test asserting that the 
metric outputs the correct value in the case of the standby NN.
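
To make the kind of check I have in mind concrete, here is a minimal sketch. It is not the actual FSEditLog code: the class EditLogSketch, its methods startSegment() and logEdit(), and the formula "last written txid - current segment txid + 1" are all simplifications I'm assuming for illustration. It also assumes that 0 is the value we want reported while no segment is open for writing, which is still an open question in this thread.

```java
// Hypothetical, simplified model of an edit log; NOT the real Hadoop FSEditLog.
final class EditLogSketch {
    // Same sentinel value as HdfsConstants.INVALID_TXID mentioned above.
    static final long INVALID_TXID = -12345L;

    enum State { BETWEEN_LOG_SEGMENTS, IN_SEGMENT }

    private State state = State.BETWEEN_LOG_SEGMENTS;
    private long curSegmentTxId = INVALID_TXID;
    private long lastWrittenTxId = 0L;

    // Opens a segment for writing, as an active NN would (hypothetical helper).
    void startSegment(long firstTxId) {
        state = State.IN_SEGMENT;
        curSegmentTxId = firstTxId;
        lastWrittenTxId = firstTxId - 1;
    }

    // Records one transaction in the open segment (hypothetical helper).
    void logEdit() {
        if (state != State.IN_SEGMENT) {
            throw new IllegalStateException("Bad state: " + state);
        }
        lastWrittenTxId++;
    }

    boolean isSegmentOpen() {
        return state == State.IN_SEGMENT;
    }

    // Only meaningful while a segment is open; throws otherwise, which is
    // the failure the standby hits when the metric calls this unguarded.
    long getCurSegmentTxId() {
        if (!isSegmentOpen()) {
            throw new IllegalStateException("Bad state: " + state);
        }
        return curSegmentTxId;
    }

    // Guarded metric: reports 0 (an assumed value) when no segment is open,
    // instead of doing arithmetic on the INVALID_TXID sentinel.
    long getTransactionsSinceLastLogRoll() {
        if (!isSegmentOpen()) {
            return 0L;
        }
        return lastWrittenTxId - getCurSegmentTxId() + 1;
    }

    public static void main(String[] args) {
        EditLogSketch standby = new EditLogSketch();
        System.out.println(standby.getTransactionsSinceLastLogRoll());

        EditLogSketch active = new EditLogSketch();
        active.startSegment(101L);
        active.logEdit();
        active.logEdit();
        active.logEdit();
        System.out.println(active.getTransactionsSinceLastLogRoll());
    }
}
```

The point is that a test should pin down the standby case explicitly, rather than rely on the absence of an exception. Whatever value we decide is right for the standby, the assertion is what keeps it from silently regressing.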
                
> HA: IllegalStateException during standby startup in getCurSegmentTxId
> ---------------------------------------------------------------------
>
>                 Key: HDFS-2955
>                 URL: https://issues.apache.org/jira/browse/HDFS-2955
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude
>         Attachments: HDFS-2955-HDFS-1623.patch, HDFS-2955-HDFS-1623.patch
>
>
> A new metrics routine, getTransactionsSinceLastLogRoll(), calls 
> getCurSegmentTxId(), and this fails during standby restarts. The 
> checkState() in getCurSegmentTxId() assumes that the log is open for 
> writing, which is not the case on the standby.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
