[
https://issues.apache.org/jira/browse/HDFS-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209221#comment-13209221
]
Aaron T. Myers commented on HDFS-2955:
--------------------------------------
bq. What is the behavior by standby, with this patch, if it has completely read
the last segment and is waiting for the new segment to be completed? I believe
in that case it would anyway return zero.
Not quite. With this patch, if the standby NN has never been in the active
state, the metric will always output 18179, probably because of some oddity
with the way metrics output negative values (since curSegmentTxId is initially
set to HdfsConstants.INVALID_TXID, which is -12345). This is obviously
incorrect. If the standby NN has previously been in the active state, this
metric will always output 2, which is also incorrect.
bq. We will end up reading from in_progress log for automatic failover to
reduce the failover times.
Maybe. I strongly suspect that the time for automatic failover will be
dominated by the time to detect failure of the active and fence it, not by the
time it takes to read the most recent edit log segment once we've decided to
fail over. In that case, this optimization of reading in-progress edit logs
will provide little benefit.
Regardless, this isn't how it's implemented now.
bq. This would be one less place to change when standby starts reading from
in_progress.
Except that we should write a test asserting that this metric outputs the
correct values, in which case this code might change anyway. We don't yet know
how reading in-progress edit logs will be implemented.
bq. Regarding testing, any HA test will run into it. I have a 100% hit rate on
the actual cluster
Sure, but none of the tests will _fail_ because of this error, will they?
You'll see an error in the NN log, but only if you look. And even if tests
were failing without this patch, there's still no test asserting that the
metric outputs the correct value in the case of the standby NN.
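To make the missing test concrete, here is a minimal, self-contained sketch of
the kind of assertion being asked for. It is not the HDFS code: EditLogModel,
its State enum, and transactionsSinceLastLogRoll() are hypothetical stand-ins,
the formula (last written txid minus segment start txid, plus one) is an
assumption about how the metric is computed, and returning 0 on the standby is
only one possible choice of "correct" value.

```java
// Hypothetical model of the edit log state; not the real FSEditLog.
class StandbyMetricSketch {
    // Same value the comment quotes for HdfsConstants.INVALID_TXID.
    static final long INVALID_TXID = -12345L;

    // Assumed states: a standby that never wrote stays out of IN_SEGMENT.
    enum State { UNINITIALIZED, BETWEEN_LOG_SEGMENTS, IN_SEGMENT }

    static class EditLogModel {
        State state = State.UNINITIALIZED;
        long curSegmentTxId = INVALID_TXID;
        long lastWrittenTxId = 0;

        // Mirrors the reported failure: only valid while a segment is open.
        long getCurSegmentTxId() {
            if (state != State.IN_SEGMENT) {
                throw new IllegalStateException("Bad state: " + state);
            }
            return curSegmentTxId;
        }

        // Opening a segment consumes one txid for the segment-start marker.
        void startSegment(long txid) {
            state = State.IN_SEGMENT;
            curSegmentTxId = txid;
            lastWrittenTxId = txid;
        }

        void endSegment() {
            state = State.BETWEEN_LOG_SEGMENTS;
        }

        void logEdit() {
            lastWrittenTxId++;
        }

        // Guarded metric: report 0 when no segment is open for writing
        // (an assumption about the desired standby behavior).
        long transactionsSinceLastLogRoll() {
            if (state != State.IN_SEGMENT) {
                return 0;
            }
            return lastWrittenTxId - getCurSegmentTxId() + 1;
        }
    }

    public static void main(String[] args) {
        EditLogModel standby = new EditLogModel();
        System.out.println("standby: " + standby.transactionsSinceLastLogRoll());

        EditLogModel active = new EditLogModel();
        active.startSegment(101);
        active.logEdit();
        active.logEdit();
        System.out.println("active: " + active.transactionsSinceLastLogRoll());
    }
}
```

A real test would make the same assertions against the metric on a standby NN
that has never been active, and again on one that was previously active, since
those are the two cases reported above as producing wrong values.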
> HA: IllegalStateException during standby startup in getCurSegmentTxId
> ---------------------------------------------------------------------
>
> Key: HDFS-2955
> URL: https://issues.apache.org/jira/browse/HDFS-2955
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ha, name-node
> Affects Versions: HA branch (HDFS-1623)
> Reporter: Hari Mankude
> Assignee: Hari Mankude
> Attachments: HDFS-2955-HDFS-1623.patch, HDFS-2955-HDFS-1623.patch
>
>
> During standby restarts, the new metrics routine
> getTransactionsSinceLastLogRoll() calls getCurSegmentTxId(). The
> checkState() in getCurSegmentTxId() assumes that the log is open for
> writing, which is not the case on the standby.