[
https://issues.apache.org/jira/browse/HDFS-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16788363#comment-16788363
]
Ekanth Sethuramalingam commented on HDFS-14317:
-----------------------------------------------
Good point [~xkrogen]. I missed it too. I am not too worried about this in a
single-standby scenario (where I still think the standby would trigger the roll
and the autoroll multiplier provides upper-bound protection). However, this
will be a problem in a multiple-standby setup, where each standby would
independently roll edits - this is an undesirable side effect of this fix.
Please go ahead and file a Jira - we may need to think this through carefully
for cases both with and without in-progress edit log tailing. This is probably
a good time to simplify/clean up the logic here.
> Standby does not trigger edit log rolling when in-progress edit log tailing
> is enabled
> --------------------------------------------------------------------------------------
>
> Key: HDFS-14317
> URL: https://issues.apache.org/jira/browse/HDFS-14317
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.0, 3.0.0
> Reporter: Ekanth Sethuramalingam
> Assignee: Ekanth Sethuramalingam
> Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14317.001.patch, HDFS-14317.002.patch,
> HDFS-14317.003.patch, HDFS-14317.004.patch
>
>
> The standby uses the following method to check whether it is time to trigger
> edit log rolling on the active:
> {code}
> /**
>  * @return true if the configured log roll period has elapsed.
>  */
> private boolean tooLongSinceLastLoad() {
>   return logRollPeriodMs >= 0 &&
>       (monotonicNow() - lastLoadTimeMs) > logRollPeriodMs;
> }
> {code}
> In doTailEdits(), lastLoadTimeMs is updated whenever the standby successfully
> tails any edits:
> {code}
> if (editsLoaded > 0) {
>   lastLoadTimeMs = monotonicNow();
> }
> {code}
> The default for {{dfs.ha.log-roll.period}} is 120 seconds and the default for
> {{dfs.ha.tail-edits.period}} is 60 seconds. With in-progress edit log tailing
> enabled, the standby loads edits on nearly every tail, so lastLoadTimeMs is
> reset about every 60 seconds and tooLongSinceLastLoad() almost never returns
> true. As a result, edit logs are not rolled for a long time, until
> {{dfs.namenode.edit.log.autoroll.multiplier.threshold}} takes effect.
> [In our deployment, this resulted in in-progress edit logs getting deleted.
> The sequence of events was that the standby was able to checkpoint twice
> while the in-progress edit log was growing on the active. When the
> NNStorageRetentionManager decided to clean up old checkpoints and edit logs,
> it removed the in-progress edit log from the active and QJM (as the txnid of
> the in-progress edit log was older than the 2 most recent checkpoints),
> resulting in the irrecoverable loss of a few minutes' worth of metadata.]
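
To make the timing argument in the quoted description concrete, here is a
minimal, hypothetical simulation. It is not the actual EditLogTailer code: the
class EditLogRollSketch, the method countRollTriggers and the
finalizedEditsAvailable flag are made up for illustration. It assumes a busy
cluster where, with in-progress tailing, every tail loads at least one edit,
and where, without it, edits only become loadable after a roll finalizes a
segment. It ignores the autoroll threshold on the active and any other
conditions the real code checks before triggering a roll.

```java
/** Hypothetical model of the standby's roll-trigger timing; a sketch only. */
final class EditLogRollSketch {

    /**
     * Counts how often the tooLongSinceLastLoad() predicate from the report
     * would fire while the standby tails every tailPeriodMs for durationMs.
     *
     * @param inProgressTailing true models in-progress tailing on a busy
     *        cluster, where every tail is assumed to load some edits
     */
    static int countRollTriggers(long logRollPeriodMs, long tailPeriodMs,
                                 long durationMs, boolean inProgressTailing) {
        long lastLoadTimeMs = 0;
        boolean finalizedEditsAvailable = false; // assumption: set by a roll
        int triggers = 0;

        for (long now = tailPeriodMs; now <= durationMs; now += tailPeriodMs) {
            // Same predicate as tooLongSinceLastLoad() in the report.
            boolean tooLong = logRollPeriodMs >= 0
                    && (now - lastLoadTimeMs) > logRollPeriodMs;
            if (tooLong) {
                triggers++;
                finalizedEditsAvailable = true; // the roll finalizes a segment
            }

            // Same update as in doTailEdits(): reset only if edits were loaded.
            boolean editsLoaded = inProgressTailing || finalizedEditsAvailable;
            if (editsLoaded) {
                lastLoadTimeMs = now;
                finalizedEditsAvailable = false;
            }
        }
        return triggers;
    }

    public static void main(String[] args) {
        long rollMs = 120_000L;   // dfs.ha.log-roll.period default (120 s)
        long tailMs = 60_000L;    // dfs.ha.tail-edits.period default (60 s)
        long hourMs = 3_600_000L; // simulate one hour

        System.out.println("rolls triggered without in-progress tailing: "
                + countRollTriggers(rollMs, tailMs, hourMs, false));
        System.out.println("rolls triggered with in-progress tailing:    "
                + countRollTriggers(rollMs, tailMs, hourMs, true));
    }
}
```

Under these assumptions, one simulated hour with the default periods yields 20
roll triggers without in-progress tailing and none with it, because
lastLoadTimeMs is reset on every 60-second tail and the elapsed time never
exceeds 120 seconds.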