[
https://issues.apache.org/jira/browse/HDFS-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Timonin updated HDFS-15060:
----------------------------------
Description:
While upgrading Hadoop to a new version (following, for example,
[https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#namenode_-rollingUpgrade]
as the instructions), I ran into the following situation.
I am upgrading the JNs one by one:
# Upgrade and restart JN1.
# The NN sees JN1 as offline:
WARN client.QuorumJournalManager: Remote journal 10.73.67.132:8485 failed to
write txns 1205396-1205399. Will try to write to this JN again after the next
log roll.
# No log roll happens for some time (at least 1 min).
# Upgrade and restart JN2.
# The NN sees the same for JN2:
WARN client.QuorumJournalManager: Remote journal 10.73.67.68:8485 failed to
write txns 1205799-1205800. Will try to write to this JN again after the next
log roll.
# BUT at this point there is no JN quorum, even though JN1 is already back online:
FATAL namenode.FSEditLog: Error: flush failed for required journal
(JournalAndStream(mgr=QJM to [10.73.67.212:8485, 10.73.67.132:8485,
10.73.67.68:8485], stream=QuorumOutputStream starting at txid 1205246))
It looks like the NN should retry JNs marked as offline before giving up.
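To make the suggestion concrete, here is a purely hypothetical sketch; the JournalWriter and QuorumFlushSketch names are made up for illustration and are not the real QuorumJournalManager internals. The idea: before failing the required journal, re-probe the journals previously marked out-of-sync and only abort if a majority still cannot accept the txns. In reality the restarted JN would presumably also need to catch up on the txns it missed, which is what the log roll accomplishes today.
{code:java}
import java.util.List;

// Hypothetical illustration only: these names are made up and do not correspond
// to the real QuorumJournalManager internals.
interface JournalWriter {
  boolean isMarkedOutOfSync();
  // Returns true if the JN accepted this batch of transactions.
  boolean tryWrite(long firstTxId, long lastTxId);
}

class QuorumFlushSketch {
  /** Returns true if a majority of journals accepted the batch. */
  static boolean flushWithRetry(List<JournalWriter> journals, long firstTxId, long lastTxId) {
    int successes = 0;
    for (JournalWriter jw : journals) {
      if (!jw.isMarkedOutOfSync() && jw.tryWrite(firstTxId, lastTxId)) {
        successes++;
      }
    }
    int majority = journals.size() / 2 + 1;
    if (successes >= majority) {
      return true;
    }
    // Instead of aborting right away (the FATAL above), re-probe the journals
    // that were marked out-of-sync earlier; a JN that has since been restarted
    // (JN1 in my case) may be able to accept the transactions again.
    for (JournalWriter jw : journals) {
      if (jw.isMarkedOutOfSync() && jw.tryWrite(firstTxId, lastTxId)) {
        successes++;
        if (successes >= majority) {
          return true;
        }
      }
    }
    return false; // only now give up and fail the required journal
  }
}
{code}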
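In the meantime, a possible workaround (untested sketch, based only on the WARN above saying the JN will be retried after the next log roll) is to force an edit-log roll after each JN comes back and before restarting the next one. The hdfs dfsadmin -rollEdits command should do this from the shell; below is what I believe is the equivalent DistributedFileSystem.rollEdits() call, with hdfs://mycluster as a placeholder nameservice URI.
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RollEditsBetweenJnRestarts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "hdfs://mycluster" is a placeholder for the HA nameservice URI.
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // Ask the active NN to start a new edit-log segment; per the WARN above,
      // the NN only retries an out-of-sync JN after such a roll.
      long newSegmentTxId = dfs.rollEdits();
      System.out.println("New edit log segment starts at txid " + newSegmentTxId);
    }
  }
}
{code}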
> namenode doesn't retry JN when other JN goes down
> -------------------------------------------------
>
> Key: HDFS-15060
> URL: https://issues.apache.org/jira/browse/HDFS-15060
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.1.1
> Reporter: Andrew Timonin
> Priority: Minor
>