[
https://issues.apache.org/jira/browse/HDFS-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838401#comment-17838401
]
ASF GitHub Bot commented on HDFS-17477:
---------------------------------------
dannytbecker opened a new pull request, #6748:
URL: https://github.com/apache/hadoop/pull/6748
### Description of PR
#### Summary
[HDFS-17453](https://issues.apache.org/jira/browse/HDFS-17453) fixes a race
condition between IncrementalBlockReports (IBR) and the Edit Log Tailer which
can cause the Standby NameNode (SNN) to incorrectly mark blocks as corrupt when
it transitions to Active. There are a few edge cases that
[HDFS-17453](https://issues.apache.org/jira/browse/HDFS-17453) does not cover.
For example:
1. SNN1 loads the edits for b1gs1 and b1gs2.
2. DN1 reports b1gs1 to SNN1, so it gets queued for later processing.
3. DN1 reports b1gs2 to SNN1, so it gets added to the blocks map.
4. SNN1 transitions to Active (ANN1).
5. ANN1 processes the pending DN message queue and marks DN1->b1gs1 as
corrupt because it was still in the queue.
#### Changes
Processing an IBR for a block-DN pair should always remove any messages
already queued for that pair from the pendingDNMessage queue. This prevents
stale IBRs from lingering in the queue and causing blocks to be marked corrupt
when the standby NN becomes active.
**Before**:
- Process IBR
- If the reported block's genstamp is neither from the future nor from the
past, then update the blocks map.
- If the reported block's genstamp is from the future or the past, then
keep only the latest IBR in the pendingDNMessage queue.
**After**:
- Process IBR
- Remove all queued messages for the reported block-DN pair from the
pendingDNMessage queue.
- If the reported block's genstamp is neither from the future nor from the
past, then update the blocks map.
- If the reported block's genstamp is from the future or the past, then
queue it.
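
The difference between the two flows can be illustrated with a small toy model. This is only a sketch, not the Hadoop implementation: the class `PendingQueueSketch`, its methods, and the `removeQueuedFirst` flag are hypothetical names invented for illustration, and genstamp comparison is reduced to a plain `long` check against the value loaded from the edits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of standby IBR handling. Hypothetical; not Hadoop code. */
final class PendingQueueSketch {
  // true models the "After" flow, false models the "Before" flow.
  private final boolean removeQueuedFirst;
  // block -> latest genstamp loaded from the edit log
  private final Map<String, Long> editLogGenstamp = new HashMap<>();
  // "DN->block" -> genstamp stored in the blocks map
  private final Map<String, Long> blocksMap = new HashMap<>();
  // "DN->block" -> genstamp of the single queued report
  private final Map<String, Long> pending = new HashMap<>();

  PendingQueueSketch(boolean removeQueuedFirst) {
    this.removeQueuedFirst = removeQueuedFirst;
  }

  void loadEdit(String block, long genstamp) {
    editLogGenstamp.merge(block, genstamp, Math::max);
  }

  void processIbr(String dn, String block, long reportedGs) {
    String key = dn + "->" + block;
    if (removeQueuedFirst) {
      pending.remove(key); // "After": always drop queued reports for this pair
    }
    Long expected = editLogGenstamp.get(block);
    if (expected != null && expected == reportedGs) {
      blocksMap.put(key, reportedGs); // genstamp matches: update the blocks map
    } else {
      pending.put(key, reportedGs); // future or past genstamp: keep latest only
    }
  }

  /** Drains the queue and returns the replicas that would be marked corrupt. */
  List<String> transitionToActive() {
    List<String> corrupt = new ArrayList<>();
    for (Map.Entry<String, Long> e : pending.entrySet()) {
      String key = e.getKey();
      String block = key.substring(key.indexOf("->") + 2);
      Long expected = editLogGenstamp.get(block);
      if (expected != null && e.getValue() < expected) {
        corrupt.add(key + "gs" + e.getValue()); // stale genstamp
      } else if (expected != null && e.getValue().equals(expected)) {
        blocksMap.put(key, e.getValue()); // edits have caught up
      }
    }
    pending.clear();
    return corrupt;
  }

  /** Replays the five-step scenario from the summary. */
  static List<String> replay(boolean removeQueuedFirst) {
    PendingQueueSketch nn = new PendingQueueSketch(removeQueuedFirst);
    nn.loadEdit("b1", 1L);
    nn.loadEdit("b1", 2L);
    nn.processIbr("DN1", "b1", 1L);
    nn.processIbr("DN1", "b1", 2L);
    return nn.transitionToActive();
  }

  public static void main(String[] args) {
    System.out.println("before: " + replay(false)); // before: [DN1->b1gs1]
    System.out.println("after:  " + replay(true));  // after:  []
  }
}
```

In this model, the only behavioural difference is the unconditional `pending.remove(key)` at the top of `processIbr`: without it, the stale b1gs1 report survives the later matching report and is treated as corrupt when the queue is drained.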
### How was this patch tested?
Added new unit tests and updated the unit tests added in
[HDFS-17453](https://issues.apache.org/jira/browse/HDFS-17453).
### For code changes:
- [X] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
> IncrementalBlockReport race condition additional edge cases
> -----------------------------------------------------------
>
> Key: HDFS-17477
> URL: https://issues.apache.org/jira/browse/HDFS-17477
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: auto-failover, ha, namenode
> Affects Versions: 3.3.5, 3.3.4, 3.3.6
> Reporter: Danny Becker
> Assignee: Danny Becker
> Priority: Major
>
> HDFS-17453 fixes a race condition between IncrementalBlockReports (IBR) and
> the Edit Log Tailer which can cause the Standby NameNode (SNN) to incorrectly
> mark blocks as corrupt when it transitions to Active. There are a few edge
> cases that HDFS-17453 does not cover.
> For example:
> 1. SNN1 loads the edits for b1gs1 and b1gs2.
> 2. DN1 reports b1gs1 to SNN1, so it gets queued for later processing.
> 3. DN1 reports b1gs2 to SNN1, so it gets added to the blocks map.
> 4. SNN1 transitions to Active (ANN1).
> 5. ANN1 processes the pending DN message queue and marks DN1->b1gs1 as
> corrupt because it was still in the queue.