[ https://issues.apache.org/jira/browse/HDFS-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721678#comment-17721678 ]
ASF GitHub Bot commented on HDFS-11960:
---------------------------------------

LiuGuH opened a new pull request, #5642:
URL: https://github.com/apache/hadoop/pull/5642

   ### Description of PR

   The part of TestPendingReconstruction.testProcessPendingReconstructions() that verifies [HDFS-11960](https://issues.apache.org/jira/browse/HDFS-11960) is wrong:
   (1) It does not stop the PendingReconstructionMonitor, so the block id ends up in the timed-out queue because the timeout duration is 3 seconds.
   (2) The test block id should be blk_1_1 with a different genstamp (a sketch illustrating this point follows at the end of this message).
   (3) blk_1_1 should be tested against the same DatanodeDescriptor.

   ### How was this patch tested?

   ### For code changes:

   - [ ] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

> Successfully closed files can stay under-replicated.
> -----------------------------------------------------
>
>                 Key: HDFS-11960
>                 URL: https://issues.apache.org/jira/browse/HDFS-11960
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>             Fix For: 2.9.0, 3.0.0-alpha4, 2.8.2
>
>         Attachments: HDFS-11960-v2.branch-2.txt, HDFS-11960-v2.trunk.txt, HDFS-11960.patch
>
>
> If a certain set of conditions holds at the time of a file creation, a block
> of the file can stay under-replicated. This is because the block is
> mistakenly taken out of the under-replicated block queue and never gets
> reevaluated.
> Re-evaluation can be triggered if
> - a node containing a replica dies,
> - setrep is called, or
> - the NN replication queues are reinitialized (NN failover or restart).
> If none of these happens, the block stays under-replicated.
> Here is how it happens.
> 1) A replica is finalized, but the ACK does not reach the upstream in time.
> The IBR is also delayed.
> 2) A close recovery happens, which updates the gen stamp of the "healthy"
> replicas.
> 3) The file is closed with the healthy replicas. It is added to the
> replication queue.
> 4) A replication is scheduled, so it is added to the pending replication
> list. The replication target is picked as the failed node in 1).
> 5) The old IBR is finally received for the failed/excluded node. In the
> meantime, the replication fails, because there is already a finalized replica
> (with an older gen stamp) on the node.
> 6) The IBR processing removes the block from the pending list, adds it to
> the corrupt replicas list, and then issues invalidation. Since the block is in
> neither the replication queue nor the pending list, it stays under-replicated.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
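
For point (2) of the PR description above, a minimal, hypothetical sketch (it assumes org.apache.hadoop.hdfs.protocol.Block from the hadoop-hdfs-client jar and is not part of the PR itself) of why a report carrying the same block id but a different genstamp still matches the block the test tracks: Block equality is keyed on the block id alone, while Block.matchingIdAndGenStamp() also compares generation stamps.

```java
import org.apache.hadoop.hdfs.protocol.Block;

public class GenStampIdentitySketch {
  public static void main(String[] args) {
    // "blk_1_1": block id 1, length 1; the generation stamp values here are
    // illustrative assumptions, not taken from the test.
    Block finalized = new Block(1L, 1L, 1001L);
    // A stale incremental block report carries the same id and length but the
    // pre-recovery generation stamp.
    Block staleReport = new Block(1L, 1L, 1000L);

    // Block identity is keyed on the block id, so the stale report still
    // matches the entry tracked for reconstruction...
    System.out.println("equals: " + finalized.equals(staleReport));
    // ...even though the generation stamps differ.
    System.out.println("matchingIdAndGenStamp: "
        + Block.matchingIdAndGenStamp(finalized, staleReport));
  }
}
```

Under those assumptions, the first check prints true and the second false, which is the identity behavior behind step 5) of the quoted issue description: a stale report for the same block id can still knock the block off the pending list.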