[ 
https://issues.apache.org/jira/browse/HDFS-16557?focusedWorklogId=764587&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-764587
 ]

ASF GitHub Bot logged work on HDFS-16557:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 30/Apr/22 02:12
            Start Date: 30/Apr/22 02:12
    Worklog Time Spent: 10m 
      Work Description: tomscut commented on PR #4219:
URL: https://github.com/apache/hadoop/pull/4219#issuecomment-1113892494

   > This seems right to me, but I don't fully understand what went wrong to 
cause the error. Can you explain more fully? Why did we previously make the 
assumption that `INVALID_TXID` meant in-progress, and what has changed to make 
that not true / what happened in your specific scenario to cause that not to be 
true?
   
   Thank you very much for your review, @xkrogen.
   
   After introducing [SBN READ], we updated the configuration: 
`dfs.ha.tail-edits.in-progress=true`.
   
   Then, when we run `bootstrapStandby`, we encounter something like this:
   1. We need to start an Observer Namenode, so we execute `bootstrapStandby` 
before starting it. This automatically pulls the latest FSImage from the 
Active Namenode and checks whether the edits in the journals have a gap, based 
on the `lastTxid` of the FSImage.
   
   2. Assume that the txid of the latest FSImage is x, and the edit logs from x 
in the journals are in the `InProgress` state. `FSEditLog#checkForGaps` will be 
skipped, because the `lastTxid` of the InProgress EditLogInputStream is not 
`HdfsServerConstants.INVALID_TXID`, but a specific number.
   
   3. However, between x and the txid currently being written there is a 
finalized edit log, and `bootstrapStandby` can execute normally.
   
   The `lastTxId` of an InProgress EditLogInputStream isn't always 
`HdfsServerConstants.INVALID_TXID`; it can also be a specific number.
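
   To make the difference concrete, here is a minimal, self-contained sketch of 
a gap check. It is not the actual `FSEditLog#checkForGaps` code: the `Stream` 
class, the `useIsInProgress` switch, the helper names, the txid values and the 
sentinel value are all assumptions made up for illustration. It only shows why 
detecting an in-progress stream by comparing `lastTxId` with `INVALID_TXID` can 
report a gap, while asking the stream whether it is in progress does not.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class GapCheckSketch {
    // Stand-in for HdfsServerConstants.INVALID_TXID; the value is an assumption.
    static final long INVALID_TXID = -12345L;

    /** Hypothetical, simplified stand-in for an EditLogInputStream. */
    static final class Stream {
        final long firstTxId;
        final long lastTxId;
        final boolean inProgress;

        Stream(long firstTxId, long lastTxId, boolean inProgress) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
            this.inProgress = inProgress;
        }
    }

    /**
     * Returns normally if the streams can cover [fromTxId, toAtLeastTxId],
     * and throws IOException otherwise. useIsInProgress selects how an
     * open-ended (in-progress) stream is recognised.
     */
    static void checkForGaps(List<Stream> streams, long fromTxId,
                             long toAtLeastTxId, boolean useIsInProgress)
            throws IOException {
        long txId = fromTxId;
        for (Stream s : streams) {
            if (txId > toAtLeastTxId) {
                return; // required range is already covered
            }
            if (s.firstTxId > txId) {
                break; // hole before this stream starts
            }
            boolean openEnded = useIsInProgress
                    ? s.inProgress                 // ask the stream directly
                    : s.lastTxId == INVALID_TXID;  // rely on the sentinel
            if (openEnded) {
                return; // end is unknown, so it may reach toAtLeastTxId
            }
            txId = s.lastTxId + 1;
        }
        if (txId > toAtLeastTxId) {
            return;
        }
        throw new IOException("Gap in transactions: expected to read up to txid "
                + toAtLeastTxId + " but found nothing containing txid " + txId);
    }

    /** Convenience wrapper: true if the check reports a gap. */
    static boolean reportsGap(List<Stream> streams, long fromTxId,
                              long toAtLeastTxId, boolean useIsInProgress) {
        try {
            checkForGaps(streams, fromTxId, toAtLeastTxId, useIsInProgress);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    /** One finalized segment, then an in-progress one with a concrete lastTxId. */
    static List<Stream> exampleStreams() {
        return Arrays.asList(
                new Stream(101, 150, false),
                new Stream(151, 160, true));
    }

    public static void main(String[] args) {
        List<Stream> streams = exampleStreams();
        // FSImage at txid 100, writer currently at txid 200 (made-up numbers).
        System.out.println("sentinel check reports gap: "
                + reportsGap(streams, 101, 200, false));
        System.out.println("isInProgress check reports gap: "
                + reportsGap(streams, 101, 200, true));
    }
}
```

   With these made-up numbers the sentinel-based variant treats the in-progress 
stream as if it ended at txid 160 and reports a gap, whereas the 
`isInProgress`-based variant accepts it.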




Issue Time Tracking
-------------------

    Worklog Id:     (was: 764587)
    Time Spent: 1h 10m  (was: 1h)

> BootstrapStandby failed because of checking gap for inprogress 
> EditLogInputStream
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16557
>                 URL: https://issues.apache.org/jira/browse/HDFS-16557
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Tao Li
>            Assignee: Tao Li
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2022-04-22-17-17-14-577.png, 
> image-2022-04-22-17-17-14-618.png, image-2022-04-22-17-17-23-113.png, 
> image-2022-04-22-17-17-32-487.png
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The lastTxId of an in-progress EditLogInputStream isn't necessarily 
> HdfsServerConstants.INVALID_TXID. We can determine its status directly with 
> EditLogInputStream#isInProgress.
> We introduced [SBN READ] and set 
> {{dfs.ha.tail-edits.in-progress=true}}. Then, during 
> bootstrapStandby, the in-progress EditLogInputStream is misjudged, 
> resulting in a gap check failure, which causes bootstrapStandby to fail.
> hdfs namenode -bootstrapStandby
> !image-2022-04-22-17-17-32-487.png|width=766,height=161!
> !image-2022-04-22-17-17-14-577.png|width=598,height=187!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

