[ https://issues.apache.org/jira/browse/HDFS-14500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liuguanghua updated HDFS-14500:
-------------------------------
    Description: 
When testing out a cluster with the edit log tailing fast path feature enabled 
(HDFS-13150), an unrelated issue caused the NameNode to remain in safe mode for 
an extended period of time, preventing the NameNode from fully completing its 
startup sequence. We noticed that the Startup Progress web UI displayed many 
edit log segments (millions of them).

I traced this problem back to {{StartupProgress}}. Within 
{{FSEditLogLoader}}, the loader tries to update the startup progress with a 
new {{Step}} every time it loads edits. Per the Javadoc for 
{{StartupProgress}}, this should be a no-op once startup is completed:
{code:java|title=StartupProgress.java}
 * After startup completes, the tracked data is frozen.  Any subsequent updates
 * or counter increments are no-ops.
{code}
However, {{StartupProgress}} only applies that logic once the _entire_ 
startup sequence has completed. When {{FSEditLogLoader}} calls 
{{beginStep()}}, the step is added to the {{LOADING_EDITS}} phase:
{code:java|title=FSEditLogLoader.java}
    StartupProgress prog = NameNode.getStartupProgress();
    Step step = createStartupProgressStep(edits);
    prog.beginStep(Phase.LOADING_EDITS, step);
{code}
This phase, in our case, ended long before, so it is nonsensical to continue to 
add steps to it. I believe it is a bug that {{StartupProgress}} accepts such 
steps instead of ignoring them; once a phase is complete, it should no longer 
change.
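
To illustrate the behavior I would expect, here is a minimal, self-contained sketch of a per-phase guard. It is not the Hadoop implementation: {{PhaseGuardSketch}}, its string-based steps, and {{stepCount()}} are hypothetical stand-ins, and only the method names {{beginPhase}}, {{endPhase}}, and {{beginStep}} are meant to mirror the calls discussed above. The idea is simply that a step is dropped when its own phase is already complete, rather than only when the whole startup sequence is complete.

```java
import java.util.EnumMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for StartupProgress (not the Hadoop class). */
final class PhaseGuardSketch {
  enum Phase { LOADING_FSIMAGE, LOADING_EDITS, SAVING_CHECKPOINT, SAFEMODE }
  enum Status { PENDING, RUNNING, COMPLETE }

  private final Map<Phase, Status> status = new EnumMap<>(Phase.class);
  private final Map<Phase, Set<String>> steps = new EnumMap<>(Phase.class);

  PhaseGuardSketch() {
    for (Phase p : Phase.values()) {
      status.put(p, Status.PENDING);
      steps.put(p, new LinkedHashSet<>());
    }
  }

  synchronized void beginPhase(Phase p) {
    // A completed phase is frozen and cannot be restarted.
    if (status.get(p) != Status.COMPLETE) {
      status.put(p, Status.RUNNING);
    }
  }

  synchronized void endPhase(Phase p) {
    status.put(p, Status.COMPLETE);
  }

  synchronized void beginStep(Phase p, String step) {
    // Per-phase guard: ignore steps for a phase that has already ended,
    // even if later phases (e.g. SAFEMODE) are still running.
    if (status.get(p) == Status.COMPLETE) {
      return;
    }
    steps.get(p).add(step);
  }

  synchronized int stepCount(Phase p) {
    return steps.get(p).size();
  }

  public static void main(String[] args) {
    PhaseGuardSketch prog = new PhaseGuardSketch();
    prog.beginPhase(Phase.LOADING_EDITS);
    prog.beginStep(Phase.LOADING_EDITS, "segment-1");
    prog.endPhase(Phase.LOADING_EDITS);

    // Startup as a whole is not finished: SAFEMODE is still running.
    prog.beginPhase(Phase.SAFEMODE);

    // Edits tailed after the phase ended must not add new steps.
    prog.beginStep(Phase.LOADING_EDITS, "segment-2");
    System.out.println(prog.stepCount(Phase.LOADING_EDITS)); // prints 1
  }
}
```

With a guard like this, a NameNode stuck in safe mode would keep showing only the segments loaded during the real {{LOADING_EDITS}} phase, instead of accumulating a new step for every tailed segment.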


> NameNode StartupProgress continues to report edit log segments after the 
> LOADING_EDITS phase is finished
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-14500
>                 URL: https://issues.apache.org/jira/browse/HDFS-14500
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.2.0, 2.9.2, 3.0.3, 2.8.5, 3.1.2
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>             Fix For: 2.10.0, 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
>         Attachments: HDFS-14500-branch-2.001.patch, HDFS-14500.000.patch, 
> HDFS-14500.001.patch
>


