[
https://issues.apache.org/jira/browse/HDFS-16950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17700860#comment-17700860
]
Karthik Palanisamy commented on HDFS-16950:
-------------------------------------------
For example:
NN meta dir:
{code:java}
-rw-r--r-- 1 hdfs hdfs 18K Mar 14 23:51 fsimage_0000000000000003493
-rw-r--r-- 1 hdfs hdfs 62 Mar 14 23:51 fsimage_0000000000000003493.md5
-rw-r--r-- 1 hdfs hdfs 193 Mar 14 23:51 VERSION
-rw-r--r-- 1 hdfs hdfs 1.0M Mar 15 00:06 edits_0000000000000003494-0000000000000003670
-rw-r--r-- 1 hdfs hdfs 2.3K Mar 15 00:13 edits_0000000000000003671-0000000000000003689
-rw-r--r-- 1 hdfs hdfs 1.0M Mar 15 00:14 edits_0000000000000003690-0000000000000003696
-rw-r--r-- 1 hdfs hdfs 2.3K Mar 15 00:18 edits_0000000000000003697-0000000000000003718
-rw-r--r-- 1 hdfs hdfs 5 Mar 15 00:18 seen_txid
-rw-r--r-- 1 hdfs hdfs 1.0M Mar 15 00:18 edits_inprogress_0000000000000003719
{code}
A JN format was then issued, which removed all the edits in the JN meta dir:
{code:java}
2023-03-15 00:22:02,321 INFO [main] common.Storage (Storage.java:clearDirectory(442)) - Will remove files:
[/data/dfs/jn/current/edits_0000000000000003337-0000000000000003487,
/data/dfs/jn/current/seen_txid,
/data/dfs/jn/current/edits_0000000000000003488-0000000000000003489,
/data/dfs/jn/current/VERSION,
/data/dfs/jn/current/edits_0000000000000003490-0000000000000003491,
/data/dfs/jn/current/edits_0000000000000003492-0000000000000003493,
/data/dfs/jn/current/edits_0000000000000003494-0000000000000003670,
/data/dfs/jn/current/edits_0000000000000003697-0000000000000003718,
/data/dfs/jn/current/edits_inprogress_0000000000000003719]
{code}
In the end, the JN created a new log segment starting from edits_inprogress:
{code:java}
(FileJournalManager.java:finalizeLogSegment(145)) - Finalizing edits file
/data/dfs/jn/current/edits_inprogress_0000000000000003719 ->
/data/dfs/jn/current/edits_0000000000000003719-0000000000000003736
{code}
So we lost the transactions between the fsimage (txid 3493) and edits_inprogress (txid 3719), resulting in an edit gap.
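
To make the gap concrete, here is a minimal Python sketch (not Hadoop code) that checks whether a set of finalized edit segments continues directly after an fsimage txid. The names parse_segments and find_gap are hypothetical, and the check only looks at file names of the form edits_<first>-<last>; it assumes nothing about how the NameNode actually validates its edit log.

```python
import re

# Finalized segments are named edits_<first_txid>-<last_txid>.
# edits_inprogress_*, seen_txid and VERSION do not match and are skipped.
SEGMENT_RE = re.compile(r"^edits_(\d+)-(\d+)$")


def parse_segments(names):
    """Return sorted (first_txid, last_txid) pairs for finalized segments."""
    segments = []
    for name in names:
        match = SEGMENT_RE.match(name)
        if match:
            segments.append((int(match.group(1)), int(match.group(2))))
    return sorted(segments)


def find_gap(fsimage_txid, segments):
    """Return (expected_txid, found_txid) for the first gap, or None."""
    expected = fsimage_txid + 1
    for first, last in sorted(segments):
        if last < expected:
            continue  # segment is fully covered by the fsimage
        if first > expected:
            return (expected, first)  # transactions expected..first-1 are missing
        expected = last + 1
    return None


# JN contents after the format, taken from the finalize log line above.
jn_after_format = parse_segments(
    ["edits_0000000000000003719-0000000000000003736"]
)
print(find_gap(3493, jn_after_format))  # (3494, 3719)
```

With the NN meta dir listing above (segments 3494-3670 through 3697-3718), the same check finds no gap, which is why the problem only shows up on the JN side after the format.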
> Gap in edits after -initializeSharedEdits
> -----------------------------------------
>
> Key: HDFS-16950
> URL: https://issues.apache.org/jira/browse/HDFS-16950
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: journal-node, namenode
> Reporter: Karthik Palanisamy
> Priority: Major
>
> The NameNode failed to start in the production cluster when the JN role was migrated.
> {code:java}
> ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start
> namenode.
> java.io.IOException: There appears to be a gap in the edit log. We expected
> txid xxxxxx, but got txid xxxxxx. {code}
> InitializeSharedEdits was issued as part of the role migration step. Note that
> no checkpoint had been performed in the past few hours.
> InitializeSharedEdits created a new log segment from the edits_inprogress
> transaction and deleted all older transactions.
> My ask here is to delete only edit transactions older than the fsimage
> transaction. Currently, it deletes all transactions and no such check is
> enforced in JNStorage#format().
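
The retention rule requested above can be sketched in Python as follows. This is only an illustration of the proposed check, not the behavior of JNStorage#format() and not a patch. The helper name split_for_format is hypothetical, and it assumes that a segment is safe to delete only when its last txid is at or below the fsimage txid.

```python
def split_for_format(fsimage_txid, segments):
    """Partition (first_txid, last_txid) segments into (deletable, must_keep).

    Assumption: a segment may be deleted only if every transaction in it
    is already covered by the fsimage, i.e. last_txid <= fsimage_txid.
    """
    deletable = [seg for seg in segments if seg[1] <= fsimage_txid]
    must_keep = [seg for seg in segments if seg[1] > fsimage_txid]
    return deletable, must_keep


# Finalized segments listed in the JN "Will remove files" log line.
jn_segments = [
    (3337, 3487),
    (3488, 3489),
    (3490, 3491),
    (3492, 3493),
    (3494, 3670),
    (3697, 3718),
]

deletable, must_keep = split_for_format(3493, jn_segments)
print(deletable)  # [(3337, 3487), (3488, 3489), (3490, 3491), (3492, 3493)]
print(must_keep)  # [(3494, 3670), (3697, 3718)]
```

Under this rule the segments from txid 3494 onward would have survived the format, so the NameNode could still replay from the fsimage at txid 3493.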