[
https://issues.apache.org/jira/browse/HDFS-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ke Han updated HDFS-17163:
--------------------------
Description:
When I performed a full-stop upgrade from 2.10.2 to 3.3.6, I noticed the
following error message:
*2023-08-17 10:43:11,665 ERROR org.apache.hadoop.hdfs.server.common.Storage:
Error reported on storage directory Storage Directory
/tmp/hadoop-root/dfs/namesecondary*
{code:java}
2023-08-17 10:43:11,407 INFO
org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Image file
/tmp/hadoop-root/dfs/namesecondary/current/fsimage.ckpt_0000000000000000020 of
size 340 bytes saved in 0 seconds .
2023-08-17 10:43:11,427 ERROR
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: RECEIVED SIGNAL 15:
SIGTERM
2023-08-17 10:43:11,434 INFO org.apache.hadoop.hdfs.server.namenode.FSImage:
FSImageSaver clean checkpoint: txid = 20 when meet shutdown.
2023-08-17 10:43:11,434 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down SecondaryNameNode at 5371b4aeefe1/192.168.78.3
************************************************************/
2023-08-17 10:43:11,663 WARN org.apache.hadoop.hdfs.server.namenode.FSImage:
Unable to rename checkpoint in Storage Directory
/tmp/hadoop-root/dfs/namesecondary
java.io.IOException: renaming
/tmp/hadoop-root/dfs/namesecondary/current/fsimage.ckpt_0000000000000000020 to
/tmp/hadoop-root/dfs/namesecondary/current/fsimage_0000000000000000020 FAILED
at
org.apache.hadoop.hdfs.server.namenode.FSImage.renameImageFileInDir(FSImage.java:1329)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.renameCheckpoint(FSImage.java:1263)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1224)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1172)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1105)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:563)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:360)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:325)
at
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:321)
at java.lang.Thread.run(Thread.java:750)
2023-08-17 10:43:11,665 ERROR org.apache.hadoop.hdfs.server.common.Storage:
Error reported on storage directory Storage Directory
/tmp/hadoop-root/dfs/namesecondary
2023-08-17 10:43:11,665 WARN org.apache.hadoop.hdfs.server.common.Storage:
About to remove corresponding storage: /tmp/hadoop-root/dfs/namesecondary {code}
The cluster I am using has four nodes: 1 NN, 1 SNN, and 2 DNs. The upgrade
order is: (1) stop SNN, (2) stop NN, (3) stop DN1 and DN2. The error message
occurs on the SNN while it is stopping.
The command sequence I executed and the configurations are attached. I tried
to reproduce the error with the same command sequence, but it did not recur in
two thousand runs of the command sequence plus upgrade. It might depend on
specific timing. I am not sure whether this could affect data integrity.
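One possible reading of the log, stated here as an assumption and not a confirmed root cause: the SIGTERM arrived while doCheckpoint was still running, the shutdown cleanup ("FSImageSaver clean checkpoint") deleted fsimage.ckpt_0000000000000000020, and the later rename failed because its source file was already gone. The sketch below reproduces only that file-level effect with plain java.io calls. The class and method names are hypothetical and it does not use any Hadoop code.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

public class CheckpointRenameRace {

    // Hypothetical stand-in for the rename step: move the temporary
    // checkpoint file to its final name and report success.
    static boolean renameCheckpoint(File ckpt, File finalImage) {
        return ckpt.renameTo(finalImage);
    }

    // Runs the assumed interleaving sequentially:
    // save checkpoint -> shutdown cleanup deletes it -> rename is attempted.
    // Returns {deleted, renamed}.
    static boolean[] simulate() {
        try {
            File dir = Files.createTempDirectory("namesecondary").toFile();
            File ckpt = new File(dir, "fsimage.ckpt_0000000000000000020");
            File finalImage = new File(dir, "fsimage_0000000000000000020");

            // Step 1: the checkpoint is saved (340 bytes, as in the log).
            Files.write(ckpt.toPath(), new byte[340]);

            // Step 2 (assumption): cleanup on shutdown removes the
            // unfinished checkpoint file.
            boolean deleted = ckpt.delete();

            // Step 3: the checkpoint thread reaches the rename afterwards.
            boolean renamed = renameCheckpoint(ckpt, finalImage);

            dir.delete(); // the directory is empty at this point
            return new boolean[] {deleted, renamed};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = simulate();
        System.out.println("deleted=" + r[0] + " renamed=" + r[1]);
    }
}
```

If this reading is right, the rename failure would be a side effect of the shutdown cleanup and not a sign of a damaged image, but that still needs to be confirmed against the FSImage code.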
== Command Sequence ==
{code:java}
// Start up cluster
bin/hdfs dfsadmin -safemode enter
bin/hdfs dfsadmin -rollingUpgrade prepare
bin/hdfs dfsadmin -safemode leave
// Execute commands
dfs -mkdir /ymlAOGQU
dfs -mkdir /ymlAOGQU/xXVm
dfs -touchz /ymlAOGQU/xXVm/xXVm.xml
dfs -mv /ymlAOGQU/xXVm /ymlAOGQU/
dfs -setacl -k -m acl / --set acl2 /
dfsadmin -saveNamespace
dfs -touchz /ymlAOGQU/xXVm.txt
dfs -put /tmp/upfuzz/hdfs/orDmixfM/D /ymlAOGQU/
dfs -rm -f -R -safely -skipTrash /ymlAOGQU/
dfsadmin -report -live -enteringmaintenance -inmaintenance
dfsadmin -saveNamespace
dfsadmin -report -dead -enteringmaintenance
dfsadmin -rollEdits
dfsadmin -refreshNodes
// stop SNN
// stop NN
// stop DN1&DN2{code}
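As a rough aid for the integrity question, the following sketch lists checkpoint transaction IDs that have a temporary fsimage.ckpt_<txid> file but no finalized fsimage_<txid> file in a storage directory. The file-name patterns are taken from the log above. The class name, method name, and default path are assumptions for illustration. It only detects leftover temporary files and does not validate image contents.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CheckpointDirCheck {

    // Name patterns as they appear in the log; other files are ignored.
    private static final Pattern CKPT = Pattern.compile("fsimage\\.ckpt_(\\d+)");
    private static final Pattern FINAL = Pattern.compile("fsimage_(\\d+)");

    // Returns the txids (sorted) that have a temporary checkpoint file
    // but no finalized image file with the same txid.
    static List<String> unfinished(Collection<String> fileNames) {
        Set<String> temporary = new TreeSet<>();
        Set<String> finalized = new HashSet<>();
        for (String name : fileNames) {
            Matcher m = CKPT.matcher(name);
            if (m.matches()) {
                temporary.add(m.group(1));
                continue;
            }
            m = FINAL.matcher(name);
            if (m.matches()) {
                finalized.add(m.group(1));
            }
        }
        temporary.removeAll(finalized);
        return new ArrayList<>(temporary);
    }

    public static void main(String[] args) {
        // Default path is the SNN directory from the log plus "current".
        File dir = new File(args.length > 0
                ? args[0]
                : "/tmp/hadoop-root/dfs/namesecondary/current");
        String[] names = dir.list();
        if (names == null) {
            System.err.println("Cannot list " + dir);
            return;
        }
        System.out.println("Unfinished checkpoints: "
                + unfinished(Arrays.asList(names)));
    }
}
```

An empty result after the failed shutdown would suggest that no half-written checkpoint was left behind on the SNN. The NN's own storage directories would need to be checked separately.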
> ERROR Log Message when upgrading from 2.10.2 to 3.3.6
> -----------------------------------------------------
>
> Key: HDFS-17163
> URL: https://issues.apache.org/jira/browse/HDFS-17163
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.10.2
> Reporter: Ke Han
> Priority: Major
> Attachments: core-site.xml, hdfs-site.xml, log.tar.gz, orDmixfM.tar.gz