[ 
https://issues.apache.org/jira/browse/HDFS-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Han updated HDFS-17163:
--------------------------
    Description: 
When I performed a full-stop upgrade from 2.10.2 to 3.3.6, I noticed the 
following error message:

*2023-08-17 10:43:11,665 ERROR org.apache.hadoop.hdfs.server.common.Storage: 
Error reported on storage directory Storage Directory 
/tmp/hadoop-root/dfs/namesecondary*
{code:java}
2023-08-17 10:43:11,407 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-root/dfs/namesecondary/current/fsimage.ckpt_0000000000000000020 of size 340 bytes saved in 0 seconds .
2023-08-17 10:43:11,427 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: RECEIVED SIGNAL 15: SIGTERM
2023-08-17 10:43:11,434 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: FSImageSaver clean checkpoint: txid = 20 when meet shutdown.
2023-08-17 10:43:11,434 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down SecondaryNameNode at 5371b4aeefe1/192.168.78.3
************************************************************/
2023-08-17 10:43:11,663 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Unable to rename checkpoint in Storage Directory /tmp/hadoop-root/dfs/namesecondary
java.io.IOException: renaming  /tmp/hadoop-root/dfs/namesecondary/current/fsimage.ckpt_0000000000000000020 to /tmp/hadoop-root/dfs/namesecondary/current/fsimage_0000000000000000020 FAILED
        at org.apache.hadoop.hdfs.server.namenode.FSImage.renameImageFileInDir(FSImage.java:1329)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.renameCheckpoint(FSImage.java:1263)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1224)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1172)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1105)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:563)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:360)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:325)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:321)
        at java.lang.Thread.run(Thread.java:750)
2023-08-17 10:43:11,665 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory /tmp/hadoop-root/dfs/namesecondary
2023-08-17 10:43:11,665 WARN org.apache.hadoop.hdfs.server.common.Storage: About to remove corresponding storage: /tmp/hadoop-root/dfs/namesecondary
{code}
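
To make the ordering of events in the log above easier to follow, here is a 
small sketch that computes the gaps between its key timestamps. Only the 
timestamps are taken from the log; the class name CheckpointTimeline and the 
method millisBetween are hypothetical helpers, not part of Hadoop.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not part of Hadoop): measures gaps between log timestamps.
class CheckpointTimeline {
    // Timestamp layout used by the log lines above, e.g. "2023-08-17 10:43:11,407".
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the number of milliseconds from timestamp a to timestamp b.
    static long millisBetween(String a, String b) {
        LocalDateTime start = LocalDateTime.parse(a, FMT);
        LocalDateTime end = LocalDateTime.parse(b, FMT);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        String saved = "2023-08-17 10:43:11,407";      // fsimage.ckpt saved
        String sigterm = "2023-08-17 10:43:11,427";    // SIGTERM received
        String cleaned = "2023-08-17 10:43:11,434";    // "clean checkpoint ... when meet shutdown"
        String renameFail = "2023-08-17 10:43:11,663"; // rename reported as FAILED

        System.out.println("save -> SIGTERM:        " + millisBetween(saved, sigterm) + " ms");
        System.out.println("SIGTERM -> cleanup:     " + millisBetween(sigterm, cleaned) + " ms");
        System.out.println("cleanup -> rename fail: " + millisBetween(cleaned, renameFail) + " ms");
    }
}
```

The gaps are 20 ms, 7 ms, and 229 ms: the SIGTERM arrived right after the 
checkpoint image was written, and the rename failure was logged after the 
shutdown-time cleanup message.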
 

The cluster has four nodes: 1 NN, 1 SNN, and 2 DNs. The shutdown order for the 
upgrade is: (1) stop SNN, (2) stop NN, (3) stop DN1 and DN2. The error message 
is logged by the SNN while it is stopping.

The command sequence I executed is listed below, and the configuration files 
are attached. I tried to reproduce the error with the same command sequence, 
but could not, even after repeating the command sequence plus upgrade two 
thousand times. Reproducing it may require specific timing. I am not sure 
whether this could affect data integrity. 
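
One possible reading of the log, which is an assumption and not confirmed 
anywhere in this report, is that the shutdown-time cleanup ("FSImageSaver clean 
checkpoint ... when meet shutdown") removed the fsimage.ckpt file before the 
checkpoint thread tried to rename it. The sketch below uses plain java.io only, 
not Hadoop code, to show that File.renameTo returns false when the source file 
has already been deleted. The class CheckpointRenameRace and its methods are 
hypothetical, and the two steps run sequentially to stand in for the suspected 
thread interleaving.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical illustration (not Hadoop code) of the suspected interleaving.
class CheckpointRenameRace {
    // Stand-in for the last checkpoint step: promote fsimage.ckpt_N to fsimage_N.
    static boolean promote(File ckpt, File finalImage) {
        return ckpt.renameTo(finalImage); // returns false if ckpt no longer exists
    }

    // Writes a dummy checkpoint file, optionally deletes it first (the assumed
    // shutdown cleanup), then attempts the rename and reports whether it worked.
    static boolean simulate(boolean cleanupRunsFirst) throws IOException {
        File dir = Files.createTempDirectory("namesecondary").toFile();
        File ckpt = new File(dir, "fsimage.ckpt_0000000000000000020");
        File finalImage = new File(dir, "fsimage_0000000000000000020");
        Files.write(ckpt.toPath(), new byte[340]); // same size as in the log

        if (cleanupRunsFirst) {
            // Assumed behavior: shutdown cleanup deletes the in-progress checkpoint.
            Files.deleteIfExists(ckpt.toPath());
        }
        boolean renamed = promote(ckpt, finalImage);

        // Remove temporary files so repeated runs leave nothing behind.
        finalImage.delete();
        ckpt.delete();
        dir.delete();
        return renamed;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("rename, no cleanup:    " + simulate(false));
        System.out.println("rename, after cleanup: " + simulate(true));
    }
}
```

If this reading is right, the failure only appears when SIGTERM lands in the 
short window between saving the checkpoint image and renaming it, which would 
be consistent with the difficulty of reproducing it.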

== Command Sequence ==
{code:java}
// Start up cluster (2.10.2), 4 nodes
bin/hdfs dfsadmin -safemode enter
bin/hdfs dfsadmin -rollingUpgrade prepare
bin/hdfs dfsadmin -safemode leave

// Execute commands
dfs -mkdir /fHPXyTkv
dfs -put -f -p  /tmp/upfuzz/hdfs/XPkJEWYY/kPCH /fHPXyTkv/
dfs -put  -p -d /tmp/upfuzz/hdfs/XPkJEWYY/HdM /fHPXyTkv/kPCH/xoflDHK/lJ
dfsadmin -report -live  -decommissioning
dfsadmin -setSpaceQuota 1 -storageType ARCHIVE /fHPXyTkv/kPCH/xoflDHK/Ykc/AP
dfs -mkdir /fHPXyTkv/kPCH/xoflDHK/lJ/ozidF
dfs -mv /fHPXyTkv/kPCH/xoflDHK/Ykc /fHPXyTkv/kPCH/xoflDHK/lJ
dfs -mv /fHPXyTkv/kPCH/xoflDHK/lJ/AP /fHPXyTkv/kPCH/xoflDHK/eaSvvJyzZT/lL
dfsadmin -report  -dead -decommissioning -enteringmaintenance
dfsadmin -refreshNodes
dfs -mkdir /fHPXyTkv/kPCH/xoflDHK/lJ/ozidF/SpdyMzpNXmVEL
dfs -setacl  -k -m acl /kPCH/xoflDHK/lJ/ozidF --set acl2 /kPCH/xoflDHK/eaSvvJyzZT/lL
dfsadmin -refreshNodes
dfsadmin -setSpaceQuota 85 -storageType PROVIDED /fHPXyTkv/kPCH/mduNyG
dfsadmin -saveNamespace
dfs -put -f -p -d /tmp/upfuzz/hdfs/XPkJEWYY/kPCH /fHPXyTkv/kPCH
dfsadmin -saveNamespace
dfs -mv /fHPXyTkv/kPCH/mduNyG/VZc /fHPXyTkv/kPCH/xoflDHK/Ykc/AP
dfsadmin -setSpaceQuota 85 -storageType PROVIDED /fHPXyTkv/kPCH/xoflDHK/eaSvvJyzZT/lL
dfs -put -f -p -d /tmp/upfuzz/hdfs/XPkJEWYY/kPCH /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc
dfsadmin -report  -dead  -enteringmaintenance -inmaintenance
dfsadmin -setSpaceQuota 1 -storageType SSD /fHPXyTkv/kPCH/xoflDHK/JgKqDE
dfs -put -f   /tmp/upfuzz/hdfs/XPkJEWYY/HdM /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/kPCH/mduNyG/VZc
dfsadmin -rollEdits
dfs -cat  /fHPXyTkv/kPCH/kPCH/mduNyG/YPZ
dfs -ls  -d  -q  -S -r  /fHPXyTkv/kPCH
dfs -ls  -d  -q -t -S   /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/kPCH/xoflDHK/Ykc/AP
dfs -cat  /fHPXyTkv/kPCH/xoflDHK/lJ/HdM
dfs -cat -ignoreCrc /fHPXyTkv/kPCH/mduNyG/YPZ
dfs -cat -ignoreCrc /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/kPCH/mduNyG/YPZ
dfs -ls -C  -h -q   -r  /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/AP
dfs -cat -ignoreCrc /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/eJBcmWE
dfs -count -h -v -t DISK /fHPXyTkv/kPCH/kPCH/xoflDHK
dfs -count -q -h -x -u /fHPXyTkv/kPCH/xoflDHK/lJ
dfs -count -q /fHPXyTkv/kPCH/xoflDHK
dfs -cat  /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/eJBcmWE
dfs -ls    -q -t    /fHPXyTkv/kPCH/kPCH
dfs -cat  /fHPXyTkv/kPCH/mduNyG/YPZ
dfs -cat -ignoreCrc /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/kPCH/mduNyG/VZc/HdM
// stop SNN
// stop NN
// stop DN1&DN2{code}

  was:
When I performed a full-stop upgrade from 2.10.2 to 3.3.6, I noticed the 
following error message:

*2023-08-17 10:43:11,665 ERROR org.apache.hadoop.hdfs.server.common.Storage: 
Error reported on storage directory Storage Directory 
/tmp/hadoop-root/dfs/namesecondary*
{code:java}
2023-08-17 10:43:11,407 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-root/dfs/namesecondary/current/fsimage.ckpt_0000000000000000020 of size 340 bytes saved in 0 seconds .
2023-08-17 10:43:11,427 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: RECEIVED SIGNAL 15: SIGTERM
2023-08-17 10:43:11,434 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: FSImageSaver clean checkpoint: txid = 20 when meet shutdown.
2023-08-17 10:43:11,434 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down SecondaryNameNode at 5371b4aeefe1/192.168.78.3
************************************************************/
2023-08-17 10:43:11,663 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Unable to rename checkpoint in Storage Directory /tmp/hadoop-root/dfs/namesecondary
java.io.IOException: renaming  /tmp/hadoop-root/dfs/namesecondary/current/fsimage.ckpt_0000000000000000020 to /tmp/hadoop-root/dfs/namesecondary/current/fsimage_0000000000000000020 FAILED
        at org.apache.hadoop.hdfs.server.namenode.FSImage.renameImageFileInDir(FSImage.java:1329)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.renameCheckpoint(FSImage.java:1263)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1224)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1172)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1105)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:563)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:360)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:325)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:321)
        at java.lang.Thread.run(Thread.java:750)
2023-08-17 10:43:11,665 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory /tmp/hadoop-root/dfs/namesecondary
2023-08-17 10:43:11,665 WARN org.apache.hadoop.hdfs.server.common.Storage: About to remove corresponding storage: /tmp/hadoop-root/dfs/namesecondary
{code}
 

The cluster has four nodes: 1 NN, 1 SNN, and 2 DNs. The shutdown order for the 
upgrade is: (1) stop SNN, (2) stop NN, (3) stop DN1 and DN2. The error message 
is logged by the SNN while it is stopping.

The command sequence I executed is listed below, and the configuration files 
are attached. I tried to reproduce the error with the same command sequence, 
but could not, even after repeating the command sequence plus upgrade two 
thousand times. Reproducing it may require specific timing. I am not sure 
whether this could affect data integrity. 

== Command Sequence ==
{code:java}
// Start up cluster
bin/hdfs dfsadmin -safemode enter
bin/hdfs dfsadmin -rollingUpgrade prepare
bin/hdfs dfsadmin -safemode leave

// Execute commands
dfs -mkdir /ymlAOGQU
dfs -mkdir /ymlAOGQU/xXVm
dfs -touchz /ymlAOGQU/xXVm/xXVm.xml
dfs -mv /ymlAOGQU/xXVm /ymlAOGQU/
dfs -setacl  -k -m acl / --set acl2 /
dfsadmin -saveNamespace
dfs -touchz /ymlAOGQU/xXVm.txt
dfs -put    /tmp/upfuzz/hdfs/orDmixfM/D /ymlAOGQU/
dfs -rm -f -R -safely -skipTrash /ymlAOGQU/
dfsadmin -report -live   -enteringmaintenance -inmaintenance
dfsadmin -saveNamespace
dfsadmin -report  -dead  -enteringmaintenance
dfsadmin -rollEdits
dfsadmin -refreshNodes
// stop SNN
// stop NN
// stop DN1&DN2{code}


> ERROR Log Message when upgrading from 2.10.2 to 3.3.6
> -----------------------------------------------------
>
>                 Key: HDFS-17163
>                 URL: https://issues.apache.org/jira/browse/HDFS-17163
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.10.2
>            Reporter: Ke Han
>            Priority: Major
>         Attachments: core-site.xml, hdfs-site.xml, log.tar.gz, orDmixfM.tar.gz


