[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-05-11 Thread Yicong Cai (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104187#comment-17104187
 ] 

Yicong Cai commented on HDFS-15175:
---

Hi, [~wanchang]

We solved this problem by completely resetting the op object.
So far I have not been able to reproduce this problem with a UT, so I have 
not provided a patch for the fix. Do you have a UT case that reproduces this 
problem?
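
For illustration, a hedged sketch of what a complete reset looks like, using a 
simplified stand-in class rather than the real FSEditLogOp.AddCloseOp (all 
names here are illustrative, not the actual patch):

{code:java}
// Simplified stand-in for a recycled CloseOp (hypothetical, not Hadoop code).
// Op instances are reused, so reset() must clear every field; a field that
// survives reset (such as a shared Block[] reference) can leak mutable state
// between ops.
class CloseOpSketch {
  long inodeId;
  String path;
  long[] blockLengths;   // stand-in for the Block[] field

  void reset() {
    inodeId = 0;
    path = null;
    blockLengths = null; // drop any reference kept from the previous use
  }
}
{code}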

> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> 
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Critical
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ..
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> <RECORD>
>   <OPCODE>OP_REASSIGN_LEASE</OPCODE>
>   <DATA>
>     <TXID>32625021150</TXID>
>     <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
>     <PATH>..</PATH>
>     <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
>   </DATA>
> </RECORD>
> ..
> <RECORD>
>   <OPCODE>OP_CLOSE</OPCODE>
>   <DATA>
>     <TXID>32625023743</TXID>
>     <LENGTH>0</LENGTH>
>     <INODEID>0</INODEID>
>     <PATH>..</PATH>
>     <REPLICATION>3</REPLICATION>
>     <MTIME>1581816135883</MTIME>
>     <ATIME>1581814760398</ATIME>
>     <BLOCKSIZE>536870912</BLOCKSIZE>
>     <CLIENT_NAME></CLIENT_NAME>
>     <CLIENT_MACHINE></CLIENT_MACHINE>
>     <OVERWRITE>false</OVERWRITE>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818644</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>     <PERMISSION_STATUS>
>       <USERNAME>da_music</USERNAME>
>       <GROUPNAME>hdfs</GROUPNAME>
>       <MODE>416</MODE>
>     </PERMISSION_STATUS>
>   </DATA>
> </RECORD>
> ..
> <RECORD>
>   <OPCODE>OP_TRUNCATE</OPCODE>
>   <DATA>
>     <TXID>32625024049</TXID>
>     <SRC>..</SRC>
>     <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
>     <CLIENTMACHINE>..</CLIENTMACHINE>
>     <NEWLENGTH>185818644</NEWLENGTH>
>     <TIMESTAMP>1581816136336</TIMESTAMP>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818648</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>   </DATA>
> </RECORD>
> ..
> <RECORD>
>   <OPCODE>OP_CLOSE</OPCODE>
>   <DATA>
>     <TXID>32625024993</TXID>
>     <LENGTH>0</LENGTH>
>     <INODEID>0</INODEID>
>     <PATH>..</PATH>
>     <REPLICATION>3</REPLICATION>
>     <MTIME>1581816138774</MTIME>
>     <ATIME>1581814760398</ATIME>
>     <BLOCKSIZE>536870912</BLOCKSIZE>
>     <CLIENT_NAME></CLIENT_NAME>
>     <CLIENT_MACHINE></CLIENT_MACHINE>
>     <OVERWRITE>false</OVERWRITE>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818644</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>     <PERMISSION_STATUS>
>       <USERNAME>da_music</USERNAME>
>       <GROUPNAME>hdfs</GROUPNAME>
>       <MODE>416</MODE>
>     </PERMISSION_STATUS>
>   </DATA>
> </RECORD>
> {panel}
>  
>  
> The block size in the first CloseOp should be 185818648; after the truncate, 
> it becomes 185818644. The CloseOp/TruncateOp/CloseOp sequence is synchronized 
> to the JournalNode in the same batch, and both CloseOps reference the same 
> block instance, so the first CloseOp is serialized with the wrong block size. 
> When the SNN rolls and replays the edit log, TruncateOp does not put the file 
> into the UnderConstruction state, so when the second CloseOp is applied the 
> file is not under construction and the SNN crashes.






[jira] [Updated] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-02-17 Thread Yicong Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-15175:
--
Description: 
 
{panel:title=Crash exception}
2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
[length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
txid=32625024993]
 java.io.IOException: File is not under construction: ..
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:360)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
 at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
{panel}
 
{panel:title=Editlog}
<RECORD>
  <OPCODE>OP_REASSIGN_LEASE</OPCODE>
  <DATA>
    <TXID>32625021150</TXID>
    <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
    <PATH>..</PATH>
    <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
  </DATA>
</RECORD>
..
<RECORD>
  <OPCODE>OP_CLOSE</OPCODE>
  <DATA>
    <TXID>32625023743</TXID>
    <LENGTH>0</LENGTH>
    <INODEID>0</INODEID>
    <PATH>..</PATH>
    <REPLICATION>3</REPLICATION>
    <MTIME>1581816135883</MTIME>
    <ATIME>1581814760398</ATIME>
    <BLOCKSIZE>536870912</BLOCKSIZE>
    <CLIENT_NAME></CLIENT_NAME>
    <CLIENT_MACHINE></CLIENT_MACHINE>
    <OVERWRITE>false</OVERWRITE>
    <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818644</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
    </BLOCK>
    <PERMISSION_STATUS>
      <USERNAME>da_music</USERNAME>
      <GROUPNAME>hdfs</GROUPNAME>
      <MODE>416</MODE>
    </PERMISSION_STATUS>
  </DATA>
</RECORD>
..
<RECORD>
  <OPCODE>OP_TRUNCATE</OPCODE>
  <DATA>
    <TXID>32625024049</TXID>
    <SRC>..</SRC>
    <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
    <CLIENTMACHINE>..</CLIENTMACHINE>
    <NEWLENGTH>185818644</NEWLENGTH>
    <TIMESTAMP>1581816136336</TIMESTAMP>
    <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818648</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
    </BLOCK>
  </DATA>
</RECORD>
..
<RECORD>
  <OPCODE>OP_CLOSE</OPCODE>
  <DATA>
    <TXID>32625024993</TXID>
    <LENGTH>0</LENGTH>
    <INODEID>0</INODEID>
    <PATH>..</PATH>
    <REPLICATION>3</REPLICATION>
    <MTIME>1581816138774</MTIME>
    <ATIME>1581814760398</ATIME>
    <BLOCKSIZE>536870912</BLOCKSIZE>
    <CLIENT_NAME></CLIENT_NAME>
    <CLIENT_MACHINE></CLIENT_MACHINE>
    <OVERWRITE>false</OVERWRITE>
    <BLOCK>
      <BLOCK_ID>5568434562</BLOCK_ID>
      <NUM_BYTES>185818644</NUM_BYTES>
      <GENSTAMP>4495417845</GENSTAMP>
    </BLOCK>
    <PERMISSION_STATUS>
      <USERNAME>da_music</USERNAME>
      <GROUPNAME>hdfs</GROUPNAME>
      <MODE>416</MODE>
    </PERMISSION_STATUS>
  </DATA>
</RECORD>
{panel}
 

 

The block size in the first CloseOp should be 185818648; after the truncate, 
it becomes 185818644. The CloseOp/TruncateOp/CloseOp sequence is synchronized 
to the JournalNode in the same batch, and both CloseOps reference the same 
block instance, so the first CloseOp is serialized with the wrong block size. 
When the SNN rolls and replays the edit log, TruncateOp does not put the file 
into the UnderConstruction state, so when the second CloseOp is applied the 
file is not under construction and the SNN crashes.
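
To make the sharing concrete, here is a small self-contained demo of the 
aliasing with a plain stand-in class (this is not the Hadoop Block type; the 
numbers are the ones from the edit log above):

{code:java}
// Both queued CloseOps hold the same mutable instance, so the truncate's
// in-place mutation retroactively changes what the first CloseOp serializes.
class MutableBlock {
  long numBytes;
  MutableBlock(long numBytes) { this.numBytes = numBytes; }
}

public class SharedBlockDemo {
  public static void main(String[] args) {
    MutableBlock shared = new MutableBlock(185818648L); // size at first close
    MutableBlock[] firstCloseOp = { shared };  // first CloseOp keeps a reference
    shared.numBytes = 185818644L;              // truncate mutates the same instance
    MutableBlock[] secondCloseOp = { shared }; // second CloseOp, same instance
    System.out.println(firstCloseOp[0].numBytes);  // 185818644 (should be 185818648)
    System.out.println(secondCloseOp[0].numBytes); // 185818644
  }
}
{code}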

  was:
 
{panel:title=Crash exception}
2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
[length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
txid=32625024993]
java.io.IOException: File is not under construction: ..
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
at 

[jira] [Created] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-02-17 Thread Yicong Cai (Jira)
Yicong Cai created HDFS-15175:
-

 Summary: Multiple CloseOp shared block instance causes the standby 
namenode to crash when rolling editlog
 Key: HDFS-15175
 URL: https://issues.apache.org/jira/browse/HDFS-15175
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.9.2
Reporter: Yicong Cai
Assignee: Yicong Cai


 
{panel:title=Crash exception}
2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
[length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
txid=32625024993]
java.io.IOException: File is not under construction: ..
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
{panel}
 






[jira] [Commented] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-08-20 Thread Yicong Cai (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911074#comment-16911074
 ] 

Yicong Cai commented on HDFS-14311:
---

Thanks [~sodonnell] [~surendrasingh] [~jojochuang] for your attention and 
review of this issue.

It is very difficult to reproduce this with a UT; my attempts so far have 
failed. I have first fixed the checkstyle-related issues, and I will keep 
trying to reproduce the problem with a UT.
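
For context, the root cause quoted below boils down to a shared-state race; a 
hedged, self-contained sketch of that kind of race follows (a simplified 
stand-in, not the actual BlockPoolSliceStorage code):

{code:java}
// One layoutVersion field is shared by all storage dirs; each dir's
// transition both reads and writes it, so a dir's version check can observe
// a value written by another dir's upgrade thread.
class SharedStorage {
  int layoutVersion; // single instance shared by all storage dirs
}

public class LayoutVersionRace {
  public static void main(String[] args) throws InterruptedException {
    SharedStorage shared = new SharedStorage();
    Runnable transition = () -> {
      shared.layoutVersion = -57; // LV read from this dir's VERSION file
      // ... compare against the current DN version, decide to upgrade ...
      shared.layoutVersion = -63; // the upgrade sets the shared LV to current
    };
    Thread dir1 = new Thread(transition);
    Thread dir2 = new Thread(transition);
    dir1.start();
    dir2.start(); // dir2 may run its check while dir1 is mid-upgrade
    dir1.join();
    dir2.join();
  }
}
{code}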

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14311.1.patch, HDFS-14311.2.patch, 
> HDFS-14311.branch-2.1.patch
>
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> Root cause:
> The BlockPoolSliceStorage instance is shared across all storage locations 
> during the recover transition. In BlockPoolSliceStorage.doTransition, it reads the old 

[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-08-20 Thread Yicong Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14311:
--
Attachment: HDFS-14311.branch-2.1.patch

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14311.1.patch, HDFS-14311.2.patch, 
> HDFS-14311.branch-2.1.patch
>
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> Root cause:
> The BlockPoolSliceStorage instance is shared across all storage locations 
> during the recover transition. In BlockPoolSliceStorage.doTransition, it 
> reads the old layoutVersion from local storage, compares it with the current 
> DataNode version, and then performs the upgrade. doUpgrade runs the 
> transition work in a sub-thread, and the transition work sets the 
> BlockPoolSliceStorage's layoutVersion to the current DN version. The next 

[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-08-20 Thread Yicong Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14311:
--
Attachment: HDFS-14311.2.patch

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14311.1.patch, HDFS-14311.2.patch
>
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> Root cause:
> The BlockPoolSliceStorage instance is shared across all storage locations 
> during the recover transition. In BlockPoolSliceStorage.doTransition, it 
> reads the old layoutVersion from local storage, compares it with the current 
> DataNode version, and then performs the upgrade. doUpgrade runs the 
> transition work in a sub-thread, and the transition work sets the 
> BlockPoolSliceStorage's layoutVersion to the current DN version. The next 
> storage dir's transition check will then concurrently 

[jira] [Commented] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-08-13 Thread Yicong Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906309#comment-16906309
 ] 

Yicong Cai commented on HDFS-14311:
---

[~sodonnell] Thanks for your detailed reply. I will add the corresponding 
reproduction test cases and adjust the code format.

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14311.1.patch
>
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> Root cause:
> The BlockPoolSliceStorage instance is shared across all storage locations 
> during the recover transition. In BlockPoolSliceStorage.doTransition, it 
> reads the old layoutVersion from local storage, compares it with the current 
> DataNode version, and then performs the upgrade. doUpgrade runs the 
> transition work in a sub-thread, and the transition work will set 

[jira] [Commented] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-28 Thread Yicong Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875357#comment-16875357
 ] 

Yicong Cai commented on HDFS-14429:
---

[~jojochuang]

Before this fix, a block with decommissioning replicas would not be completed, 
so the redundancy check was never performed. After the fix, the redundancy 
check is performed and updateNeededReconstructions is called. Replicas in 
maintenance count as effective, but decommissioning replicas do not, so 
neededReconstruction.update can cause curReplicas to become negative.

 
{code:java}
// handle low redundancy/extra redundancy
short fileRedundancy = getExpectedRedundancyNum(storedBlock);
if (!isNeededReconstruction(storedBlock, num, pendingNum)) {
  neededReconstruction.remove(storedBlock, numCurrentReplica,
  num.readOnlyReplicas(), num.outOfServiceReplicas(), fileRedundancy);
} else {
  // Perform update
  updateNeededReconstructions(storedBlock, curReplicaDelta, 0);
}
{code}
{code:java}
if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) {
  neededReconstruction.update(block, repl.liveReplicas() + pendingNum,
  repl.readOnlyReplicas(), repl.outOfServiceReplicas(),
  curExpectedReplicas, curReplicasDelta, expectedReplicasDelta);
}
{code}
{code:java}
synchronized void update(BlockInfo block, int curReplicas,
    int readOnlyReplicas, int outOfServiceReplicas,
    int curExpectedReplicas,
    int curReplicasDelta, int expectedReplicasDelta) {
  // The subtraction below is where the value goes negative.
  int oldReplicas = curReplicas - curReplicasDelta;
  int oldExpectedReplicas = curExpectedReplicas - expectedReplicasDelta;
  int curPri = getPriority(block, curReplicas, readOnlyReplicas,
      outOfServiceReplicas, curExpectedReplicas);
  int oldPri = getPriority(block, oldReplicas, readOnlyReplicas,
      outOfServiceReplicas, oldExpectedReplicas);
  if (NameNode.stateChangeLog.isDebugEnabled()) {
    NameNode.stateChangeLog.debug("LowRedundancyBlocks.update " +
        block +
        " curReplicas " + curReplicas +
        " curExpectedReplicas " + curExpectedReplicas +
        " oldReplicas " + oldReplicas +
        " oldExpectedReplicas " + oldExpectedReplicas +
        " curPri " + curPri +
        " oldPri " + oldPri);
  }
  // oldPri is mostly correct, but not always. If not found with oldPri,
  // other levels will be searched until the block is found & removed.
  remove(block, oldPri, oldExpectedReplicas);
  if (add(block, curPri, curExpectedReplicas)) {
    NameNode.blockStateChangeLog.debug(
        "BLOCK* NameSystem.LowRedundancyBlock.update: {} has only {} "
            + "replicas and needs {} replicas so is added to "
            + "neededReconstructions at priority level {}",
        block, curReplicas, curExpectedReplicas, curPri);
  }
}
{code}
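
For concreteness, a worked example (illustrative numbers only) of how the 
subtraction in update() goes negative under the scenario above:

{code:java}
// With all three replicas decommissioning, none count as live, so:
int liveReplicas = 0;       // decommissioning replicas are not live
int pendingNum = 0;
int curReplicas = liveReplicas + pendingNum;      // 0
int curReplicasDelta = 1;   // delta for the newly stored replica
int oldReplicas = curReplicas - curReplicasDelta; // -1: negative
{code}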

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, 
> HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes a block (bk1) to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written to all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report it to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report the IBR, but the block cannot be completed normally.
>  
> This then leads to the following exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> 

[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-24 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Attachment: HDFS-14429.03.patch

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, 
> HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes a block (bk1) to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written to all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report it to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report the IBR, but the block cannot be completed normally.
>  
> This then leads to the following exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will in turn cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed in the same way as HDFS-11499.






[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-24 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Attachment: HDFS-14429.branch-2.02.patch

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.branch-2.01.patch, HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes a block (bk1) to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written to all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report it to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report the IBR, but the block cannot be completed normally.
>  
> This then leads to the following exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will in turn cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed in the same way as HDFS-11499.






[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-24 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Attachment: (was: HDFS-14429.branch-2.02.patch)

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.branch-2.01.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes a block (bk1) to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written to all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report it to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report the IBR, but the block cannot be completed normally.
>  
> This then leads to the following exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will in turn cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed in the same way as HDFS-11499.






[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-24 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Attachment: (was: HDFS-14429.03.patch)

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.branch-2.01.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes a block (bk1) to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written to all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report it to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report the IBR, but the block cannot be completed normally.
>  
> This then leads to the following exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will in turn cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed in the same way as HDFS-11499.






[jira] [Commented] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-24 Thread Yicong Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871032#comment-16871032
 ] 

Yicong Cai commented on HDFS-14429:
---

Thanks [~hexiaoqiao] for reviewing my patch. I have addressed the three issues 
you mentioned (a/b/c).

 

trunk: [^HDFS-14429.03.patch]

branch-2: [^HDFS-14429.branch-2.02.patch]

 

d. Do we need to add {{pendingNum}} when calculating numUsableReplicas?

No need to add pendingNum: a block can enter the COMPLETE state only once it 
has FINALIZED replicas reaching the minimum replication, and a pending block 
is not yet FINALIZED.
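
A hedged sketch of the rule being described (a hypothetical helper, not the 
actual BlockManager API):

{code:java}
// Only FINALIZED replicas count toward minReplication; pending replicas are
// still in flight and not yet FINALIZED, so pendingNum is deliberately
// excluded from the completion check.
static boolean canCompleteBlock(int finalizedReplicas, int minReplication) {
  return finalizedReplicas >= minReplication;
}
{code}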

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, 
> HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes a block (bk1) to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written to all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report it to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report the IBR, but the block cannot be completed normally.
>  
> This then leads to the following exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will in turn cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed in the same way as HDFS-11499.






[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-24 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Attachment: HDFS-14429.branch-2.02.patch

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, 
> HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes a block (bk1) to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written to all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report it to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report the IBR, but the block cannot be completed normally.
>  
> This then leads to the following exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will also cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are on nodes in Decommissioning state.
> This problem needs a fix similar to HDFS-11499.
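To make the fix idea concrete, here is a minimal, hypothetical Java sketch of
the completion check (NodeState, Replica, canComplete and minReplication are
invented names for illustration, not the actual Hadoop API): replicas on
Decommissioning nodes still hold valid data, so they should count toward the
threshold that lets a COMMITTED block become COMPLETE, analogous to what
HDFS-11499 did on the commit path.

{code:java}
import java.util.List;

// Hypothetical sketch of the completion check in addStoredBlock().
// NodeState, Replica and canComplete() are invented names for illustration.
enum NodeState { NORMAL, DECOMMISSIONING, DECOMMISSIONED }

final class Replica {
    final NodeState nodeState;
    Replica(NodeState s) { this.nodeState = s; }
}

final class BlockCompletionCheck {
    /**
     * A COMMITTED block may become COMPLETE once enough usable replicas are
     * reported. Replicas on DECOMMISSIONING nodes still hold valid data, so
     * they count toward the threshold; only DECOMMISSIONED replicas do not.
     */
    static boolean canComplete(List<Replica> reported, int minReplication) {
        long usable = reported.stream()
                .filter(r -> r.nodeState != NodeState.DECOMMISSIONED)
                .count();
        return usable >= minReplication;
    }

    public static void main(String[] args) {
        // All three replicas were reported by nodes that entered
        // Decommissioning before sending their IBRs (steps 4-5 above).
        List<Replica> replicas = List.of(
                new Replica(NodeState.DECOMMISSIONING),
                new Replica(NodeState.DECOMMISSIONING),
                new Replica(NodeState.DECOMMISSIONING));
        // If decommissioning replicas were ignored, this would be 0 >= 1,
        // i.e. false, and the block would stay COMMITTED forever (the bug).
        System.out.println(canComplete(replicas, 1)); // true
    }
}
{code}

Under a counting rule like this, the block in the scenario above completes as
soon as the IBRs arrive instead of wedging in COMMITTED.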



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-24 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Attachment: HDFS-14429.03.patch

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes Block bk1 to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written on all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have sent their IBRs, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report their IBRs, but the block cannot be completed 
> normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will also cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are on nodes in Decommissioning state.
> This problem needs a fix similar to HDFS-11499.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-23 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Attachment: HDFS-14429.branch-2.01.patch

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.branch-2.01.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes Block bk1 to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written on all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have sent their IBRs, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report their IBRs, but the block cannot be completed 
> normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will also cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are on nodes in Decommissioning state.
> This problem needs a fix similar to HDFS-11499.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-23 Thread Yicong Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870579#comment-16870579
 ] 

Yicong Cai commented on HDFS-14429:
---

Provided branch-2 [^HDFS-14429.branch-2.01.patch] and trunk 
[^HDFS-14429.02.patch] patches. [~jojochuang]

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.branch-2.01.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes Block bk1 to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written on all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have sent their IBRs, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report their IBRs, but the block cannot be completed 
> normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will also cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are on nodes in Decommissioning state.
> This problem needs a fix similar to HDFS-11499.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-23 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Target Version/s: 2.10.0, 3.3.0, 2.9.3  (was: 3.3.0, 2.9.3)

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.branch-2.01.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes Block bk1 to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written on all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have sent their IBRs, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report their IBRs, but the block cannot be completed 
> normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will also cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are on nodes in Decommissioning state.
> This problem needs a fix similar to HDFS-11499.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-23 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Attachment: HDFS-14429.02.patch

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes Block bk1 to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written on all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have sent their IBRs, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report their IBRs, but the block cannot be completed 
> normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will also cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are on nodes in Decommissioning state.
> This problem needs a fix similar to HDFS-11499.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-23 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Attachment: (was: HDFS-14429.02.patch)

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes Block bk1 to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written on all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have sent their IBRs, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report their IBRs, but the block cannot be completed 
> normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will also cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are on nodes in Decommissioning state.
> This problem needs a fix similar to HDFS-11499.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-23 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
Attachment: HDFS-14429.02.patch

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes Block bk1 to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written on all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have sent their IBRs, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report their IBRs, but the block cannot be completed 
> normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will also cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are on nodes in Decommissioning state.
> This problem needs a fix similar to HDFS-11499.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-06-18 Thread Yicong Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867237#comment-16867237
 ] 

Yicong Cai commented on HDFS-14465:
---

[~jojochuang] [^HDFS-14465.branch-2.9.01.patch] is ready.

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14465.01.patch, HDFS-14465.02.patch, 
> HDFS-14465.branch-2.9.01.patch
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> On refreshNodes, the NameNode starts checking whether the blocks on that 
> node require additional replication.
> The replication of a MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false, and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job 
> file need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replication can never be increased, and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  
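As an illustration of why this wedges (a hedged sketch with invented names,
not the actual BlockManager code): with the default replication of 10 on a
5-node cluster, a check of the form liveReplicas >= expectedReplication can
never pass, while capping the effective target at the number of available
DataNodes lets maintenance proceed.

{code:java}
// Hedged sketch: why replication 10 can never be satisfied on a
// 5-DataNode cluster, and how capping the effective target avoids a
// permanently stuck Entering Maintenance state. All names are invented.
final class MaintenanceReplicationCheck {

    /** Naive check: can never pass when expected > cluster size (the bug). */
    static boolean sufficientNaive(int liveReplicas, int expected) {
        return liveReplicas >= expected;
    }

    /** Capped check: never demand more replicas than there are DataNodes. */
    static boolean sufficientCapped(int liveReplicas, int expected,
                                    int numLiveDataNodes) {
        return liveReplicas >= Math.min(expected, numLiveDataNodes);
    }

    public static void main(String[] args) {
        int expected = 10;      // default replication of the MR job file
        int numDataNodes = 5;   // small cluster
        int liveReplicas = 5;   // every node, including the one entering
                                // maintenance, already holds a replica

        // false forever: chooseTargetInOrder cannot place a 6th replica,
        // so the node never leaves Entering Maintenance.
        System.out.println(sufficientNaive(liveReplicas, expected));
        // true: with the target capped at the cluster size, the block is
        // considered sufficiently replicated and maintenance can proceed.
        System.out.println(sufficientCapped(liveReplicas, expected,
                                            numDataNodes));
    }
}
{code}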



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-06-18 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14465:
--
Attachment: HDFS-14465.branch-2.9.01.patch

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14465.01.patch, HDFS-14465.02.patch, 
> HDFS-14465.branch-2.9.01.patch
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> On refreshNodes, the NameNode starts checking whether the blocks on that 
> node require additional replication.
> The replication of a MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false, and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job 
> file need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replication can never be increased, and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-06-18 Thread Yicong Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866890#comment-16866890
 ] 

Yicong Cai commented on HDFS-14465:
---

Okay, I'll provide the branch-2 patch as soon as possible. [~jojochuang]

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14465.01.patch, HDFS-14465.02.patch
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> On refreshNodes, the NameNode starts checking whether the blocks on that 
> node require additional replication.
> The replication of a MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false, and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job 
> file need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replication can never be increased, and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-06-18 Thread Yicong Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866891#comment-16866891
 ] 

Yicong Cai commented on HDFS-14429:
---

Okay, I'll provide the relevant test cases.

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes Block bk1 to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written on all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have sent their IBRs, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report their IBRs, but the block cannot be completed 
> normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will also cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are on nodes in Decommissioning state.
> This problem needs a fix similar to HDFS-11499.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14465:
--
Attachment: HDFS-14465.02.patch
Status: Patch Available  (was: Open)

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Attachments: HDFS-14465.01.patch, HDFS-14465.02.patch
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> On refreshNodes, the NameNode starts checking whether the blocks on that 
> node require additional replication.
> The replication of a MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false, and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job 
> file need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replication can never be increased, and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14465:
--
Status: Open  (was: Patch Available)

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Attachments: HDFS-14465.01.patch, HDFS-14465.02.patch
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> On refreshNodes, the NameNode starts checking whether the blocks on that 
> node require additional replication.
> The replication of a MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false, and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job 
> file need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replication can never be increased, and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14465:
--
Attachment: (was: HDFS-14465.02.patch)

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Attachments: HDFS-14465.01.patch
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> On refreshNodes, the NameNode starts checking whether the blocks on that 
> node require additional replication.
> The replication of a MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false, and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job 
> file need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replication can never be increased, and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-05-07 Thread Yicong Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834534#comment-16834534
 ] 

Yicong Cai commented on HDFS-14465:
---

[^HDFS-14465.02.patch] fixes the checkstyle issues.

I tested the hadoop.hdfs.web.TestWebHdfsTimeouts case separately and it 
works fine.

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Attachments: HDFS-14465.01.patch, HDFS-14465.02.patch
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> On refreshNodes, the NameNode starts checking whether the blocks on that 
> node require additional replication.
> The replication of a MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false, and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job 
> file need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replication can never be increased, and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14465:
--
Attachment: HDFS-14465.02.patch

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Attachments: HDFS-14465.01.patch, HDFS-14465.02.patch
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> On refreshNodes, the NameNode starts checking whether the blocks on that 
> node require additional replication.
> The replication of a MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false, and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job 
> file need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replication can never be increased, and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14429:
--
  Attachment: HDFS-14429.01.patch
Target Version/s: 3.3.0, 2.9.3  (was: 2.9.3)
  Status: Patch Available  (was: Open)

> Block remain in COMMITTED but not COMPLETE cause by Decommission
> 
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Attachments: HDFS-14429.01.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # The client writes Block bk1 to three DataNodes (dn1/dn2/dn3).
>  # bk1 is completely written on all three DataNodes; each DataNode finalizes 
> the block, queues it in an incremental block report (IBR), and waits to 
> report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DataNodes have sent their IBRs, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DataNodes report their IBRs, but the block cannot be completed 
> normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> This will also cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are on nodes in Decommissioning state.
> This problem needs a fix similar to HDFS-11499.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14311:
--
Attachment: (was: HDFS-14311.1.patch)

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Fix For: 3.3.0, 2.9.3
>
> Attachments: HDFS-14311.1.patch
>
>
> When a DataNode upgrades from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion in the block pool storage loading process.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> root cause:
> A single BlockPoolSliceStorage instance is shared by the recover-transition 
> of all storage locations. BlockPoolSliceStorage.doTransition reads the old 
> layoutVersion from local storage, compares it with the current DataNode 
> version, and then performs the upgrade. doUpgrade runs the transition work 
> in a sub-thread, and that work sets the shared BlockPoolSliceStorage's 
> layoutVersion to the current DN version. The transition check for the next 
> storage directory therefore runs concurrently with the real transition work 
> of the previous directory, leaving the shared layoutVersion in an 
> inconsistent state.

[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14311:
--
   Fix Version/s: (was: 2.9.3)
  (was: 3.3.0)
  Attachment: HDFS-14311.1.patch
Target Version/s: 3.3.0, 2.9.3
  Status: Patch Available  (was: Open)

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Attachments: HDFS-14311.1.patch
>
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> root cause:
> A single BlockPoolSliceStorage instance is shared by the recover-transition 
> of all storage locations. BlockPoolSliceStorage.doTransition reads the old 
> layoutVersion from local storage, compares it with the current DataNode 
> version, and then performs the upgrade. doUpgrade runs the transition work 
> in a sub-thread, and that work sets the shared BlockPoolSliceStorage's 
> layoutVersion to the current DN version. The transition check for the next 
> storage directory therefore runs concurrently with the real transition work 
> of the previous directory, leaving the shared layoutVersion in an 
> inconsistent state.

[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14465:
--
Status: Open  (was: Patch Available)

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Attachments: HDFS-14465.01.patch
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> When refreshNodes runs, the NameNode starts checking whether the blocks on 
> that node require new replicas.
> The replication of the MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job file 
> need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replica count can never be increased and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  






[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14465:
--
  Attachment: HDFS-14465.01.patch
Target Version/s: 3.3.0, 2.9.3
  Status: Patch Available  (was: Open)

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Attachments: HDFS-14465.01.patch
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> When refreshNodes runs, the NameNode starts checking whether the blocks on 
> that node require new replicas.
> The replication of the MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job file 
> need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replica count can never be increased and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  






[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14465:
--
Fix Version/s: (was: 2.9.3)
   (was: 3.3.0)

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> When refreshNodes runs, the NameNode starts checking whether the blocks on 
> that node require new replicas.
> The replication of the MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job file 
> need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replica count can never be increased and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  






[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-05-07 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14465:
--
Fix Version/s: 2.9.3
   3.3.0
   Status: Patch Available  (was: Open)

> When the Block expected replications is larger than the number of DataNodes, 
> entering maintenance will never exit.
> --
>
> Key: HDFS-14465
> URL: https://issues.apache.org/jira/browse/HDFS-14465
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Fix For: 3.3.0, 2.9.3
>
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes; one of them is put into 
> maintenance by adding it to the maintenance list, with 
> dfs.namenode.maintenance.replication.min set to 1.
> When refreshNodes runs, the NameNode starts checking whether the blocks on 
> that node require new replicas.
> The replication of the MapReduce job file is 10 by default; 
> isNeededReplicationForMaintenance evaluates to false and 
> isSufficientlyReplicated evaluates to false, so the blocks of the job file 
> need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every 
> node already holds a replica of the block, chooseTargetInOrder throws a 
> NotEnoughReplicasException; the replica count can never be increased and 
> the Entering Maintenance state can never be exited.
> This issue makes maintenance mode unusable on small standalone clusters.
>  
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN  
> [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
> replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For 
> more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
> org.apache.hadoop.net.NetworkTopology
> {panel}
>  






[jira] [Created] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.

2019-05-05 Thread Yicong Cai (JIRA)
Yicong Cai created HDFS-14465:
-

 Summary: When the Block expected replications is larger than the 
number of DataNodes, entering maintenance will never exit.
 Key: HDFS-14465
 URL: https://issues.apache.org/jira/browse/HDFS-14465
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.9.2
Reporter: Yicong Cai


Scenario:

There is a small HDFS cluster with 5 DataNodes; one of them is put into 
maintenance by adding it to the maintenance list, with 
dfs.namenode.maintenance.replication.min set to 1.
When refreshNodes runs, the NameNode starts checking whether the blocks on 
that node require new replicas.
The replication of the MapReduce job file is 10 by default; 
isNeededReplicationForMaintenance evaluates to false and 
isSufficientlyReplicated evaluates to false, so the blocks of the job file 
need additional replicas.
When adding a replica, since the cluster has only 5 DataNodes and every node 
already holds a replica of the block, chooseTargetInOrder throws a 
NotEnoughReplicasException; the replica count can never be increased and the 
Entering Maintenance state can never be exited. A minimal sketch of this 
pigeonhole follows the exception log below.

This issue makes maintenance mode unusable on small standalone clusters.

 
{panel:title=chooseTarget exception log}
2019-05-03 23:42:31,008 [31545331] - WARN  
[ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough 
replicas, still in need of 1 to reach 5 (unavailableStorages=[], 
storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more 
information, please enable DEBUG log level on 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and 
org.apache.hadoop.net.NetworkTopology
{panel}
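
The pigeonhole behind this is easy to reproduce outside HDFS. Below is a 
minimal, hypothetical Java sketch (class and method names are illustrative, 
not the real BlockPlacementPolicy/chooseTargetInOrder implementation) of why 
target selection must fail once every DataNode already holds a replica while 
the expected replication exceeds the cluster size:

{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the replication stall; illustrative names only,
// not the real BlockPlacementPolicy code.
public class MaintenanceStallSketch {

    static class NotEnoughReplicasException extends Exception {}

    /** Pick one more target node that does not already hold a replica. */
    static String chooseTarget(Set<String> allNodes, Set<String> holders)
            throws NotEnoughReplicasException {
        for (String node : allNodes) {
            if (!holders.contains(node)) {
                return node;
            }
        }
        // Every DataNode already stores the block, mirroring
        // chooseTargetInOrder throwing NotEnoughReplicasException.
        throw new NotEnoughReplicasException();
    }

    public static void main(String[] args) {
        Set<String> dataNodes = Set.of("dn1", "dn2", "dn3", "dn4", "dn5");
        Set<String> holders = new HashSet<>(dataNodes); // all 5 hold a replica
        int expectedReplication = 10; // MapReduce job-file default

        while (holders.size() < expectedReplication) {
            try {
                holders.add(chooseTarget(dataNodes, holders));
            } catch (NotEnoughReplicasException e) {
                // The ReplicationMonitor retries forever in practice, so the
                // Entering Maintenance state is never exited.
                System.out.println("cannot reach " + expectedReplication
                        + " replicas with " + dataNodes.size() + " DataNodes");
                return;
            }
        }
    }
}
{code}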
 






[jira] [Created] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission

2019-04-15 Thread Yicong Cai (JIRA)
Yicong Cai created HDFS-14429:
-

 Summary: Block remain in COMMITTED but not COMPLETE cause by 
Decommission
 Key: HDFS-14429
 URL: https://issues.apache.org/jira/browse/HDFS-14429
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.9.2
Reporter: Yicong Cai


In the following scenario, the Block will remain in the COMMITTED but not 
COMPLETE state and cannot be closed properly:
 # The client writes Block(bk1) to three DataNodes (dn1/dn2/dn3).
 # bk1 is completely written to the three DataNodes; each DataNode finalizes 
the block, adds it to the incremental block report (IBR), and waits to report 
it to the NameNode.
 # The client commits bk1 after receiving the ACK.
 # Before the DNs have sent the IBR, all three nodes dn1/dn2/dn3 enter 
Decommissioning.
 # The DNs report the IBR, but the block cannot be completed normally.

 

This then leads to the following related exceptions:
{panel:title=Exception}
2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
(FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= minimum 
= 1) in file xxx
2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - IPC 
Server handler 499 on 8020, call Call#122552 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
replicated yet: xxx
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
{panel}
This in turn causes the scenario described in HDFS-12747.

The root cause is that addStoredBlock does not consider the case where the 
replicas are on Decommissioning nodes.
This problem needs to be fixed in the same way as HDFS-11499.
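
A minimal sketch of the completion check, under a deliberately simplified 
replica model (the types and method below are illustrative, not the actual 
BlockManager/addStoredBlock code): a committed block moves to COMPLETE only 
once enough usable replicas are reported, and if replicas on Decommissioning 
nodes do not count as usable, a block whose reporting nodes are all 
decommissioning stays stuck in COMMITTED.

{code:java}
import java.util.List;

// Hypothetical, simplified model of the COMMITTED -> COMPLETE transition;
// illustrative only, not the real BlockManager code.
public class CommittedBlockSketch {

    enum ReplicaState { NORMAL, DECOMMISSIONING }

    /** A committed block completes only with enough usable replicas. */
    static boolean canComplete(List<ReplicaState> reportedReplicas,
                               int minReplication) {
        long usable = reportedReplicas.stream()
                .filter(r -> r == ReplicaState.NORMAL) // decommissioning excluded
                .count();
        return usable >= minReplication;
    }

    public static void main(String[] args) {
        // dn1/dn2/dn3 all entered Decommissioning before sending the IBR:
        List<ReplicaState> replicas = List.of(
                ReplicaState.DECOMMISSIONING,
                ReplicaState.DECOMMISSIONING,
                ReplicaState.DECOMMISSIONING);
        // numNodes = 3 >= minimum = 1, yet no replica is usable, so the
        // block stays "COMMITTED but not COMPLETE", as in the log above.
        System.out.println("can complete: " + canComplete(replicas, 1));
    }
}
{code}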






[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-02-22 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14311:
--
Attachment: HDFS-14311.1.patch

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Fix For: 3.3.0, 2.9.3
>
> Attachments: HDFS-14311.1.patch
>
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> root cause:
> A single BlockPoolSliceStorage instance is shared by the recover-transition 
> of all storage locations. BlockPoolSliceStorage.doTransition reads the old 
> layoutVersion from local storage, compares it with the current DataNode 
> version, and then performs the upgrade. doUpgrade runs the transition work 
> in a sub-thread, and that work sets the shared BlockPoolSliceStorage's 
> layoutVersion to the current DN version. The transition check for the next 
> storage directory therefore runs concurrently with the real transition work 
> of the previous directory, leaving the shared layoutVersion in an 
> inconsistent state.
> 

[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-02-22 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14311:
--
Status: Open  (was: Patch Available)

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Fix For: 3.3.0, 2.9.3
>
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> root cause:
> A single BlockPoolSliceStorage instance is shared by the recover-transition 
> of all storage locations. BlockPoolSliceStorage.doTransition reads the old 
> layoutVersion from local storage, compares it with the current DataNode 
> version, and then performs the upgrade. doUpgrade runs the transition work 
> in a sub-thread, and that work sets the shared BlockPoolSliceStorage's 
> layoutVersion to the current DN version. The transition check for the next 
> storage directory therefore runs concurrently with the real transition work 
> of the previous directory, leaving the shared layoutVersion in an 
> inconsistent state.

[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-02-22 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14311:
--
Fix Version/s: 2.9.3
   3.3.0
   Status: Patch Available  (was: Open)

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
> Fix For: 3.3.0, 2.9.3
>
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> root cause:
> A single BlockPoolSliceStorage instance is shared by the recover-transition 
> of all storage locations. BlockPoolSliceStorage.doTransition reads the old 
> layoutVersion from local storage, compares it with the current DataNode 
> version, and then performs the upgrade. doUpgrade runs the transition work 
> in a sub-thread, and that work sets the shared BlockPoolSliceStorage's 
> layoutVersion to the current DN version. The transition check for the next 
> storage directory therefore runs concurrently with the real transition work 
> of the previous directory, leaving the shared layoutVersion in an 
> inconsistent state.

[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-02-22 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14311:
--
Attachment: (was: HDFS-14311.1.patch)

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> root cause:
> A single BlockPoolSliceStorage instance is shared by the recover-transition 
> of all storage locations. BlockPoolSliceStorage.doTransition reads the old 
> layoutVersion from local storage, compares it with the current DataNode 
> version, and then performs the upgrade. doUpgrade runs the transition work 
> in a sub-thread, and that work sets the shared BlockPoolSliceStorage's 
> layoutVersion to the current DN version. The transition check for the next 
> storage directory therefore runs concurrently with the real transition work 
> of the previous directory, leaving the shared layoutVersion in an 
> inconsistent state.

[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-02-22 Thread Yicong Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Cai updated HDFS-14311:
--
Attachment: HDFS-14311.1.patch

> multi-threading conflict at layoutVersion when loading block pool storage
> -
>
> Key: HDFS-14311
> URL: https://issues.apache.org/jira/browse/HDFS-14311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Priority: Major
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
> StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>  
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] 
> - Restored 36974 block files from trash before the layout upgrade. These 
> blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] 
> - Failed to analyze storage directories for block pool 
> BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed 
> to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block 
> pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
> namespace state: LV = -63 CTime = 0
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
>  at java.lang.Thread.run(Thread.java:748) 
> {panel}
>  
> root cause:
> A single BlockPoolSliceStorage instance is shared by the recover-transition 
> of all storage locations. BlockPoolSliceStorage.doTransition reads the old 
> layoutVersion from local storage, compares it with the current DataNode 
> version, and then performs the upgrade. doUpgrade runs the transition work 
> in a sub-thread, and that work sets the shared BlockPoolSliceStorage's 
> layoutVersion to the current DN version. The transition check for the next 
> storage directory therefore runs concurrently with the real transition work 
> of the previous directory, leaving the shared layoutVersion in an 
> inconsistent state.

[jira] [Created] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage

2019-02-21 Thread Yicong Cai (JIRA)
Yicong Cai created HDFS-14311:
-

 Summary: multi-threading conflict at layoutVersion when loading 
block pool storage
 Key: HDFS-14311
 URL: https://issues.apache.org/jira/browse/HDFS-14311
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rolling upgrades
Affects Versions: 2.9.2
Reporter: Yicong Cai


When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on 
StorageInfo.layoutVersion while loading the block pool storage.

It causes this exception:

 
{panel:title=exceptions}
2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] - 
Restored 36974 block files from trash before the layout upgrade. These blocks 
will be moved to the previous directory during the upgrade
2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] - 
Failed to analyze storage directories for block pool 
BP-1216718839-10.120.232.23-1548736842023
java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
namespace state: LV = -63 CTime = 0
 at 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
 at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
 at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
 at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
 at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
 at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
 at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
 at java.lang.Thread.run(Thread.java:748)
2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed to 
add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block pool 
BP-1216718839-10.120.232.23-1548736842023
java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the 
namespace state: LV = -63 CTime = 0
 at 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
 at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
 at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
 at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
 at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
 at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
 at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
 at java.lang.Thread.run(Thread.java:748) 
{panel}
 

root cause:

A single BlockPoolSliceStorage instance is shared by the recover-transition 
of all storage locations. BlockPoolSliceStorage.doTransition reads the old 
layoutVersion from local storage, compares it with the current DataNode 
version, and then performs the upgrade. doUpgrade runs the transition work in 
a sub-thread, and that work sets the shared BlockPoolSliceStorage's 
layoutVersion to the current DN version. The transition check for the next 
storage directory therefore runs concurrently with the real transition work 
of the previous directory, leaving the shared layoutVersion in an 
inconsistent state.
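
The race is straightforward to sketch. The following minimal Java program is 
a hypothetical model, not the actual BlockPoolSliceStorage implementation 
(field and method names are illustrative): one shared mutable layoutVersion 
field is read while checking the next storage directory at the same time as 
the upgrade sub-thread spawned for the previous directory is writing it.

{code:java}
import java.util.List;

// Hypothetical model of the race: one shared storage object, one upgrade
// sub-thread per storage directory. Illustrative only.
public class SharedLayoutVersionSketch {

    private volatile int layoutVersion; // shared across all storage dirs

    // Simplified stand-in for doTransition(): read the on-disk layout
    // version, then hand the upgrade work to a sub-thread that mutates the
    // SAME shared field (as doUpgrade does).
    void doTransition(String dir, int onDiskLv, int datanodeLv) {
        layoutVersion = onDiskLv;          // e.g. -57 read from disk
        if (layoutVersion != datanodeLv) { // upgrade is needed
            new Thread(() -> layoutVersion = datanodeLv).start();
        }
        System.out.println(dir + " checked with layoutVersion=" + layoutVersion);
    }

    public static void main(String[] args) {
        SharedLayoutVersionSketch shared = new SharedLayoutVersionSketch();
        // Two directories on the old layout (-57), DataNode software at -63.
        // The check for /mnt/dfs/2 can interleave with the sub-thread spawned
        // for /mnt/dfs/1, so the shared field may flip between -57 and -63
        // underneath it -- the confusion behind the "Datanode state ... is
        // newer than the namespace state" failure above.
        for (String dir : List.of("/mnt/dfs/1", "/mnt/dfs/2")) {
            shared.doTransition(dir, -57, -63);
        }
    }
}
{code}

Under that assumption, a natural fix direction is to stop sharing the mutable 
layoutVersion across the per-directory transitions, for example by giving 
each storage directory its own StorageInfo snapshot.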

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org