[jira] [Comment Edited] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt
[ https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065461#comment-17065461 ]

zhengchenyu edited comment on HDFS-15237 at 3/27/20, 5:14 AM:
--------------------------------------------------------------

Let me note a strange phenomenon here. Why does this error happen? Because I found an incorrect internal block distribution like this:

{code:java}
0. BP-1936287042-10.200.128.33-1573194961291:blk_-9223372036797335472_3783205 len=247453749 Live_repl=8  [blk_-9223372036797335472:DatanodeInfoWithStorage[10.200.128.43:9866,DS-2ddde0b8-6a84-4d06-8a40-d4ae5691e81c,DISK], blk_-9223372036797335471:DatanodeInfoWithStorage[10.200.128.41:9866,DS-a4fc5486-6c45-481e-84e7-9393eeaf1313,DISK], blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.50:9866,DS-fc0632c6-8916-42d8-8219-57b022bb2786,DISK], blk_-9223372036797335469:DatanodeInfoWithStorage[10.200.128.54:9866,DS-1b6cb52a-f55a-4ef8-beaf-a5d7b7fe93aa,DISK], blk_-9223372036797335467:DatanodeInfoWithStorage[10.200.128.52:9866,DS-fc6e00dd-ca5a-4580-9403-aeb6906da81a,DISK], blk_-9223372036797335466:DatanodeInfoWithStorage[10.200.128.53:9866,DS-2c926a3b-64c0-441b-abe2-188e79918abe,DISK], blk_-9223372036797335465:DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK], blk_-9223372036797335464:DatanodeInfoWithStorage[10.200.128.44:9866,DS-3725af76-fe86-4f97-9740-d77bfa339b3f,DISK], blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.45:9866,DS-250fd4cf-705f-4cb5-bc3a-c7a105247e35,DISK]]
{code}

This is the output of hdfs fsck. You can see that this block group has 9 internal blocks, but blk_-9223372036797335468 is missing and blk_-9223372036797335470 appears twice. Although the distribution is wrong, the block group still has 9 internal blocks, so the replicatorMonitor can't repair this error. I think this may be another issue. ([~gjhkael] reminded me that HDFS-14754 is the related issue.) This block is too old, so the log is gone; I don't know the cause and can't reproduce the error now.

> Get checksum of EC file failed, when some block is missing or corrupt
> ----------------------------------------------------------------------
>
>                 Key: HDFS-15237
>                 URL: https://issues.apache.org/jira/browse/HDFS-15237
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, hdfs
>    Affects Versions: 3.2.1
>            Reporter: zhengchenyu
>            Priority: Major
>             Fix For: 3.2.2
>
> When we distcp from an ec directory to another one, I found some error like
> this.
> {code}
> 2020-03-20 20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper:
> src=/EC/6-3//000325_0,
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]
> java.io.EOFException: Unexpected EOF while trying
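A quick way to confirm this kind of anomaly mechanically is to scan the fsck block-group line for the expected internal block IDs. The sketch below is a throwaway checker, not HDFS code; it assumes the fsck line format shown above and uses the fact that, for a striped group, internal block i has ID equal to the block group ID plus i (0 through 8 for RS-6-3-1024k).

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Throwaway checker for one fsck block-group line (illustrative, not HDFS code).
public class EcGroupCheck {
  // Matches internal-block entries such as "blk_-9223372036797335470:DatanodeInfoWithStorage[...]".
  private static final Pattern INTERNAL =
      Pattern.compile("(blk_-?\\d+):DatanodeInfoWithStorage");

  public static void main(String[] args) {
    String fsckLine = args[0];                 // paste one block-group line from fsck
    long groupId = -9223372036797335472L;      // block group ID from the line above
    Map<Long, Integer> count = new HashMap<>();
    Matcher m = INTERNAL.matcher(fsckLine);
    while (m.find()) {
      long id = Long.parseLong(m.group(1).substring("blk_".length()));
      count.merge(id, 1, Integer::sum);
    }
    // For RS-6-3-1024k, internal block i has ID groupId + i, for i in 0..8.
    for (int i = 0; i < 9; i++) {
      long id = groupId + i;
      int c = count.getOrDefault(id, 0);
      if (c != 1) {
        System.out.println("blk_" + id + " appears " + c + " time(s)");
      }
    }
  }
}
{code}

Run against the line above, this would report that blk_-9223372036797335468 appears 0 times and blk_-9223372036797335470 appears 2 times, matching the missing and duplicated internal blocks.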
[jira] [Commented] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt
[ https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068275#comment-17068275 ]

zhengchenyu commented on HDFS-15237:
------------------------------------

[~gjhkael] Oh, I see. I reviewed HDFS-14754; it solves exactly my problem. Thank you!

> Get checksum of EC file failed, when some block is missing or corrupt
> ----------------------------------------------------------------------
>
>                 Key: HDFS-15237
>                 URL: https://issues.apache.org/jira/browse/HDFS-15237
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, hdfs
>    Affects Versions: 3.2.1
>            Reporter: zhengchenyu
>            Priority: Major
>             Fix For: 3.2.2
>
> When we distcp from an ec directory to another one, I found some error like
> this.
> {code}
> 2020-03-20 20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper:
> src=/EC/6-3//000325_0,
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]
> java.io.EOFException: Unexpected EOF while trying to read response from server
>   at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:550)
>   at org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.tryDatanode(FileChecksumHelper.java:709)
>   at org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlockGroup(FileChecksumHelper.java:664)
>   at org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlocks(FileChecksumHelper.java:638)
>   at org.apache.hadoop.hdfs.FileChecksumHelper$FileChecksumComputer.compute(FileChecksumHelper.java:252)
>   at org.apache.hadoop.hdfs.DFSClient.getFileChecksumInternal(DFSClient.java:1790)
>   at org.apache.hadoop.hdfs.DFSClient.getFileChecksumWithCombineMode(DFSClient.java:1810)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1691)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1688)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1700)
>   at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:138)
>   at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:115)
>   at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
>   at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:259)
>   at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:220)
>   at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {code}
> And then I found some error in the datanode like this
> {code}
> 2020-03-20 20:54:16,573 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient:
> SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
> 2020-03-20 20:54:16,577 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> bd-hadoop-128050.zeus.lianjia.com:9866:DataXceiver error processing BLOCK_GROUP_CHECKSUM
> operation src: /10.201.1.38:33264 dst: /10.200.128.50:9866
> java.lang.UnsupportedOperationException
>   at java.nio.ByteBuffer.array(ByteBuffer.java:994)
>   at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockChecksumReconstructor.reconstruct(StripedBlockChecksumReconstructor.java:90)
>   at org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.recalculateChecksum(BlockChecksumHelper.java:711)
>   at org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.compute(BlockChecksumHelper.java:489)
>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.blockGroupChecksum(DataXceiver.java:1047)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opStripedBlockChecksum(Receiver.java:327)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:119)
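The java.lang.UnsupportedOperationException at ByteBuffer.array() in the DataNode trace is the documented behavior of buffers without an accessible backing array; presumably the checksum reconstructor received a direct buffer here. A standalone demonstration:

{code:java}
import java.nio.ByteBuffer;

public class BufferArrayDemo {
  public static void main(String[] args) {
    // Heap buffers expose their backing array.
    ByteBuffer heap = ByteBuffer.allocate(16);
    System.out.println("heap hasArray: " + heap.hasArray());      // true

    // Direct buffers do not; array() throws, exactly as in the trace above.
    ByteBuffer direct = ByteBuffer.allocateDirect(16);
    System.out.println("direct hasArray: " + direct.hasArray());  // false
    try {
      direct.array();  // the same call that fails in StripedBlockChecksumReconstructor
    } catch (UnsupportedOperationException e) {
      System.out.println("array() on a direct buffer throws: " + e);
    }
  }
}
{code}

Code that needs the bytes of an arbitrary buffer has to check hasArray() first and fall back to copying via get(byte[]) on a duplicate when there is no accessible array.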
[jira] [Comment Edited] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt
[ https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068263#comment-17068263 ]

guojh edited comment on HDFS-15237 at 3/27/20, 4:42 AM:
--------------------------------------------------------

[~zhengchenyu] Have you tried the patch from [HDFS-14754|https://issues.apache.org/jira/browse/HDFS-14754]? It looks like the redundant block was not deleted first.

was (Author: gjhkael):
[~zhengchenyu] Have you tried this patch? (https://issues.apache.org/jira/browse/HDFS-14754)

> Get checksum of EC file failed, when some block is missing or corrupt
> ----------------------------------------------------------------------
>
>                 Key: HDFS-15237
>                 URL: https://issues.apache.org/jira/browse/HDFS-15237
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, hdfs
>    Affects Versions: 3.2.1
>            Reporter: zhengchenyu
>            Priority: Major
>             Fix For: 3.2.2
>
> When we distcp from an ec directory to another one, I found some error like
> this.
> {code}
> 2020-03-20 20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper:
> src=/EC/6-3//000325_0,
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]
> java.io.EOFException: Unexpected EOF while trying to read response from server
>   at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:550)
>   at org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.tryDatanode(FileChecksumHelper.java:709)
>   at org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlockGroup(FileChecksumHelper.java:664)
>   at org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlocks(FileChecksumHelper.java:638)
>   at org.apache.hadoop.hdfs.FileChecksumHelper$FileChecksumComputer.compute(FileChecksumHelper.java:252)
>   at org.apache.hadoop.hdfs.DFSClient.getFileChecksumInternal(DFSClient.java:1790)
>   at org.apache.hadoop.hdfs.DFSClient.getFileChecksumWithCombineMode(DFSClient.java:1810)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1691)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1688)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1700)
>   at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:138)
>   at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:115)
>   at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
>   at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:259)
>   at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:220)
>   at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {code}
> And then I found some error in the datanode like this
> {code}
> 2020-03-20 20:54:16,573 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient:
> SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
> 2020-03-20 20:54:16,577 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> bd-hadoop-128050.zeus.lianjia.com:9866:DataXceiver error processing BLOCK_GROUP_CHECKSUM
> operation src: /10.201.1.38:33264 dst: /10.200.128.50:9866
> java.lang.UnsupportedOperationException
>   at java.nio.ByteBuffer.array(ByteBuffer.java:994)
>   at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockChecksumReconstructor.reconstruct(StripedBlockChecksumReconstructor.java:90)
>   at org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.recalculateChecksum(BlockChecksumHelper.java:711)
>   at org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.compute(BlockChecksumHelper.java:489)
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068170#comment-17068170 ]

Hadoop QA commented on HDFS-15240:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 33s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 13s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 21m 6s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 42s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 26s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 20s{color} | {color:red} hadoop-hdfs-client in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 1m 42s{color} | {color:red} root in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 1m 42s{color} | {color:red} root in the patch failed. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 47s{color} | {color:orange} The patch fails to run checkstyle in root {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 21s{color} | {color:red} hadoop-hdfs-client in the patch failed. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red} 3m 0s{color} | {color:red} patch has errors when building and testing our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 19s{color} | {color:red} hadoop-hdfs-client in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 16s{color} | {color:red} hadoop-hdfs-client in the patch failed. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 49s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 19s{color} | {color:red} hadoop-hdfs-client in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 91m 16s{color} | {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 35s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}188m 1s{color} | {color:black} {color} |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 |
| JIRA Issue | HDFS-15240 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12997870/HDFS-15240.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 4d271edefa4e 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6
[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068119#comment-17068119 ]

Íñigo Goiri commented on HDFS-15242:
------------------------------------

+1 on [^HDFS-15242.003.patch].

> Add metrics for operations hold lock times of FsDatasetImpl
> ------------------------------------------------------------
>
>                 Key: HDFS-15242
>                 URL: https://issues.apache.org/jira/browse/HDFS-15242
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch, HDFS-15242.003.patch
>
> Some operations of FsDatasetImpl need to hold the lock, and sometimes they take
> a long time to execute since they perform IO while holding the lock. I propose
> to add metrics for these operations so it is easier to monitor them and find
> bottlenecks.
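For context, a minimal sketch of the idea being proposed: acquire the lock, remember the acquisition time, and record the hold duration on release. All names here (TimedAutoCloseableLock, totalHoldNanos) are invented for illustration and are not the API of the attached patches.

{code:java}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

/** Times how long a lock is held; a sketch, not the HDFS-15242 patch. */
public class TimedAutoCloseableLock implements AutoCloseable {
  private final ReentrantLock lock;
  private final AtomicLong totalHoldNanos;  // stand-in for a real metrics sink
  private final long acquiredAtNanos;

  public TimedAutoCloseableLock(ReentrantLock lock, AtomicLong totalHoldNanos) {
    this.lock = lock;
    this.totalHoldNanos = totalHoldNanos;
    lock.lock();                            // blocks until the lock is acquired
    this.acquiredAtNanos = System.nanoTime();
  }

  @Override
  public void close() {
    long heldNanos = System.nanoTime() - acquiredAtNanos;
    lock.unlock();
    totalHoldNanos.addAndGet(heldNanos);    // record the hold time as a metric
  }
}
{code}

Usage would be try-with-resources around each guarded operation, e.g. {{try (TimedAutoCloseableLock l = new TimedAutoCloseableLock(datasetLock, holdNanos)) { /* IO under lock */ }}}, so every lock-holding operation is timed without scattering timing code through FsDatasetImpl.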
[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068091#comment-17068091 ]

Hadoop QA commented on HDFS-15242:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 51s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 10s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 20m 34s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 15m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 4s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 39s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 7s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}115m 7s{color} | {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 53s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}239m 4s{color} | {color:black} {color} |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
| | hadoop.hdfs.server.balancer.TestBalancer |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 |
| JIRA Issue | HDFS-15242 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12997922/HDFS-15242.003.patch |
| Optional Tests | dupname asflicense mvnsite compile javac javadoc mvninstall unit shadedclient findbugs checkstyle |
| uname | Linux 2742ac603bf8 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 745a6c1 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
| unit |
[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-15240:
-----------------------------------
    Status: Patch Available  (was: Open)

> Erasure Coding: dirty buffer causes reconstruction block error
> ----------------------------------------------------------------
>
>                 Key: HDFS-15240
>                 URL: https://issues.apache.org/jira/browse/HDFS-15240
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, erasure-coding
>            Reporter: HuangTao
>            Assignee: HuangTao
>            Priority: Major
>         Attachments: HDFS-15240.001.patch
>
> When reading some lzo files, we found some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from
> the DNs directly, chose 6 blocks (b0-b5) to decode the other 3 blocks (b6', b7',
> b8'), and found the longest common substring (LCS) between b6' (decoded) and b6
> (read from the DN), and likewise for b7'/b7 and b8'/b8.
> After selecting 6 blocks of the block group in every combination and iterating
> through all cases, I found one case where the LCS length is the block length -
> 64KB; 64KB is exactly the length of the ByteBuffer used by StripedBlockReader.
> So the corrupt reconstruction block was produced by a dirty buffer.
> The following log snippet (only 2 of the 28 cases are shown) is my check
> program's output. In my case, I knew the 3rd block was corrupt, so the other 5
> blocks were needed to decode another 3 blocks; I then found that the 1st block's
> LCS is the block length - 64KB.
> It means blocks (0,1,2,4,5,6) were used to reconstruct the 3rd block, and the
> dirty buffer was used before reading the 1st block.
> It must be noted that StripedBlockReader reads from offset 0 of the 1st block
> after the dirty buffer was used.
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 27197440 # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know the dirty buffer causes the reconstruction block error, but how does
> the dirty buffer come about?
> After digging into the code and the DN log, I found that the following DN log is
> the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel
> java.nio.channels.SocketChannel[connected local=/:52586 remote=/:50010].
> 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped block:
> BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
>   at org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
>   at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
>   at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
>   at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
>   at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
>   at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834) {code}
> Reading from a DN may time out (held by a future F) and output the INFO log
> above, but the futures map that contains the future F has already been cleared:
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
>     StripingChunkReadResult.CANCELLED); {code}
> futures.remove(future) returns null and causes the NPE, so the EC reconstruction
> fails. In the finally phase, the code snippet in *getStripedReader().close()*
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer and
> writes into it.
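To make the failure mode concrete, here is a self-contained sketch of the race described above. It is not the HDFS source: ReadResult and all other names are invented for illustration. A timeout/cancel path clears the bookkeeping map, so the completion path's remove() can return null, and an unguarded caller then trips the same NPE seen in the DataNode log.

{code:java}
import java.util.Map;
import java.util.concurrent.CompletionService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch only: models the bookkeeping race, not actual HDFS code.
public class StripedReadRaceSketch {
  static final class ReadResult {
    final Integer chunkIndex;  // null: the read is no longer tracked
    final String state;
    ReadResult(Integer chunkIndex, String state) {
      this.chunkIndex = chunkIndex;
      this.state = state;
    }
  }

  static ReadResult nextCompleted(CompletionService<Void> readService,
                                  Map<Future<Void>, Integer> futures,
                                  long timeoutMs) throws InterruptedException {
    Future<Void> done = readService.poll(timeoutMs, TimeUnit.MILLISECONDS);
    if (done == null) {
      return new ReadResult(null, "TIMEOUT");
    }
    Integer idx = futures.remove(done);
    if (idx == null) {
      // The entry was already cleared by the timeout path; without this
      // guard, dereferencing idx downstream throws the NPE from the log.
      return new ReadResult(null, "CANCELLED");
    }
    return new ReadResult(idx, "COMPLETED");
  }
}
{code}

The same reasoning applies to the buffer lifecycle: a reader whose future timed out may still be writing into its buffer, so freeing and reusing that buffer before the reader is fully closed is what makes the buffer "dirty".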
[jira] [Comment Edited] (HDFS-15169) RBF: Router FSCK should consider the mount table
[ https://issues.apache.org/jira/browse/HDFS-15169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067989#comment-17067989 ]

Íñigo Goiri edited comment on HDFS-15169 at 3/26/20, 7:52 PM:
--------------------------------------------------------------

Thanks [~hexiaoqiao] for the unit test, it covers my concerns.
I'm thinking about refactoring the main code to:
{code}
/**
 * Get path location using path parameter of fsck request from client.
 */
private Set<String> getNameSpacesForPath(String[] paths) {
  if (paths == null || paths.length == 0) {
    return null;
  }
  String path = paths[0];
  PathLocation pathLocation =
      router.getSubclusterResolver().getDestinationForPath(path);
  return pathLocation.getNamespaces();
}

/**
 * Redirect the request to a certain active downstream NameNode if we can
 * resolve the target namespace; otherwise redirect the requests to all
 * active downstream NameNodes.
 */
private List<MembershipState> getMembershipsForPath(
    final String[] paths, final List<MembershipState> memberships) {
  Set<String> nss = getNameSpacesForPath(paths);
  if (nss == null || nss.isEmpty()) {
    return memberships;
  }
  List<MembershipState> targetMemberships = new ArrayList<>();
  for (String ns : nss) {
    for (MembershipState ms : memberships) {
      if (ms.getState() == FederationNamenodeServiceState.ACTIVE
          && ns.equals(ms.getNameserviceId())) {
        targetMemberships.add(ms);
      }
    }
  }
  return targetMemberships;
}

public void fsck() {
  ...
  String[] paths = pmap.get("path");
  List<MembershipState> targetMemberships =
      getMembershipsForPath(paths, memberships);
  for (MembershipState nn : targetMemberships) {
    ...
  }
{code}
BTW, let's fix the checkstyles, and the failed unit test might be related.

> RBF: Router FSCK should consider the mount table
> --------------------------------------------------
>
>                 Key: HDFS-15169
>                 URL: https://issues.apache.org/jira/browse/HDFS-15169
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: rbf
>            Reporter: Akira Ajisaka
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-15169.001.patch, HDFS-15169.002.patch
>
> HDFS-13989 implemented FSCK for DFSRouter, but for now it just redirects the
> requests to all the active downstream NameNodes. The DFSRouter should consider
> the mount table when redirecting the requests.
[jira] [Updated] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoqiao He updated HDFS-15242:
-------------------------------
    Attachment: HDFS-15242.003.patch

> Add metrics for operations hold lock times of FsDatasetImpl
> ------------------------------------------------------------
>
>                 Key: HDFS-15242
>                 URL: https://issues.apache.org/jira/browse/HDFS-15242
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch, HDFS-15242.003.patch
>
> Some operations of FsDatasetImpl need to hold the lock, and sometimes they take
> a long time to execute since they perform IO while holding the lock. I propose
> to add metrics for these operations so it is easier to monitor them and find
> bottlenecks.
[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067915#comment-17067915 ]

Xiaoqiao He commented on HDFS-15242:
------------------------------------

Thanks [~elgoiri] for the review. v003 fixes the typo. PTAL.

> Add metrics for operations hold lock times of FsDatasetImpl
> ------------------------------------------------------------
>
>                 Key: HDFS-15242
>                 URL: https://issues.apache.org/jira/browse/HDFS-15242
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch, HDFS-15242.003.patch
>
> Some operations of FsDatasetImpl need to hold the lock, and sometimes they take
> a long time to execute since they perform IO while holding the lock. I propose
> to add metrics for these operations so it is easier to monitor them and find
> bottlenecks.
[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier
[ https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067897#comment-17067897 ]

Chen Liang commented on HDFS-15191:
-----------------------------------

Hey [~Steven Rand], sorry, I did plan to take another look but have been busy recently. I will take a look today or tomorrow.

> EOF when reading legacy buffer in BlockTokenIdentifier
> --------------------------------------------------------
>
>                 Key: HDFS-15191
>                 URL: https://issues.apache.org/jira/browse/HDFS-15191
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.2.1
>            Reporter: Steven Rand
>            Assignee: Steven Rand
>            Priority: Major
>         Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, HDFS-15191.003.patch, HDFS-15191.004.patch
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 3.2.1.
> After this upgrade (but not before), we sometimes see these errors when this
> application is used with clusters still running Hadoop 2.x (more specifically
> CDH 5.12.1):
> {code}
> WARN [2020-02-24T00:54:32.856Z] org.apache.hadoop.hdfs.client.impl.BlockReaderFactory:
> I/O error constructing remote block reader. (_sampled: true)
> java.io.EOFException:
>   at java.io.DataInputStream.readByte(DataInputStream.java:272)
>   at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>   at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>   at org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
>   at org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
>   at org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
>   at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
>   at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
>   at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
>   at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
>   at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
>   at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
>   at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
>   at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
>   at java.io.DataInputStream.read(DataInputStream.java:100)
>   at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
>   at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
>   at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
>   at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
>   at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to cause
> this, but HDFS-13617 and HDFS-14611 seem related, so tagging [~vagarychen] in
> case you have any ideas.
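The trace shows readFields() dispatching into readFieldsLegacy() and running off the end of the identifier bytes, which suggests a format-detection problem between the old Writable layout and a newer one. The sketch below shows the general mark/reset fallback pattern for decoding a value that may be in either of two layouts; the field layout and method bodies are invented placeholders, not the real BlockTokenIdentifier.

{code:java}
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Generic dual-format decode sketch (placeholder fields, not the real class).
public class DualFormatIdentifier {
  long expiryDate;
  String userId;

  public void readFields(byte[] identifierBytes) throws IOException {
    // BufferedInputStream guarantees mark/reset support.
    DataInputStream in = new DataInputStream(
        new BufferedInputStream(new ByteArrayInputStream(identifierBytes)));
    in.mark(identifierBytes.length);
    try {
      readFieldsLegacy(in);
    } catch (EOFException e) {
      // The legacy parse ran off the end of the buffer -- the symptom in the
      // stack trace above -- so rewind and try the newer layout instead.
      in.reset();
      readFieldsNew(in);
    }
  }

  private void readFieldsLegacy(DataInputStream in) throws IOException {
    expiryDate = in.readLong();  // placeholder legacy fields
    userId = in.readUTF();
  }

  private void readFieldsNew(DataInputStream in) throws IOException {
    userId = in.readUTF();       // placeholder new-layout fields
    expiryDate = in.readLong();
  }
}
{code}

The fragility of this pattern is that variable-length reads (readVInt/readVLong in the trace) can walk past the end of a short buffer before the mismatch is detected, surfacing as an EOFException rather than a clean format error, which is consistent with the cross-version symptom reported here.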
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067871#comment-17067871 ]

HuangTao commented on HDFS-15240:
---------------------------------

I will try to add a UT.

> Erasure Coding: dirty buffer causes reconstruction block error
> ----------------------------------------------------------------
>
>                 Key: HDFS-15240
>                 URL: https://issues.apache.org/jira/browse/HDFS-15240
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, erasure-coding
>            Reporter: HuangTao
>            Assignee: HuangTao
>            Priority: Major
>         Attachments: HDFS-15240.001.patch
>
> When reading some lzo files, we found some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from
> the DNs directly, chose 6 blocks (b0-b5) to decode the other 3 blocks (b6', b7',
> b8'), and found the longest common substring (LCS) between b6' (decoded) and b6
> (read from the DN), and likewise for b7'/b7 and b8'/b8.
> After selecting 6 blocks of the block group in every combination and iterating
> through all cases, I found one case where the LCS length is the block length -
> 64KB; 64KB is exactly the length of the ByteBuffer used by StripedBlockReader.
> So the corrupt reconstruction block was produced by a dirty buffer.
> The following log snippet (only 2 of the 28 cases are shown) is my check
> program's output. In my case, I knew the 3rd block was corrupt, so the other 5
> blocks were needed to decode another 3 blocks; I then found that the 1st block's
> LCS is the block length - 64KB.
> It means blocks (0,1,2,4,5,6) were used to reconstruct the 3rd block, and the
> dirty buffer was used before reading the 1st block.
> It must be noted that StripedBlockReader reads from offset 0 of the 1st block
> after the dirty buffer was used.
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 27197440 # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know the dirty buffer causes the reconstruction block error, but how does
> the dirty buffer come about?
> After digging into the code and the DN log, I found that the following DN log is
> the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel
> java.nio.channels.SocketChannel[connected local=/:52586 remote=/:50010].
> 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped block:
> BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
>   at org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
>   at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
>   at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
>   at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
>   at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
>   at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834) {code}
> Reading from a DN may time out (held by a future F) and output the INFO log
> above, but the futures map that contains the future F has already been cleared:
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
>     StripingChunkReadResult.CANCELLED); {code}
> futures.remove(future) returns null and causes the NPE, so the EC reconstruction
> fails. In the finally phase, the code snippet in *getStripedReader().close()*
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer and
> writes into it.
[jira] [Commented] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI
[ https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067792#comment-17067792 ]

Hudson commented on HDFS-13470:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18096 (See
[https://builds.apache.org/job/Hadoop-trunk-Commit/18096/])
HDFS-13470. RBF: Add Browse the Filesystem button to the UI. (inigoiri: rev 679631b1885cffaf8fc8d2677da15152db139065)
* (add) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/webapps/router/explorer.js
* (edit) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/webapps/router/federationhealth.html
* (add) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/webapps/router/explorer.html

> RBF: Add Browse the Filesystem button to the UI
> -------------------------------------------------
>
>                 Key: HDFS-13470
>                 URL: https://issues.apache.org/jira/browse/HDFS-13470
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Íñigo Goiri
>            Assignee: Íñigo Goiri
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch, HDFS-13470.002.patch
>
> After HDFS-12512 added WebHDFS, we can add support for browsing the filesystem
> to the UI.
[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067776#comment-17067776 ] Íñigo Goiri commented on HDFS-15242: Good catch on the lock. I think addConvertTemporaryToTbwOp() should be addConvertTemporaryToRbwOp(), right? > Add metrics for operations hold lock times of FsDatasetImpl > --- > > Key: HDFS-15242 > URL: https://issues.apache.org/jira/browse/HDFS-15242 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch > > > Some operations of FsDatasetImpl need to hold the lock, and sometimes they take a > long time to execute since they perform IO while holding the lock. I propose to add > metrics for these operations so that it is more convenient to monitor them and > dig into bottlenecks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
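A minimal sketch of the metric being reviewed, assuming the patch's general shape (the addConvertTemporaryToRbwOp name is taken from the review comment above; the metrics hook here is a hypothetical stand-in): time the critical section with a monotonic clock and record the elapsed time when the lock is released.

{code:java}
import org.apache.hadoop.util.AutoCloseableLock;
import org.apache.hadoop.util.Time;

class FsDatasetLockTimingSketch {
  // Hypothetical stand-in for the DataNode metrics class the patch extends.
  interface LockOpMetrics {
    void addConvertTemporaryToRbwOp(long elapsedMillis);
  }

  private final AutoCloseableLock datasetLock = new AutoCloseableLock();

  void convertTemporaryToRbw(LockOpMetrics metrics) {
    try (AutoCloseableLock lock = datasetLock.acquire()) {
      long begin = Time.monotonicNow(); // start after the lock is acquired
      try {
        // ... mutate replica state, possibly doing IO, under the lock ...
      } finally {
        // Record only the time the dataset lock was actually held.
        metrics.addConvertTemporaryToRbwOp(Time.monotonicNow() - begin);
      }
    }
  }
}
{code}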
[jira] [Updated] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI
[ https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated HDFS-13470: --- Fix Version/s: 3.3.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) > RBF: Add Browse the Filesystem button to the UI > --- > > Key: HDFS-13470 > URL: https://issues.apache.org/jira/browse/HDFS-13470 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Íñigo Goiri >Assignee: Íñigo Goiri >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch, > HDFS-13470.002.patch > > > After HDFS-12512 added WebHDFS, we can add the support to browse the > filesystem to the UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI
[ https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067770#comment-17067770 ] Íñigo Goiri commented on HDFS-13470: Thanks [~ayushtkn] for the review. Committed to trunk. > RBF: Add Browse the Filesystem button to the UI > --- > > Key: HDFS-13470 > URL: https://issues.apache.org/jira/browse/HDFS-13470 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Íñigo Goiri >Assignee: Íñigo Goiri >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch, > HDFS-13470.002.patch > > > After HDFS-12512 added WebHDFS, we can add the support to browse the > filesystem to the UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067682#comment-17067682 ] Fei Hui commented on HDFS-15240: [~marvelrock] Good catch! Thanks for reporting and fixing. Could you please add a UT? > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > Attachments: HDFS-15240.001.patch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15235) Transient network failure during NameNode failover makes cluster unavailable
[ https://issues.apache.org/jira/browse/HDFS-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067649#comment-17067649 ] YCozy commented on HDFS-15235: -- [~ayushtkn], could you please take a look at the patch? Thanks! > Transient network failure during NameNode failover makes cluster unavailable > > > Key: HDFS-15235 > URL: https://issues.apache.org/jira/browse/HDFS-15235 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: YCozy >Assignee: YCozy >Priority: Major > Attachments: HDFS-15235.001.patch > > > We have an HA cluster with two NameNodes: an active NN1 and a standby NN2. At > some point, NN1 becomes unhealthy and the admin tries to manually failover to > NN2 by running command > {code:java} > $ hdfs haadmin -failover NN1 NN2 > {code} > NN2 receives the request and becomes active: > {code:java} > 2020-03-24 00:24:56,412 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services > started for standby state > 2020-03-24 00:24:56,413 WARN > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer > interrupted: sleep interrupted > 2020-03-24 00:24:56,415 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services > required for active state > 2020-03-24 00:24:56,417 INFO > org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering > unfinalized segments in /app/ha-name-dir-shared/current > 2020-03-24 00:24:56,419 INFO > org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering > unfinalized segments in /app/nn2/name/current > 2020-03-24 00:24:56,419 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest > edits from old active before taking over writer role in edits logs > 2020-03-24 00:24:56,435 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Reading > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@7c3095fa > expecting start txid #1 > 2020-03-24 00:24:56,436 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file > /app/ha-name-dir-shared/current/edits_001-019 > maxTxnsToRead = 9223372036854775807 > 2020-03-24 00:24:56,441 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream > '/app/ha-name-dir-shared/current/edits_001-019' > to transaction ID 1 > 2020-03-24 00:24:56,567 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) (the last named > /app/ha-name-dir-shared/current/edits_001-019) > of total size 1305.0, total edits 19.0, total load time 109.0 ms > 2020-03-24 00:24:56,567 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all > datanodes as stale > 2020-03-24 00:24:56,568 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Processing 4 > messages from DataNodes that were previously queued during standby state > 2020-03-24 00:24:56,569 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication > and invalidation queues > 2020-03-24 00:24:56,569 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: initializing > replication queues > 2020-03-24 00:24:56,570 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing > edit logs at txnid 20 > 2020-03-24 00:24:56,571 INFO > org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 20 > 2020-03-24 00:24:56,812 INFO > org.apache.hadoop.hdfs.server.namenode.FSDirectory: Initializing quota with 4 > thread(s) > 2020-03-24 00:24:56,819 INFO > 
org.apache.hadoop.hdfs.server.namenode.FSDirectory: Quota initialization > completed in 6 millisecondsname space=3storage space=24690storage > types=RAM_DISK=0, SSD=0, DISK=0, ARCHIVE=0, PROVIDED=0 > 2020-03-24 00:24:56,827 INFO > org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: > Starting CacheReplicationMonitor with interval 3 milliseconds > {code} > But NN2 fails to send back the RPC response because of temporary network > partitioning. > {code:java} > java.io.EOFException: End of File Exception between local host is: > "24e7b5a52e85/172.17.0.2"; destination host is: "127.0.0.3":8180; : > java.io.EOFException; For more details see: > http://wiki.apache.org/hadoop/EOFException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at
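The core of this report is that the failover RPC response was lost even though NN2 did transition to active, so the admin tooling concludes the failover failed. A hedged sketch of the verification idea (an illustration using the public HAServiceProtocol calls, not the attached HDFS-15235.001.patch): before treating a timed-out failover as failed, re-query the target NameNode for its actual HA state.

{code:java}
import java.io.IOException;

import org.apache.hadoop.ha.HAServiceProtocol;
import org.apache.hadoop.ha.HAServiceProtocol.HAServiceState;
import org.apache.hadoop.ha.HAServiceStatus;

class FailoverVerifySketch {
  // 'target' is assumed to be an RPC proxy to the failover target (NN2).
  static boolean becameActive(HAServiceProtocol target) throws IOException {
    // The transition may have succeeded even if its RPC response was lost;
    // ask the target for its current state instead of assuming failure.
    HAServiceStatus status = target.getServiceStatus();
    return status.getState() == HAServiceState.ACTIVE;
  }
}
{code}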
[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HuangTao updated HDFS-15240: Description: When read some lzo files we found some blocks were broken. I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k) from DN directly, and choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') blocks. And find the longest common sequenece(LCS) between b6'(decoded) and b6(read from DN)(b7'/b7 and b8'/b8). After selecting 6 blocks of the block group in combinations one time and iterating through all cases, I find one case that the length of LCS is the block length - 64KB, 64KB is just the length of ByteBuffer used by StripedBlockReader. So the corrupt reconstruction block is made by a dirty buffer. The following log snippet(only show 2 of 28 cases) is my check program output. In my case, I known the 3th block is corrupt, so need other 5 blocks to decode another 3 blocks, then find the 1th block's LCS substring is block length - 64kb. It means (0,1,2,4,5,6)th blocks were used to reconstruct 3th block, and the dirty buffer was used before read the 1th block. Must be noted that StripedBlockReader read from the offset 0 of the 1th block after used the dirty buffer. {code:java} decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] Check Block(1) first 131072 bytes longest common substring length 4 Check Block(6) first 131072 bytes longest common substring length 4 Check Block(8) first 131072 bytes longest common substring length 4 decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] Check Block(1) first 131072 bytes longest common substring length 65536 CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 27197440 # this one Check Block(7) first 131072 bytes longest common substring length 4 Check Block(8) first 131072 bytes longest common substring length 4{code} Now I know the dirty buffer causes reconstruction block error, but how does the dirty buffer come about? After digging into the code and DN log, I found this following DN log is the root reason. {code:java} [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/:52586 remote=/:50010]. 18 millis timeout left. 
[WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped block: BP-714356632--1519726836856:blk_-YY_3472979393 java.lang.NullPointerException at org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) {code} Reading from DN may timeout(hold by a future(F)) and output the INFO log, but the futures that contains the future(F) is cleared, {code:java} return new StripingChunkReadResult(futures.remove(future), StripingChunkReadResult.CANCELLED); {code} futures.remove(future) cause NPE. So the EC reconstruction is failed. In the finally phase, the code snippet in *getStripedReader().close()* {code:java} reconstructor.freeBuffer(reader.getReadBuffer()); reader.freeReadBuffer(); reader.closeBlockReader(); {code} free buffer firstly, but the StripedBlockReader still holds the buffer and write it. was: When read some lzo files we found some blocks were broken. I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k) from DN directly, and choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') blocks. And find the longest common sequenece(LCS) between b6'(decoded) and b6(read from DN)(b7'/b7 and b8'/b8). After selecting 6 blocks of the block group in combinations one time and iterating through all cases, I find one case that the length of LCS is the block length - 64KB, 64KB is just the length of ByteBuffer used by StripedBlockReader. So the corrupt reconstruction block is made by a dirty buffer. The following log snippet(only show 2 of 28 cases) is my check program output. In my case, I known the 3th block is corrupt, so need other 5 blocks to decode another 3 blocks, then find the 1th block's LCS substring is block length
[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HuangTao updated HDFS-15240: Description: When read some lzo files we found some blocks were broken. I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k) from DN directly, and choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') blocks. And find the longest common sequenece(LCS) between b6'(decoded) and b6(read from DN)(b7'/b7 and b8'/b8). After selecting 6 blocks of the block group in combinations one time and iterating through all cases, I find one case that the length of LCS is the block length - 64KB, 64KB is just the length of ByteBuffer used by StripedBlockReader. So the corrupt reconstruction block is made by a dirty buffer. The following log snippet(only show 2 of 28 cases) is my check program output. In my case, I known the 3th block is corrupt, so need other 5 blocks to decode another 3 blocks, then find the 1th block's LCS substring is block length - 64kb. It means (0,1,2,4,5,6)th blocks were used to reconstruct 3th block, and the dirty buffer was used before read the 1th block. {code:java} decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] Check Block(1) first 131072 bytes longest common substring length 4 Check Block(6) first 131072 bytes longest common substring length 4 Check Block(8) first 131072 bytes longest common substring length 4 decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] Check Block(1) first 131072 bytes longest common substring length 65536 CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 27197440 # this one Check Block(7) first 131072 bytes longest common substring length 4 Check Block(8) first 131072 bytes longest common substring length 4{code} Now I know the dirty buffer causes reconstruction block error, but how does the dirty buffer come about? After digging into the code and DN log, I found this following DN log is the root reason. {code:java} [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/:52586 remote=/:50010]. 18 millis timeout left. [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped block: BP-714356632--1519726836856:blk_-YY_3472979393 java.lang.NullPointerException at org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) {code} Reading from DN may timeout(hold by a future(F)) and output the INFO log, but the futures that contains the future(F) is cleared, {code:java} return new StripingChunkReadResult(futures.remove(future), StripingChunkReadResult.CANCELLED); {code} futures.remove(future) cause NPE. 
So the EC reconstruction is failed. In the finally phase, the code snippet in *getStripedReader().close()* {code:java} reconstructor.freeBuffer(reader.getReadBuffer()); reader.freeReadBuffer(); reader.closeBlockReader(); {code} free buffer firstly, but the StripedBlockReader still holds the buffer and write it. was: When read some lzo files we found some blocks were broken. I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k), and choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') blocks. And find the longest common sequenece(LCS) between b6' and b6(b7'/b7 and b8'/b8). After selecting 6 blocks of the block group in combinations one time and iterating through all cases, I find one case that the length of LCS is the block length - 64KB, 64KB is just the length of ByteBuffer used by StripedBlockReader. So the corrupt reconstruction block is made by a dirty buffer. The following log snippet is my check program output {code:java} decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] Check Block(1) first 131072 bytes longest common substring length 4 Check Block(6) first 131072 bytes longest common substring length 4 Check Block(8) first 131072 bytes longest common substring length 4 decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] Check Block(1) first
[jira] [Commented] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI
[ https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067643#comment-17067643 ] Ayush Saxena commented on HDFS-13470: - Thanx [~elgoiri] for the update. v002 LGTM +1 > RBF: Add Browse the Filesystem button to the UI > --- > > Key: HDFS-13470 > URL: https://issues.apache.org/jira/browse/HDFS-13470 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Íñigo Goiri >Assignee: Íñigo Goiri >Priority: Major > Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch, > HDFS-13470.002.patch > > > After HDFS-12512 added WebHDFS, we can add the support to browse the > filesystem to the UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067569#comment-17067569 ] Yao Guangdong commented on HDFS-15240: -- [~weichiu] We have the same problem. PTAL. > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > Attachments: HDFS-15240.001.patch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067559#comment-17067559 ] guojh commented on HDFS-15240: -- [~surendrasingh] We have the same problem, please take a look. > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > Attachments: HDFS-15240.001.patch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067539#comment-17067539 ] HuangTao commented on HDFS-15240: - [~weichiu] PTAL, Thanks > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > Attachments: HDFS-15240.001.patch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HuangTao updated HDFS-15240: Attachment: HDFS-15240.001.patch > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > Attachments: HDFS-15240.001.patch -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15243) Child directory should not be deleted or renamed if parent directory is a protected directory
liuyanyu created HDFS-15243: --- Summary: Child directory should not be deleted or renamed if parent directory is a protected directory Key: HDFS-15243 URL: https://issues.apache.org/jira/browse/HDFS-15243 Project: Hadoop HDFS Issue Type: Bug Components: 3.1.1 Affects Versions: 3.1.1 Reporter: liuyanyu HDFS-8983 added fs.protected.directories to support protected directories on the NameNode. But as I tested, when a parent directory (e.g. /testA) is set as a protected directory, a child directory (e.g. /testA/testB) can still be deleted or renamed. Since we protect a directory mainly to protect the data under it, I think a child directory should not be deleted or renamed if its parent directory is a protected directory. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
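A minimal repro sketch of the reported behavior, assuming a MiniDFSCluster test environment (the paths mirror the example in the report; this illustrates the current behavior, not the proposed fix):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class ProtectedDirRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    conf.set("fs.protected.directories", "/testA"); // protect the parent only
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).build();
    try {
      FileSystem fs = cluster.getFileSystem();
      fs.mkdirs(new Path("/testA/testB"));
      // Deleting the non-empty protected parent itself is rejected with an
      // AccessControlException, but deleting the child succeeds today, which
      // is the behavior this issue argues should also be rejected.
      boolean childDeleted = fs.delete(new Path("/testA/testB"), true);
      System.out.println("child deleted = " + childDeleted); // prints true
    } finally {
      cluster.shutdown();
    }
  }
}
{code}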
[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HuangTao updated HDFS-15240: Description: When read some lzo files we found some blocks were broken. I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k), and choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') blocks. And find the longest common sequenece(LCS) between b6' and b6(b7'/b7 and b8'/b8). After selecting 6 blocks of the block group in combinations one time and iterating through all cases, I find one case that the length of LCS is the block length - 64KB, 64KB is just the length of ByteBuffer used by StripedBlockReader. So the corrupt reconstruction block is made by a dirty buffer. The following log snippet is my check program output {code:java} decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] Check Block(1) first 131072 bytes longest common substring length 4 Check Block(6) first 131072 bytes longest common substring length 4 Check Block(8) first 131072 bytes longest common substring length 4 decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] Check Block(1) first 131072 bytes longest common substring length 65536 CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 27197440 # this one Check Block(7) first 131072 bytes longest common substring length 4 Check Block(8) first 131072 bytes longest common substring length 4{code} Now I know the dirty buffer causes reconstruction block error, but how does the dirty buffer come about? After digging into the code and DN log, I found this following DN log is the root reason. {code:java} [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/:52586 remote=/:50010]. 18 millis timeout left. [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped block: BP-714356632--1519726836856:blk_-YY_3472979393 java.lang.NullPointerException at org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) {code} Reading from DN may timeout(hold by a future(F)) and output the INFO log, but the futures that contains the future(F) is cleared, {code:java} return new StripingChunkReadResult(futures.remove(future), StripingChunkReadResult.CANCELLED); {code} futures.remove(future) cause NPE. So the EC reconstruction is failed. In the finally phase, the code snippet in *getStripedReader().close()* {code:java} reconstructor.freeBuffer(reader.getReadBuffer()); reader.freeReadBuffer(); reader.closeBlockReader(); {code} free buffer firstly, but the StripedBlockReader still holds the buffer and write it. 
was: When read some lzo files we found some blocks were broken. I read back all internal blocks of the block group(RS-6-3-1024k), and choose 6 blocks to decode other 3 block > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > > When read some lzo files we found some blocks were broken. > I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k), and > choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') blocks. And find the > longest common sequenece(LCS) between b6' and b6(b7'/b7 and b8'/b8). > After selecting 6 blocks of the block group in combinations one time and > iterating through all cases, I find one case that the length of LCS is the > block length - 64KB, 64KB is just the length of ByteBuffer used by > StripedBlockReader. So the corrupt reconstruction block is made by a dirty > buffer. > The following log snippet is my check program output > {code:java} > decode from [0, 2, 3, 4, 5, 7] -> [1,
[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier
[ https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067520#comment-17067520 ] Steven Rand commented on HDFS-15191: Hi [~vagarychen], I'm wondering if you can take a look at the patch, or if you know someone else who could? Also, I wonder whether this is important enough to be part of the 3.3.0 release? As far as I can tell, it breaks backcompat with 2.x clusters where SASL is being used, which seems like a significant regression to me. Thanks, Steve > EOF when reading legacy buffer in BlockTokenIdentifier > -- > > Key: HDFS-15191 > URL: https://issues.apache.org/jira/browse/HDFS-15191 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.2.1 >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Major > Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, > HDFS-15191.003.patch, HDFS-15191.004.patch > > > We have an HDFS client application which recently upgraded from 3.2.0 to > 3.2.1. After this upgrade (but not before), we sometimes see these errors > when this application is used with clusters still running Hadoop 2.x (more > specifically CDH 5.12.1): > {code} > WARN [2020-02-24T00:54:32.856Z] > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing > remote block reader. (_sampled: true) > java.io.EOFException: > at java.io.DataInputStream.readByte(DataInputStream.java:272) > at > org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) > at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329) > at > org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240) > at > org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221) > at > org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170) > at > org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730) > at > org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380) > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644) > at > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575) > at > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829) > at java.io.DataInputStream.read(DataInputStream.java:100) > at 
org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314) > at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270) > at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291) > at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246) > at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765) > {code} > We get this warning for all DataNodes with a copy of the block, so the read > fails. > I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to > cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging > [~vagarychen] in case you have any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
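For readers unfamiliar with the code path in the stack trace: BlockTokenIdentifier supports both the legacy Writable layout and a protobuf layout, and readFields has to pick between them at parse time. One common shape for such dual-format readers is mark/try/reset/fallback; the sketch below illustrates that pattern under that assumption (readFieldsLegacy and readFieldsProtobuf are the real method names from the class, but the control flow here is an illustration, and this report is precisely about a case where the legacy attempt fails against tokens from 2.x SASL clusters):

{code:java}
import java.io.DataInputStream;
import java.io.IOException;

// Illustrative dual-format decode pattern; assumes a mark-supporting stream.
abstract class DualFormatTokenSketch {
  void readFields(DataInputStream in) throws IOException {
    in.mark(in.available());     // remember the start of the identifier
    try {
      readFieldsLegacy(in);      // 2.x Writable encoding (vints + Text)
    } catch (IOException e) {    // includes the EOFException logged above
      in.reset();                // rewind before trying the other layout
      readFieldsProtobuf(in);    // 3.x protobuf encoding
    }
  }

  abstract void readFieldsLegacy(DataInputStream in) throws IOException;
  abstract void readFieldsProtobuf(DataInputStream in) throws IOException;
}
{code}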
[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067482#comment-17067482 ] Hadoop QA commented on HDFS-15242: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 46s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 41s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 31s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 20m 13s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 38s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 15m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 39s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 45s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 14s{color} | {color:red} hadoop-common in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}108m 52s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 48s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}230m 3s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.security.TestFixKerberosTicketOrder | | | hadoop.security.TestRaceWhenRelogin | | | hadoop.hdfs.server.namenode.ha.TestBootstrapAliasmap | | | hadoop.hdfs.server.balancer.TestBalancer | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 | | JIRA Issue | HDFS-15242 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12997785/HDFS-15242.002.patch | | Optional Tests | dupname asflicense mvnsite compile javac javadoc mvninstall unit shadedclient findbugs checkstyle | | uname | Linux dd0468cedec7 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 0fa7bf4 | | maven | version: Apache Maven 3.3.9
[jira] [Commented] (HDFS-12733) Option to disable to namenode local edits
[ https://issues.apache.org/jira/browse/HDFS-12733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067442#comment-17067442 ] Xiaoqiao He commented on HDFS-12733: Thanks [~elgoiri],[~ayushtkn] for your great feedback. v008 does rely on setting `dfs.namenode.edits.dir` to blank to disable local edits. Yes, we need more information to clarify this change. It is true that their actions are different; IMO it is feasible to unify them so that we disable local edits when the config is blank, and tell our end users explicitly. Of course, patch v008 is not the complete story currently. I would like to update the patch if there is no objection. Thanks again [~elgoiri],[~ayushtkn] for your suggestions. > Option to disable to namenode local edits > - > > Key: HDFS-12733 > URL: https://issues.apache.org/jira/browse/HDFS-12733 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode, performance >Reporter: Brahma Reddy Battula >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-12733-001.patch, HDFS-12733-002.patch, > HDFS-12733-003.patch, HDFS-12733.004.patch, HDFS-12733.005.patch, > HDFS-12733.006.patch, HDFS-12733.007.patch, HDFS-12733.008.patch > > > As of now, edits are written to both local and shared locations, which is > redundant, and local edits are never used in an HA setup. > Disabling local edits gives a small performance improvement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
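A minimal configuration sketch of the semantics being discussed for v008 (an assumption about the patch's behavior, not a settled API): an explicitly blank dfs.namenode.edits.dir disables local edits, leaving the shared QJM directory as the only edits location in an HA setup.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class DisableLocalEditsSketch {
  public static Configuration build() {
    Configuration conf = new HdfsConfiguration();
    // Blank local edits dir: under the proposed v008 semantics this would
    // disable local edit logging entirely (hypothetical behavior).
    conf.set(DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_KEY, "");
    // Shared edits via QJM stay the authoritative edits location in HA.
    conf.set(DFSConfigKeys.DFS_NAMENODE_SHARED_EDITS_DIR_KEY,
        "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
    return conf;
  }
}
{code}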