[jira] [Comment Edited] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt

2020-03-26 Thread zhengchenyu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065461#comment-17065461
 ] 

zhengchenyu edited comment on HDFS-15237 at 3/27/20, 5:14 AM:
--

Let me note a strange phenomenon here. Why does this error happen? Because I found 
an incorrect internal block distribution like this: 
{code:java}
0. BP-1936287042-10.200.128.33-1573194961291:blk_-9223372036797335472_3783205 
len=247453749 Live_repl=8 
[blk_-9223372036797335472:DatanodeInfoWithStorage[10.200.128.43:9866,DS-2ddde0b8-6a84-4d06-8a40-d4ae5691e81c,DISK],
 
 
blk_-9223372036797335471:DatanodeInfoWithStorage[10.200.128.41:9866,DS-a4fc5486-6c45-481e-84e7-9393eeaf1313,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.50:9866,DS-fc0632c6-8916-42d8-8219-57b022bb2786,DISK],
 
 
blk_-9223372036797335469:DatanodeInfoWithStorage[10.200.128.54:9866,DS-1b6cb52a-f55a-4ef8-beaf-a5d7b7fe93aa,DISK],
 
 
blk_-9223372036797335467:DatanodeInfoWithStorage[10.200.128.52:9866,DS-fc6e00dd-ca5a-4580-9403-aeb6906da81a,DISK],
 
 
blk_-9223372036797335466:DatanodeInfoWithStorage[10.200.128.53:9866,DS-2c926a3b-64c0-441b-abe2-188e79918abe,DISK],
 
 
blk_-9223372036797335465:DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK],
 
 
blk_-9223372036797335464:DatanodeInfoWithStorage[10.200.128.44:9866,DS-3725af76-fe86-4f97-9740-d77bfa339b3f,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.45:9866,DS-250fd4cf-705f-4cb5-bc3a-c7a105247e35,DISK]]

{code}
This is the output of hdfs fsck. You can see this block group has 9 internal 
blocks, but blk_-9223372036797335468 is missing and blk_-9223372036797335470 
appears twice. Although the distribution is wrong, the block group still has 9 
internal blocks, so the replicatorMonitor can't repair this error. I think this 
may be a separate issue. ([~gjhkael] pointed out that HDFS-14754 is the related 
issue.)

This block is too old and the logs are gone, so I don't know the root cause and 
can't reproduce this error now. 
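The missing and duplicated internal blocks can be derived mechanically from the 
fsck listing above. A minimal illustrative sketch (it assumes the HDFS 
striped-block convention that an internal block ID equals the group's base ID 
plus the block index):

```python
# Illustrative sketch: find missing and duplicated internal block indices
# in the RS-6-3 block group from the fsck listing above. Assumes the HDFS
# striped-block convention: internal block id = group base id + index.
from collections import Counter

GROUP_BASE = -9223372036797335472   # internal block at index 0
TOTAL_BLOCKS = 9                    # RS(6,3): 6 data + 3 parity

# Internal block ids exactly as reported by fsck (note the repeated ...470).
reported = [
    -9223372036797335472, -9223372036797335471, -9223372036797335470,
    -9223372036797335469, -9223372036797335467, -9223372036797335466,
    -9223372036797335465, -9223372036797335464, -9223372036797335470,
]

index_counts = Counter(blk - GROUP_BASE for blk in reported)
missing = [i for i in range(TOTAL_BLOCKS) if i not in index_counts]
duplicated = sorted(i for i, n in index_counts.items() if n > 1)

print(missing)                      # [4]  -> blk_...468 is absent
print(duplicated)                   # [2]  -> blk_...470 is stored twice
print(sum(index_counts.values()))   # 9 replicas in total
```

Because the replica count still equals 9, the group looks fully replicated to 
the NameNode, which matches the observation that no repair is scheduled; with 
RS(6,3), 8 distinct surviving indices still allow the data to be reconstructed.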



> Get checksum of EC file failed, when some block is missing or corrupt
> -
>
> Key: HDFS-15237
> URL: https://issues.apache.org/jira/browse/HDFS-15237
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Priority: Major
> Fix For: 3.2.2
>
>
> When we distcp from an EC directory to another one, I see errors like this:
> {code}
> 2020-03-20 20:18:21,366 WARN [main] 
> org.apache.hadoop.hdfs.FileChecksumHelper: src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]2020-03-20
>  20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper: 
> src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]java.io.EOFException:
>  Unexpected EOF while trying 

[jira] [Commented] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt

2020-03-26 Thread zhengchenyu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068275#comment-17068275
 ] 

zhengchenyu commented on HDFS-15237:


[~gjhkael] Oh, I see. I reviewed HDFS-14754, and it solves exactly my problem. 
Thank you!

> Get checksum of EC file failed, when some block is missing or corrupt
> -
>
> Key: HDFS-15237
> URL: https://issues.apache.org/jira/browse/HDFS-15237
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Priority: Major
> Fix For: 3.2.2
>
>
> When we distcp from an EC directory to another one, I see errors like this:
> {code}
> 2020-03-20 20:18:21,366 WARN [main] 
> org.apache.hadoop.hdfs.FileChecksumHelper: src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]2020-03-20
>  20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper: 
> src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]java.io.EOFException:
>  Unexpected EOF while trying to read response from server at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:550)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.tryDatanode(FileChecksumHelper.java:709)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlockGroup(FileChecksumHelper.java:664)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlocks(FileChecksumHelper.java:638)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$FileChecksumComputer.compute(FileChecksumHelper.java:252)
>  at 
> org.apache.hadoop.hdfs.DFSClient.getFileChecksumInternal(DFSClient.java:1790) 
> at 
> org.apache.hadoop.hdfs.DFSClient.getFileChecksumWithCombineMode(DFSClient.java:1810)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1691)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1688)
>  at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1700)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:138)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:115)
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:259)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:220) at 
> org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48) at 
> org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) at 
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799) at 
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:347) at 
> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {code}
> And then I found errors like this in the datanode:
> {code}
> 2020-03-20 20:54:16,573 INFO 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: 
> SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = 
> false
> 2020-03-20 20:54:16,577 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> bd-hadoop-128050.zeus.lianjia.com:9866:DataXceiver error processing 
> BLOCK_GROUP_CHECKSUM operation src: /10.201.1.38:33264 dst: 
> /10.200.128.50:9866
> java.lang.UnsupportedOperationException
>  at java.nio.ByteBuffer.array(ByteBuffer.java:994)
>  at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockChecksumReconstructor.reconstruct(StripedBlockChecksumReconstructor.java:90)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.recalculateChecksum(BlockChecksumHelper.java:711)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.compute(BlockChecksumHelper.java:489)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.blockGroupChecksum(DataXceiver.java:1047)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opStripedBlockChecksum(Receiver.java:327)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:119)

[jira] [Comment Edited] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt

2020-03-26 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068263#comment-17068263
 ] 

guojh edited comment on HDFS-15237 at 3/27/20, 4:42 AM:


[~zhengchenyu] Have you tried this patch: 
[HDFS-14754|https://issues.apache.org/jira/browse/HDFS-14754]? It looks like the 
redundant block was not deleted first.


was (Author: gjhkael):
[~zhengchenyu] Have you try this patch 
(https://issues.apache.org/jira/browse/HDFS-14754)



[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068170#comment-17068170
 ] 

Hadoop QA commented on HDFS-15240:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
33s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
13s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
21m  6s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  7m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
42s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
26s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
20s{color} | {color:red} hadoop-hdfs-client in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  1m 
42s{color} | {color:red} root in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  1m 42s{color} 
| {color:red} root in the patch failed. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 47s{color} | {color:orange} The patch fails to run checkstyle in root 
{color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
21s{color} | {color:red} hadoop-hdfs-client in the patch failed. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  3m  
0s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
19s{color} | {color:red} hadoop-hdfs-client in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
16s{color} | {color:red} hadoop-hdfs-client in the patch failed. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  8m 
49s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 19s{color} 
| {color:red} hadoop-hdfs-client in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 91m 16s{color} 
| {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
35s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}188m  1s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 |
| JIRA Issue | HDFS-15240 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12997870/HDFS-15240.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 4d271edefa4e 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 

[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068119#comment-17068119
 ] 

Íñigo Goiri commented on HDFS-15242:


+1 on [^HDFS-15242.003.patch].

> Add metrics for operations hold lock times of FsDatasetImpl
> ---
>
> Key: HDFS-15242
> URL: https://issues.apache.org/jira/browse/HDFS-15242
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch, 
> HDFS-15242.003.patch
>
>
> Some operations of FsDatasetImpl need to hold the lock, and they can take a 
> long time to execute because they perform I/O while holding it. I propose 
> adding metrics for these operations to make it easier to monitor them and 
> find bottlenecks.
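The metric the description proposes (per-operation lock hold time) can be 
sketched generically. Everything below is hypothetical, a minimal Python sketch 
rather than the actual patch, which instruments FsDatasetImpl's internal lock in 
Java; the operation name "finalizeBlock" is only an example:

```python
# Hedged sketch: record how long each named operation holds a shared lock.
# All names are hypothetical stand-ins for FsDatasetImpl's instrumented lock.
import threading
import time
from collections import defaultdict
from contextlib import contextmanager

class InstrumentedLock:
    def __init__(self):
        self._lock = threading.Lock()
        self.held_seconds = defaultdict(float)  # operation name -> total hold time

    @contextmanager
    def acquire(self, op_name):
        # Measure only the time spent while the lock is actually held.
        with self._lock:
            start = time.monotonic()
            try:
                yield
            finally:
                self.held_seconds[op_name] += time.monotonic() - start

lock = InstrumentedLock()
with lock.acquire("finalizeBlock"):
    time.sleep(0.01)  # stand-in for I/O performed while holding the lock

# Hold time includes at least the 0.01 s sleep.
print(lock.held_seconds["finalizeBlock"])
```

Exposing `held_seconds` as a histogram or rate metric would then make long 
lock-held I/O visible to monitoring, which is the bottleneck the issue targets.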



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-26 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068091#comment-17068091
 ] 

Hadoop QA commented on HDFS-15242:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
51s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
10s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
20m 34s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
21s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 15m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  4s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  6m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
39s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  9m  
7s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}115m  7s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
53s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}239m  4s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
|   | hadoop.hdfs.server.balancer.TestBalancer |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 |
| JIRA Issue | HDFS-15242 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12997922/HDFS-15242.003.patch |
| Optional Tests |  dupname  asflicense  mvnsite  compile  javac  javadoc  
mvninstall  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 2742ac603bf8 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 745a6c1 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
| unit | 

[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-15240:
---
Status: Patch Available  (was: Open)

> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
> Attachments: HDFS-15240.001.patch
>
>
> When reading some lzo files, we found some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from 
> the DN directly, chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), 
> and found the longest common sequence (LCS) between b6' (decoded) and 
> b6 (read from the DN), and likewise for b7'/b7 and b8'/b8.
> After selecting 6 blocks of the block group in every combination and 
> iterating through all cases, I found one case where the length of the LCS is 
> the block length - 64KB; 64KB is exactly the length of the ByteBuffer used by 
> StripedBlockReader. So the corrupt reconstruction block was made by a dirty 
> buffer.
> The following log snippet (only 2 of 28 cases shown) is my check program's 
> output. In my case, I knew the 3rd block was corrupt, so 5 other blocks were 
> needed to decode another 3 blocks; I then found the 1st block's LCS length is 
> the block length - 64KB.
> It means blocks (0,1,2,4,5,6) were used to reconstruct the 3rd block, and the 
> dirty buffer was used before reading the 1st block.
> It must be noted that StripedBlockReader read from offset 0 of the 1st block 
> after the dirty buffer had been used.
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
> 27197440  # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know the dirty buffer causes the reconstruction block error, but how 
> does the dirty buffer come about?
> After digging into the code and the DN log, I found the following DN log shows 
> the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/:52586 
> remote=/:50010]. 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped 
> block: BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834) {code}
> Reading from the DN may time out (tracked by a future F) and output the INFO 
> log above, but the futures map that contains future F has already been cleared, 
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
> StripingChunkReadResult.CANCELLED); {code}
> so futures.remove(future) returns null, which causes an NPE, and the EC 
> reconstruction fails. Then, in the finally phase, the code snippet in 
> *getStripedReader().close()*
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer and 
> writes to it.
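
The NPE mechanism described above can be reproduced in isolation: once the futures map has been cleared, a map removal returns null, and dereferencing that result throws. A small sketch of the failure mode with the obvious null-guard (the `Chunk` type and `resolve` helper are illustrative stand-ins, not the HDFS classes):

```java
// Illustrative stand-ins for the striped-read bookkeeping; not HDFS code.
public class ClearedFutureSketch {
    static class Chunk {
        final int index;
        Chunk(int index) { this.index = index; }
    }

    /** Returns the chunk's index, or -1 if the future was already cleared. */
    static int resolve(java.util.Map<String, Chunk> futures, String future) {
        Chunk chunk = futures.remove(future); // null once the map was cleared
        if (chunk == null) {
            return -1; // guard: without it, chunk.index would throw an NPE
        }
        return chunk.index;
    }
}
```

The same shape of guard, applied where the removed future is consumed, is one way to keep a late timeout from crashing the whole reconstruction task.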




[jira] [Commented] (HDFS-15169) RBF: Router FSCK should consider the mount table

2020-03-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067989#comment-17067989
 ] 

Íñigo Goiri commented on HDFS-15169:


Thanks [~hexiaoqiao] for the unit test, it covers my concerns.
I'm thinking about refactoring the main code to:
{code}
/**
 * Get path location using path parameter of fsck request from client.
 */
private Set<String> getNameSpacesForPath(String[] paths) {
  if (paths == null || paths.length == 0) {
    return null;
  }
  String path = paths[0];
  PathLocation pathLocation =
      router.getSubclusterResolver().getDestinationForPath(path);
  return pathLocation.getNamespaces();
}

/**
 * Redirect the request to a certain active downstream NameNode if the
 * target namespace resolves; otherwise redirect the request to all active
 * downstream NameNodes.
 */
private List<MembershipState> getMembershipsForPath(
    final String[] paths, final List<MembershipState> memberships) {
  Set<String> nss = getNameSpacesForPath(paths);
  if (nss == null || nss.isEmpty()) {
    return memberships;
  }
  List<MembershipState> targetMemberships = new ArrayList<>();
  for (String ns : nss) {
    for (MembershipState ms : memberships) {
      if (ms.getState() == FederationNamenodeServiceState.ACTIVE
          && ns.equals(ms.getNameserviceId())) {
        targetMemberships.add(ms);
      }
    }
  }
  return targetMemberships;
}

public void fsck() {
  ...
  String[] paths = pmap.get("path");
  List<MembershipState> targetMemberships = getMembershipsForPath(
      paths, memberships);

  for (MembershipState nn : targetMemberships) {
    ...
  }
}
{code}

> RBF: Router FSCK should consider the mount table
> 
>
> Key: HDFS-15169
> URL: https://issues.apache.org/jira/browse/HDFS-15169
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Akira Ajisaka
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15169.001.patch, HDFS-15169.002.patch
>
>
> HDFS-13989 implemented FSCK in the DFSRouter; however, for now it just 
> redirects the requests to all the active downstream NameNodes. The DFSRouter 
> should consider the mount table when redirecting the requests.






[jira] [Comment Edited] (HDFS-15169) RBF: Router FSCK should consider the mount table

2020-03-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067989#comment-17067989
 ] 

Íñigo Goiri edited comment on HDFS-15169 at 3/26/20, 7:52 PM:
--

Thanks [~hexiaoqiao] for the unit test, it covers my concerns.
I'm thinking about refactoring the main code to:
{code}
/**
 * Get path location using path parameter of fsck request from client.
 */
private Set<String> getNameSpacesForPath(String[] paths) {
  if (paths == null || paths.length == 0) {
    return null;
  }
  String path = paths[0];
  PathLocation pathLocation =
      router.getSubclusterResolver().getDestinationForPath(path);
  return pathLocation.getNamespaces();
}

/**
 * Redirect the request to a certain active downstream NameNode if the
 * target namespace resolves; otherwise redirect the request to all active
 * downstream NameNodes.
 */
private List<MembershipState> getMembershipsForPath(
    final String[] paths, final List<MembershipState> memberships) {
  Set<String> nss = getNameSpacesForPath(paths);
  if (nss == null || nss.isEmpty()) {
    return memberships;
  }
  List<MembershipState> targetMemberships = new ArrayList<>();
  for (String ns : nss) {
    for (MembershipState ms : memberships) {
      if (ms.getState() == FederationNamenodeServiceState.ACTIVE
          && ns.equals(ms.getNameserviceId())) {
        targetMemberships.add(ms);
      }
    }
  }
  return targetMemberships;
}

public void fsck() {
  ...
  String[] paths = pmap.get("path");
  List<MembershipState> targetMemberships = getMembershipsForPath(
      paths, memberships);

  for (MembershipState nn : targetMemberships) {
    ...
  }
}
{code}

BTW, let's fix the checkstyle warnings; the failed unit test might be related.


was (Author: elgoiri):
Thanks [~hexiaoqiao] for the unit test, it covers my concerns.
I'm thinking about refactoring the main code to:
{code}
/**
 * Get path location using path parameter of fsck request from client.
 */
private Set<String> getNameSpacesForPath(String[] paths) {
  if (paths == null || paths.length == 0) {
    return null;
  }
  String path = paths[0];
  PathLocation pathLocation =
      router.getSubclusterResolver().getDestinationForPath(path);
  return pathLocation.getNamespaces();
}

/**
 * Redirect the request to a certain active downstream NameNode if the
 * target namespace resolves; otherwise redirect the request to all active
 * downstream NameNodes.
 */
private List<MembershipState> getMembershipsForPath(
    final String[] paths, final List<MembershipState> memberships) {
  Set<String> nss = getNameSpacesForPath(paths);
  if (nss == null || nss.isEmpty()) {
    return memberships;
  }
  List<MembershipState> targetMemberships = new ArrayList<>();
  for (String ns : nss) {
    for (MembershipState ms : memberships) {
      if (ms.getState() == FederationNamenodeServiceState.ACTIVE
          && ns.equals(ms.getNameserviceId())) {
        targetMemberships.add(ms);
      }
    }
  }
  return targetMemberships;
}

public void fsck() {
  ...
  String[] paths = pmap.get("path");
  List<MembershipState> targetMemberships = getMembershipsForPath(
      paths, memberships);

  for (MembershipState nn : targetMemberships) {
    ...
  }
}
{code}

> RBF: Router FSCK should consider the mount table
> 
>
> Key: HDFS-15169
> URL: https://issues.apache.org/jira/browse/HDFS-15169
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Akira Ajisaka
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15169.001.patch, HDFS-15169.002.patch
>
>
> HDFS-13989 implemented FSCK in the DFSRouter; however, for now it just 
> redirects the requests to all the active downstream NameNodes. The DFSRouter 
> should consider the mount table when redirecting the requests.






[jira] [Updated] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15242:
---
Attachment: HDFS-15242.003.patch

> Add metrics for operations hold lock times of FsDatasetImpl
> ---
>
> Key: HDFS-15242
> URL: https://issues.apache.org/jira/browse/HDFS-15242
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch, 
> HDFS-15242.003.patch
>
>
> Some operations of FsDatasetImpl need to hold the lock, and they can sometimes 
> take a long time to execute since they include IO while holding the lock. I 
> propose to add metrics for these operations so it is more convenient to 
> monitor them and find bottlenecks.






[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067915#comment-17067915
 ] 

Xiaoqiao He commented on HDFS-15242:


Thanks [~elgoiri] for your reviews. v003 fixes the typo. PTAL.

> Add metrics for operations hold lock times of FsDatasetImpl
> ---
>
> Key: HDFS-15242
> URL: https://issues.apache.org/jira/browse/HDFS-15242
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch, 
> HDFS-15242.003.patch
>
>
> Some operations of FsDatasetImpl need to hold the lock, and they can sometimes 
> take a long time to execute since they include IO while holding the lock. I 
> propose to add metrics for these operations so it is more convenient to 
> monitor them and find bottlenecks.






[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-26 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067897#comment-17067897
 ] 

Chen Liang commented on HDFS-15191:
---

Hey [~Steven Rand], sorry, I did plan to take another look but have been busy 
recently. I will take a look today or tomorrow.

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch, HDFS-15191.004.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.
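
Per the stack trace, the EOFException surfaces from DataInputStream.readByte while readFieldsLegacy parses vints via WritableUtils, i.e. the legacy parser runs out of bytes mid-field. A tiny reproduction of just that stream mechanism (illustrative only, not HDFS code; the helper name is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Minimal reproduction of the reported failure mode: reading past the end of
// a serialized identifier makes DataInputStream.readByte throw EOFException,
// the same call that fails inside readFieldsLegacy via WritableUtils.readVInt.
public class LegacyBufferEofSketch {
    /** Consumes bytes until the stream is exhausted; true if EOF was hit. */
    static boolean readsPastEnd(byte[] serialized) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(serialized));
        try {
            while (true) {
                in.readByte(); // throws EOFException once the bytes run out
            }
        } catch (EOFException expected) {
            return true;
        }
    }
}
```

This is consistent with the hypothesis that a 3.2.1 client expects more fields in the token buffer than the 2.x-era serialization actually contains.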






[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread HuangTao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067871#comment-17067871
 ] 

HuangTao commented on HDFS-15240:
-

I will try to add a UT.

> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
> Attachments: HDFS-15240.001.patch
>
>
> When reading some lzo files, we found some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from 
> the DN directly, chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), 
> and found the longest common sequence (LCS) between b6' (decoded) and 
> b6 (read from the DN), and likewise for b7'/b7 and b8'/b8.
> After selecting 6 blocks of the block group in every combination and 
> iterating through all cases, I found one case where the length of the LCS is 
> the block length - 64KB; 64KB is exactly the length of the ByteBuffer used by 
> StripedBlockReader. So the corrupt reconstruction block was made by a dirty 
> buffer.
> The following log snippet (only 2 of 28 cases shown) is my check program's 
> output. In my case, I knew the 3rd block was corrupt, so 5 other blocks were 
> needed to decode another 3 blocks; I then found the 1st block's LCS length is 
> the block length - 64KB.
> It means blocks (0,1,2,4,5,6) were used to reconstruct the 3rd block, and the 
> dirty buffer was used before reading the 1st block.
> It must be noted that StripedBlockReader read from offset 0 of the 1st block 
> after the dirty buffer had been used.
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
> 27197440  # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know the dirty buffer causes the reconstruction block error, but how 
> does the dirty buffer come about?
> After digging into the code and the DN log, I found the following DN log shows 
> the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/:52586 
> remote=/:50010]. 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped 
> block: BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834) {code}
> Reading from the DN may time out (tracked by a future F) and output the INFO 
> log above, but the futures map that contains future F has already been cleared, 
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
> StripingChunkReadResult.CANCELLED); {code}
> so futures.remove(future) returns null, which causes an NPE, and the EC 
> reconstruction fails. Then, in the finally phase, the code snippet in 
> *getStripedReader().close()*
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer and 
> writes to it.




[jira] [Commented] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI

2020-03-26 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067792#comment-17067792
 ] 

Hudson commented on HDFS-13470:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18096 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18096/])
HDFS-13470. RBF: Add Browse the Filesystem button to the UI. (inigoiri: rev 
679631b1885cffaf8fc8d2677da15152db139065)
* (add) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/webapps/router/explorer.js
* (edit) 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/webapps/router/federationhealth.html
* (add) 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/webapps/router/explorer.html


> RBF: Add Browse the Filesystem button to the UI
> ---
>
> Key: HDFS-13470
> URL: https://issues.apache.org/jira/browse/HDFS-13470
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch, 
> HDFS-13470.002.patch
>
>
> After HDFS-12512 added WebHDFS, we can add support for browsing the 
> filesystem to the UI.






[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067776#comment-17067776
 ] 

Íñigo Goiri commented on HDFS-15242:


Good catch on the lock.
I think addConvertTemporaryToTbwOp() should be addConvertTemporaryToRbwOp(), 
right?

> Add metrics for operations hold lock times of FsDatasetImpl
> ---
>
> Key: HDFS-15242
> URL: https://issues.apache.org/jira/browse/HDFS-15242
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch
>
>
> Some operations of FsDatasetImpl need to hold the lock, and they can sometimes 
> take a long time to execute since they include IO while holding the lock. I 
> propose to add metrics for these operations so it is more convenient to 
> monitor them and find bottlenecks.






[jira] [Updated] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI

2020-03-26 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-13470:
---
Fix Version/s: 3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> RBF: Add Browse the Filesystem button to the UI
> ---
>
> Key: HDFS-13470
> URL: https://issues.apache.org/jira/browse/HDFS-13470
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch, 
> HDFS-13470.002.patch
>
>
> After HDFS-12512 added WebHDFS, we can add support for browsing the 
> filesystem to the UI.






[jira] [Commented] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI

2020-03-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067770#comment-17067770
 ] 

Íñigo Goiri commented on HDFS-13470:


Thanks [~ayushtkn] for the review.
Committed to trunk.

> RBF: Add Browse the Filesystem button to the UI
> ---
>
> Key: HDFS-13470
> URL: https://issues.apache.org/jira/browse/HDFS-13470
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch, 
> HDFS-13470.002.patch
>
>
> After HDFS-12512 added WebHDFS, we can add support for browsing the 
> filesystem to the UI.






[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067682#comment-17067682
 ] 

Fei Hui commented on HDFS-15240:


[~marvelrock] Good catch! Thanks for reporting and fixing.
Could you please add a UT?

> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
> Attachments: HDFS-15240.001.patch
>
>
> When reading some lzo files, we found some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from 
> the DN directly, chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), 
> and found the longest common sequence (LCS) between b6' (decoded) and 
> b6 (read from the DN), and likewise for b7'/b7 and b8'/b8.
> After selecting 6 blocks of the block group in every combination and 
> iterating through all cases, I found one case where the length of the LCS is 
> the block length - 64KB; 64KB is exactly the length of the ByteBuffer used by 
> StripedBlockReader. So the corrupt reconstruction block was made by a dirty 
> buffer.
> The following log snippet (only 2 of 28 cases shown) is my check program's 
> output. In my case, I knew the 3rd block was corrupt, so 5 other blocks were 
> needed to decode another 3 blocks; I then found the 1st block's LCS length is 
> the block length - 64KB.
> It means blocks (0,1,2,4,5,6) were used to reconstruct the 3rd block, and the 
> dirty buffer was used before reading the 1st block.
> It must be noted that StripedBlockReader read from offset 0 of the 1st block 
> after the dirty buffer had been used.
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
> 27197440  # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know a dirty buffer causes the reconstruction block error, but where 
> does the dirty buffer come from?
> After digging into the code and the DN log, I found that the following DN log 
> reveals the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/:52586 
> remote=/:50010]. 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped 
> block: BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834) {code}
> A read from a DN may time out (the read is held by a future F) and emit the 
> INFO log above, but the futures map that contained future F has already been 
> cleared:
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
> StripingChunkReadResult.CANCELLED); {code}
> futures.remove(future) then returns null, causing the NPE, so the EC 
> reconstruction fails. In the finally phase, the code snippet in 
> *getStripedReader().close()* 
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer and 
> writes to it.




[jira] [Commented] (HDFS-15235) Transient network failure during NameNode failover makes cluster unavailable

2020-03-26 Thread YCozy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067649#comment-17067649
 ] 

YCozy commented on HDFS-15235:
--

[~ayushtkn], could you please take a look at the patch? Thanks!

> Transient network failure during NameNode failover makes cluster unavailable
> 
>
> Key: HDFS-15235
> URL: https://issues.apache.org/jira/browse/HDFS-15235
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: YCozy
>Assignee: YCozy
>Priority: Major
> Attachments: HDFS-15235.001.patch
>
>
> We have an HA cluster with two NameNodes: an active NN1 and a standby NN2. At 
> some point, NN1 becomes unhealthy and the admin tries to manually fail over to 
> NN2 by running the command
> {code:java}
> $ hdfs haadmin -failover NN1 NN2
> {code}
> NN2 receives the request and becomes active:
> {code:java}
> 2020-03-24 00:24:56,412 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services 
> started for standby state
> 2020-03-24 00:24:56,413 WARN 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer 
> interrupted: sleep interrupted
> 2020-03-24 00:24:56,415 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services 
> required for active state
> 2020-03-24 00:24:56,417 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in /app/ha-name-dir-shared/current
> 2020-03-24 00:24:56,419 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in /app/nn2/name/current
> 2020-03-24 00:24:56,419 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest 
> edits from old active before taking over writer role in edits logs
> 2020-03-24 00:24:56,435 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Reading 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@7c3095fa 
> expecting start txid #1
> 2020-03-24 00:24:56,436 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file 
> /app/ha-name-dir-shared/current/edits_001-019 
> maxTxnsToRead = 9223372036854775807
> 2020-03-24 00:24:56,441 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 
> '/app/ha-name-dir-shared/current/edits_001-019'
>  to transaction ID 1
> 2020-03-24 00:24:56,567 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named 
> /app/ha-name-dir-shared/current/edits_001-019)
>  of total size 1305.0, total edits 19.0, total load time 109.0 ms
> 2020-03-24 00:24:56,567 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all 
> datanodes as stale
> 2020-03-24 00:24:56,568 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Processing 4 
> messages from DataNodes that were previously queued during standby state
> 2020-03-24 00:24:56,569 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication 
> and invalidation queues
> 2020-03-24 00:24:56,569 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: initializing 
> replication queues
> 2020-03-24 00:24:56,570 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing 
> edit logs at txnid 20
> 2020-03-24 00:24:56,571 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 20
> 2020-03-24 00:24:56,812 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Initializing quota with 4 
> thread(s)
> 2020-03-24 00:24:56,819 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Quota initialization 
> completed in 6 millisecondsname space=3storage space=24690storage 
> types=RAM_DISK=0, SSD=0, DISK=0, ARCHIVE=0, PROVIDED=0
> 2020-03-24 00:24:56,827 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: 
> Starting CacheReplicationMonitor with interval 3 milliseconds
> {code}
> But NN2 fails to send back the RPC response because of a temporary network 
> partition.
> {code:java}
> java.io.EOFException: End of File Exception between local host is: 
> "24e7b5a52e85/172.17.0.2"; destination host is: "127.0.0.3":8180; : 
> java.io.EOFException; For more details see:  
> http://wiki.apache.org/hadoop/EOFException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at 

[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread HuangTao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HuangTao updated HDFS-15240:

Description: 
When reading some lzo files, we found that some blocks were broken.

I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from DN 
directly, chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), and found 
the longest common sequence (LCS) between b6' (decoded) and b6 (read from DN) 
(likewise b7'/b7 and b8'/b8).

After selecting 6 blocks of the block group per combination and iterating through 
all cases, I found one case where the length of the LCS is the block length - 
64KB; 64KB is exactly the length of the ByteBuffer used by StripedBlockReader. So 
the corrupt reconstruction block was produced by a dirty buffer.

The following log snippet (showing only 2 of 28 cases) is my check program 
output. In my case, I knew the 3rd block was corrupt, so I needed 5 other blocks 
to decode another 3 blocks, and found that the 1st block's LCS length is the 
block length - 64KB.

It means blocks (0,1,2,4,5,6) were used to reconstruct the 3rd block, and the 
dirty buffer was used before reading the 1st block.

It must be noted that StripedBlockReader read from offset 0 of the 1st block 
after the dirty buffer had been used.
{code:java}
decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
Check Block(1) first 131072 bytes longest common substring length 4
Check Block(6) first 131072 bytes longest common substring length 4
Check Block(8) first 131072 bytes longest common substring length 4
decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
Check Block(1) first 131072 bytes longest common substring length 65536
CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
27197440  # this one
Check Block(7) first 131072 bytes longest common substring length 4
Check Block(8) first 131072 bytes longest common substring length 4{code}
Now I know a dirty buffer causes the reconstruction block error, but where does 
the dirty buffer come from?

After digging into the code and the DN log, I found that the following DN log 
reveals the root cause.
{code:java}
[INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
java.nio.channels.SocketChannel[connected local=/:52586 
remote=/:50010]. 18 millis timeout left.
[WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped block: 
BP-714356632--1519726836856:blk_-YY_3472979393
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834) {code}
A read from a DN may time out (the read is held by a future F) and emit the INFO 
log above, but the futures map that contained future F has already been cleared:
{code:java}
return new StripingChunkReadResult(futures.remove(future),
StripingChunkReadResult.CANCELLED); {code}
futures.remove(future) then returns null, causing the NPE, so the EC 
reconstruction fails. In the finally phase, the code snippet in 
*getStripedReader().close()* 
{code:java}
reconstructor.freeBuffer(reader.getReadBuffer());
reader.freeReadBuffer();
reader.closeBlockReader(); {code}
frees the buffer first, but the StripedBlockReader still holds the buffer and 
writes to it.

  was:
When read some lzo files we found some blocks were broken.

I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k) from DN 
directly, and choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') blocks. 
And find the longest common sequenece(LCS) between b6'(decoded) and b6(read 
from DN)(b7'/b7 and b8'/b8).

After selecting 6 blocks of the block group in combinations one time and 
iterating through all cases, I find one case that the length of LCS is the 
block length - 64KB, 64KB is just the length of ByteBuffer used by 
StripedBlockReader. So the corrupt reconstruction block is made by a dirty 
buffer.

The following log snippet(only show 2 of 28 cases) is my check program output. 
In my case, I known the 3th block is corrupt, so need other 5 blocks to decode 
another 3 blocks, then find the 1th block's LCS substring is block length 

[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread HuangTao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HuangTao updated HDFS-15240:

Description: 
When reading some lzo files, we found that some blocks were broken.

I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from DN 
directly, chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), and found 
the longest common sequence (LCS) between b6' (decoded) and b6 (read from DN) 
(likewise b7'/b7 and b8'/b8).

After selecting 6 blocks of the block group per combination and iterating through 
all cases, I found one case where the length of the LCS is the block length - 
64KB; 64KB is exactly the length of the ByteBuffer used by StripedBlockReader. So 
the corrupt reconstruction block was produced by a dirty buffer.

The following log snippet (showing only 2 of 28 cases) is my check program 
output. In my case, I knew the 3rd block was corrupt, so I needed 5 other blocks 
to decode another 3 blocks, and found that the 1st block's LCS length is the 
block length - 64KB.

It means blocks (0,1,2,4,5,6) were used to reconstruct the 3rd block, and the 
dirty buffer was used before reading the 1st block.
{code:java}
decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
Check Block(1) first 131072 bytes longest common substring length 4
Check Block(6) first 131072 bytes longest common substring length 4
Check Block(8) first 131072 bytes longest common substring length 4
decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
Check Block(1) first 131072 bytes longest common substring length 65536
CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
27197440  # this one
Check Block(7) first 131072 bytes longest common substring length 4
Check Block(8) first 131072 bytes longest common substring length 4{code}
Now I know a dirty buffer causes the reconstruction block error, but where does 
the dirty buffer come from?

After digging into the code and the DN log, I found that the following DN log 
reveals the root cause.
{code:java}
[INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
java.nio.channels.SocketChannel[connected local=/:52586 
remote=/:50010]. 18 millis timeout left.
[WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped block: 
BP-714356632--1519726836856:blk_-YY_3472979393
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834) {code}
A read from a DN may time out (the read is held by a future F) and emit the INFO 
log above, but the futures map that contained future F has already been cleared:
{code:java}
return new StripingChunkReadResult(futures.remove(future),
StripingChunkReadResult.CANCELLED); {code}
futures.remove(future) then returns null, causing the NPE, so the EC 
reconstruction fails. In the finally phase, the code snippet in 
*getStripedReader().close()* 
{code:java}
reconstructor.freeBuffer(reader.getReadBuffer());
reader.freeReadBuffer();
reader.closeBlockReader(); {code}
frees the buffer first, but the StripedBlockReader still holds the buffer and 
writes to it.

  was:
When read some lzo files we found some blocks were broken.

I read back all internal blocks(b0-b8) of the block group(RS-6-3-1024k), and 
choose 6(b0-b5) blocks to decode other 3(b6', b7', b8') blocks. And find the 
longest common sequenece(LCS) between b6' and b6(b7'/b7 and b8'/b8).

After selecting 6 blocks of the block group in combinations one time and 
iterating through all cases, I find one case that the length of LCS is the 
block length - 64KB, 64KB is just the length of ByteBuffer used by 
StripedBlockReader. So the corrupt reconstruction block is made by a dirty 
buffer.

The following log snippet is my check program output
{code:java}
decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
Check Block(1) first 131072 bytes longest common substring length 4
Check Block(6) first 131072 bytes longest common substring length 4
Check Block(8) first 131072 bytes longest common substring length 4
decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
Check Block(1) first 

[jira] [Commented] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI

2020-03-26 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067643#comment-17067643
 ] 

Ayush Saxena commented on HDFS-13470:
-

Thanx [~elgoiri] for the update.
v002 LGTM +1

> RBF: Add Browse the Filesystem button to the UI
> ---
>
> Key: HDFS-13470
> URL: https://issues.apache.org/jira/browse/HDFS-13470
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch, 
> HDFS-13470.002.patch
>
>
> After HDFS-12512 added WebHDFS, we can add the support to browse the 
> filesystem to the UI.






[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread Yao Guangdong (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067569#comment-17067569
 ] 

Yao Guangdong commented on HDFS-15240:
--

 [~weichiu] We have the same problem. PTAL.

> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
> Attachments: HDFS-15240.001.patch
>
>
> When reading some lzo files, we found that some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k), 
> chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), and found the 
> longest common sequence (LCS) between b6' and b6 (likewise b7'/b7 and b8'/b8).
> After selecting 6 blocks of the block group per combination and iterating 
> through all cases, I found one case where the length of the LCS is the block 
> length - 64KB; 64KB is exactly the length of the ByteBuffer used by 
> StripedBlockReader. So the corrupt reconstruction block was produced by a 
> dirty buffer.
> The following log snippet is my check program output:
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
> 27197440  # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know a dirty buffer causes the reconstruction block error, but where 
> does the dirty buffer come from?
> After digging into the code and the DN log, I found that the following DN log 
> reveals the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/:52586 
> remote=/:50010]. 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped 
> block: BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834) {code}
> A read from a DN may time out (the read is held by a future F) and emit the 
> INFO log above, but the futures map that contained future F has already been 
> cleared:
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
> StripingChunkReadResult.CANCELLED); {code}
> futures.remove(future) then returns null, causing the NPE, so the EC 
> reconstruction fails. In the finally phase, the code snippet in 
> *getStripedReader().close()* 
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer and 
> writes to it.






[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067559#comment-17067559
 ] 

guojh commented on HDFS-15240:
--

[~surendrasingh] We have the same problem, please take a look.

> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
> Attachments: HDFS-15240.001.patch
>
>
> When reading some lzo files, we found that some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k), 
> chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), and found the 
> longest common sequence (LCS) between b6' and b6 (likewise b7'/b7 and b8'/b8).
> After selecting 6 blocks of the block group per combination and iterating 
> through all cases, I found one case where the length of the LCS is the block 
> length - 64KB; 64KB is exactly the length of the ByteBuffer used by 
> StripedBlockReader. So the corrupt reconstruction block was produced by a 
> dirty buffer.
> The following log snippet is my check program output:
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
> 27197440  # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know a dirty buffer causes the reconstruction block error, but where 
> does the dirty buffer come from?
> After digging into the code and the DN log, I found that the following DN log 
> reveals the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/:52586 
> remote=/:50010]. 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped 
> block: BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834) {code}
> A read from a DN may time out (the read is held by a future F) and emit the 
> INFO log above, but the futures map that contained future F has already been 
> cleared:
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
> StripingChunkReadResult.CANCELLED); {code}
> futures.remove(future) then returns null, causing the NPE, so the EC 
> reconstruction fails. In the finally phase, the code snippet in 
> *getStripedReader().close()* 
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer and 
> writes to it.






[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread HuangTao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067539#comment-17067539
 ] 

HuangTao commented on HDFS-15240:
-

[~weichiu] PTAL, Thanks

> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
> Attachments: HDFS-15240.001.patch
>
>
> When reading some lzo files, we found that some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k), 
> chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), and found the 
> longest common sequence (LCS) between b6' and b6 (likewise b7'/b7 and b8'/b8).
> After selecting 6 blocks of the block group per combination and iterating 
> through all cases, I found one case where the length of the LCS is the block 
> length - 64KB; 64KB is exactly the length of the ByteBuffer used by 
> StripedBlockReader. So the corrupt reconstruction block was produced by a 
> dirty buffer.
> The following log snippet is my check program output:
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
> 27197440  # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know a dirty buffer causes the reconstruction block error, but where 
> does the dirty buffer come from?
> After digging into the code and the DN log, I found that the following DN log 
> reveals the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/:52586 
> remote=/:50010]. 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped 
> block: BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834) {code}
> A read from a DN may time out (the read is held by a future F) and emit the 
> INFO log above, but the futures map that contained future F has already been 
> cleared:
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
> StripingChunkReadResult.CANCELLED); {code}
> futures.remove(future) then returns null, causing the NPE, so the EC 
> reconstruction fails. In the finally phase, the code snippet in 
> *getStripedReader().close()* 
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer and 
> writes to it.
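The null-unboxing mechanism behind that stack trace can be sketched outside Hadoop. This is a hypothetical toy (`FutureMapNpeDemo` and its map are invented names, not the actual StripedBlockUtil code): when the map has already been cleared, `Map.remove` returns null, and auto-unboxing that null Integer, as the StripingChunkReadResult constructor effectively does with an int parameter, throws a NullPointerException.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch (not Hadoop code) of the failure mode described above.
public class FutureMapNpeDemo {
    static final Map<String, Integer> futures = new HashMap<>();

    // Mirrors new StripingChunkReadResult(futures.remove(future), state):
    // assigning to an int forces unboxing of a possibly-null Integer.
    static String removeAndUnbox(String future) {
        try {
            int index = futures.remove(future); // null after clear() -> NPE here
            return "index=" + index;
        } catch (NullPointerException e) {
            return "NPE";
        }
    }

    public static void main(String[] args) {
        futures.put("futureF", 3);
        futures.clear(); // simulates the race: the map is cleared before removal
        System.out.println(removeAndUnbox("futureF")); // prints NPE
    }
}
```

Guarding the removal result (or skipping already-cleared futures) avoids the unboxing entirely.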



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread HuangTao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HuangTao updated HDFS-15240:

Attachment: HDFS-15240.001.patch

> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
> Attachments: HDFS-15240.001.patch
>
>
> When reading some LZO files, we found that some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k), 
> chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), and computed the 
> longest common substring (LCS) between b6' and b6 (likewise b7'/b7 and b8'/b8).
> After iterating through every combination of 6 blocks from the block group, I 
> found one case where the length of the LCS is the block length minus 64KB; 
> 64KB is exactly the length of the ByteBuffer used by StripedBlockReader. So 
> the corrupt reconstructed block was produced by a dirty buffer.
> The following log snippet is the output of my check program:
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
> 27197440  # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
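The check described above can be sketched with a plain longest-common-substring pass over byte arrays. This is a simplified, hypothetical stand-in for the check program (the DP helper and the corruption window below are illustrative, not the reporter's actual code): a block that differs from the original only inside one dirty window yields an LCS close to the block length minus that window.

```java
// Toy reproduction of the LCS-based block comparison described above.
public class LcsCheckDemo {
    // Length of the longest common (contiguous) substring of two byte arrays, O(n*m) DP.
    static int longestCommonSubstring(byte[] a, byte[] b) {
        int best = 0;
        int[] prev = new int[b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            int[] cur = new int[b.length + 1];
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1] == b[j - 1]) {
                    cur[j] = prev[j - 1] + 1;
                    if (cur[j] > best) best = cur[j];
                }
            }
            prev = cur;
        }
        return best;
    }

    static int demo() {
        byte[] original = new byte[256];
        for (int i = 0; i < original.length; i++) original[i] = (byte) (i % 7 + 1);
        byte[] decoded = original.clone();
        // Simulate a dirty buffer: corrupt one 16-byte window (like the 64KB window above).
        for (int i = 100; i < 116; i++) decoded[i] = 0;
        // LCS is the clean suffix after the window: 256 - 116 = 140.
        return longestCommonSubstring(original, decoded);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 140
    }
}
```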
> Now I know the dirty buffer causes the reconstruction block error, but where 
> does the dirty buffer come from?
> After digging into the code and the DN log, I found that the following DN log 
> reveals the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/:52586 
> remote=/:50010]. 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped 
> block: BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834) {code}
> A read from a DN may time out (tracked by a future F) and emit the INFO log 
> above, but the futures map that contained the future F has already been cleared, so
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
> StripingChunkReadResult.CANCELLED); {code}
> futures.remove(future) returns null, which causes the NPE, so the EC 
> reconstruction fails. In the finally phase, the code snippet in *getStripedReader().close()* 
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer and 
> writes to it.
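The free-before-close ordering hazard can be sketched with a toy buffer pool (a hypothetical `take`/`free` pool, not the Hadoop buffer pool): if the read buffer is returned to the pool before the reader is closed, the next taker receives the very same buffer the old reader can still write into, which is exactly how a "dirty" reconstruction buffer arises.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Toy sketch of freeing a read buffer while a reader still holds it.
public class BufferFreeOrderDemo {
    static final ArrayDeque<ByteBuffer> pool = new ArrayDeque<>();

    static ByteBuffer take() { return pool.isEmpty() ? ByteBuffer.allocate(8) : pool.pop(); }
    static void free(ByteBuffer b) { pool.push(b); }

    static boolean demo() {
        ByteBuffer readBuf = take(); // held by a (simulated) still-open reader
        free(readBuf);               // bug: freed while the reader still holds it
        ByteBuffer next = take();    // next user gets the SAME buffer back
        readBuf.put(0, (byte) 42);   // late write by the old reader dirties it
        return next == readBuf;      // true: the two users alias one buffer
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints true
    }
}
```

Closing the reader before freeing its buffer closes this window, since no in-flight write can land after the buffer is recycled.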






[jira] [Created] (HDFS-15243) Child directory should not be deleted or renamed if parent directory is a protected directory

2020-03-26 Thread liuyanyu (Jira)
liuyanyu created HDFS-15243:
---

 Summary: Child directory should not be deleted or renamed if 
parent directory is a protected directory
 Key: HDFS-15243
 URL: https://issues.apache.org/jira/browse/HDFS-15243
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: 3.1.1
Affects Versions: 3.1.1
Reporter: liuyanyu


HDFS-8983 added fs.protected.directories to support protected directories on 
the NameNode. But as I tested, when a parent directory (e.g. /testA) is set as 
a protected directory, a child directory (e.g. /testA/testB) can still be 
deleted or renamed. Since we protect a directory mainly to protect the data 
under it, I think a child directory should not be deleted or renamed if its 
parent directory is a protected directory.
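The proposed behavior amounts to an ancestor check against the protected set. A minimal sketch, assuming hypothetical names (`isProtectedOrInsideProtected` is invented for illustration, not the actual NameNode code):

```java
import java.util.Set;

// Toy sketch of checking a path and all of its ancestors against
// the fs.protected.directories set before allowing delete/rename.
public class ProtectedDirCheckDemo {
    // Returns true if 'path' or any ancestor of it is in the protected set.
    static boolean isProtectedOrInsideProtected(Set<String> protectedDirs, String path) {
        for (String p = path; !p.isEmpty(); ) {
            if (protectedDirs.contains(p)) return true;
            int slash = p.lastIndexOf('/');
            if (slash <= 0) break; // reached the root or a relative path head
            p = p.substring(0, slash); // walk up one level
        }
        return protectedDirs.contains("/");
    }

    public static void main(String[] args) {
        Set<String> prot = Set.of("/testA");
        System.out.println(isProtectedOrInsideProtected(prot, "/testA/testB")); // true
        System.out.println(isProtectedOrInsideProtected(prot, "/other"));       // false
    }
}
```

With such a check, deleting /testA/testB would be rejected because its ancestor /testA is protected.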






[jira] [Updated] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread HuangTao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HuangTao updated HDFS-15240:

Description: 
When reading some LZO files, we found that some blocks were broken.

I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k), 
chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), and computed the 
longest common substring (LCS) between b6' and b6 (likewise b7'/b7 and b8'/b8).

After iterating through every combination of 6 blocks from the block group, I 
found one case where the length of the LCS is the block length minus 64KB; 
64KB is exactly the length of the ByteBuffer used by StripedBlockReader. So 
the corrupt reconstructed block was produced by a dirty buffer.

The following log snippet is the output of my check program:
{code:java}
decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
Check Block(1) first 131072 bytes longest common substring length 4
Check Block(6) first 131072 bytes longest common substring length 4
Check Block(8) first 131072 bytes longest common substring length 4
decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
Check Block(1) first 131072 bytes longest common substring length 65536
CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
27197440  # this one
Check Block(7) first 131072 bytes longest common substring length 4
Check Block(8) first 131072 bytes longest common substring length 4{code}
Now I know the dirty buffer causes the reconstruction block error, but where 
does the dirty buffer come from?

After digging into the code and the DN log, I found that the following DN log 
reveals the root cause.
{code:java}
[INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
java.nio.channels.SocketChannel[connected local=/:52586 
remote=/:50010]. 18 millis timeout left.
[WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped block: 
BP-714356632--1519726836856:blk_-YY_3472979393
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834) {code}
A read from a DN may time out (tracked by a future F) and emit the INFO log 
above, but the futures map that contained the future F has already been cleared, so
{code:java}
return new StripingChunkReadResult(futures.remove(future),
StripingChunkReadResult.CANCELLED); {code}
futures.remove(future) returns null, which causes the NPE, so the EC 
reconstruction fails. In the finally phase, the code snippet in *getStripedReader().close()* 
{code:java}
reconstructor.freeBuffer(reader.getReadBuffer());
reader.freeReadBuffer();
reader.closeBlockReader(); {code}
frees the buffer first, but the StripedBlockReader still holds the buffer and 
writes to it.
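The underlying timing hazard, a timed-out read that is abandoned but never cancelled, can be sketched with plain java.util.concurrent (a deterministic toy, not the Hadoop striped reader; the latch and shared array are invented for illustration):

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Toy sketch: a poll with a timeout gives up on a read, yet the task keeps
// running and later writes into the buffer the caller has already abandoned.
public class TimeoutRaceDemo {
    static byte demo() throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        CompletionService<Integer> cs = new ExecutorCompletionService<>(exec);
        CountDownLatch go = new CountDownLatch(1);
        byte[] sharedBuf = new byte[1]; // stands in for the striped read buffer

        cs.submit(() -> { go.await(); sharedBuf[0] = 1; return 0; }); // slow "DN read"

        // The poll times out: the caller treats the read as cancelled ...
        Future<Integer> done = cs.poll(50, TimeUnit.MILLISECONDS);
        assert done == null; // nothing completed within the timeout

        // ... but the task is NOT cancelled; it finishes and writes the buffer late.
        go.countDown();
        exec.shutdown();
        exec.awaitTermination(5, TimeUnit.SECONDS);
        return sharedBuf[0]; // 1: a dirty write after the caller moved on
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints 1
    }
}
```

Either cancelling the future on timeout or not recycling the buffer until the task has truly terminated would remove the late write.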

  was:
When read some lzo files we found some blocks were broken.

I read back all internal blocks of the block group(RS-6-3-1024k), and choose 6 
blocks to decode other 3 block


> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
>
> When reading some LZO files, we found that some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k), 
> chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), and computed the 
> longest common substring (LCS) between b6' and b6 (likewise b7'/b7 and b8'/b8).
> After iterating through every combination of 6 blocks from the block group, I 
> found one case where the length of the LCS is the block length minus 64KB; 
> 64KB is exactly the length of the ByteBuffer used by StripedBlockReader. So 
> the corrupt reconstructed block was produced by a dirty buffer.
> The following log snippet is the output of my check program:
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 

[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-26 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067520#comment-17067520
 ] 

Steven Rand commented on HDFS-15191:


Hi [~vagarychen],

I'm wondering if you can take a look at the patch, or if you know someone else 
who could?

Also, I wonder whether this is important enough to be part of the 3.3.0 
release? As far as I can tell, it breaks backcompat with 2.x clusters where 
SASL is being used, which seems like a significant regression to me.

Thanks,
Steve

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch, HDFS-15191.004.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.
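The EOFException at the top of that stack trace is the standard behavior of DataInputStream when the legacy reader consumes past the end of the serialized token. A minimal sketch (an empty byte array stands in for a truncated legacy buffer; this is an illustration, not the BlockTokenIdentifier code):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Toy sketch: reading a field from an exhausted (truncated) buffer throws
// EOFException from DataInputStream.readByte, as in the stack trace above.
public class LegacyEofDemo {
    static String readPastEnd() {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(new byte[0]));
        try {
            in.readByte(); // WritableUtils.readVLong starts with exactly this call
            return "read ok";
        } catch (EOFException e) {
            return "EOF";
        } catch (IOException e) {
            return "other IO error";
        }
    }

    public static void main(String[] args) {
        System.out.println(readPastEnd()); // prints EOF
    }
}
```

So the symptom points at the legacy parser expecting more fields than the 2.x-serialized token actually contains.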






[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-26 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067482#comment-17067482
 ] 

Hadoop QA commented on HDFS-15242:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
46s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
20s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
20m 13s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
38s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
20s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 15m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 39s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  9m 14s{color} 
| {color:red} hadoop-common in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}108m 52s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
48s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}230m  3s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.security.TestFixKerberosTicketOrder |
|   | hadoop.security.TestRaceWhenRelogin |
|   | hadoop.hdfs.server.namenode.ha.TestBootstrapAliasmap |
|   | hadoop.hdfs.server.balancer.TestBalancer |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 |
| JIRA Issue | HDFS-15242 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12997785/HDFS-15242.002.patch |
| Optional Tests |  dupname  asflicense  mvnsite  compile  javac  javadoc  
mvninstall  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux dd0468cedec7 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0fa7bf4 |
| maven | version: Apache Maven 3.3.9 

[jira] [Commented] (HDFS-12733) Option to disable to namenode local edits

2020-03-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067442#comment-17067442
 ] 

Xiaoqiao He commented on HDFS-12733:


Thanks [~elgoiri],[~ayushtkn] for your great feedback. v008 does rely on 
setting `dfs.namenode.edits.dir` to blank to disable local edits. Yes, we need 
more information to make this change clear. It is true that their actions are 
different; IMO it is feasible to unify them so that we disable local edits when 
the config is blank and tell our end users explicitly. Of course, patch v008 is 
not the complete story yet. I would like to update the patch if there are no 
objections. Thanks again [~elgoiri],[~ayushtkn] for your suggestions.

> Option to disable to namenode local edits
> -
>
> Key: HDFS-12733
> URL: https://issues.apache.org/jira/browse/HDFS-12733
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, performance
>Reporter: Brahma Reddy Battula
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-12733-001.patch, HDFS-12733-002.patch, 
> HDFS-12733-003.patch, HDFS-12733.004.patch, HDFS-12733.005.patch, 
> HDFS-12733.006.patch, HDFS-12733.007.patch, HDFS-12733.008.patch
>
>
> As of now, edits are written to both the local and shared locations, which is 
> redundant; the local edits are never used in an HA setup.
> Disabling local edits gives a small performance improvement.


