[ https://issues.apache.org/jira/browse/HDFS-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang updated HDFS-16161:
-----------------------------------
Description:
One of our users reported this error in the log:
{noformat}
2021-07-30 09:51:27,509 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: an02nphda5777.npa.bfsiplatform.com:1004:DataXceiver error processing READ_BLOCK operation src: /10.30.10.68:35680 dst: /10.30.10.67:1004
java.lang.IllegalArgumentException: id=-46 out of range [0, 5)
	at org.apache.hadoop.util.DataChecksum$Type.valueOf(DataChecksum.java:76)
	at org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:167)
{noformat}
Analysis:
It looks like the first few bytes of the checksum were bad. Those bytes
determine the checksum type (CRC32, CRC32C, etc.). However, the block was never
reported to the NameNode and removed.
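A minimal sketch of how a single bad header byte could yield the id in the log.
It assumes the type id is read as a signed Java byte and that there are five
valid type ids, matching the range [0, 5) in the message. The class and method
names are hypothetical and are not Hadoop APIs.

```java
// Illustrative only: shows how one raw byte becomes a negative type id.
final class SignedTypeByteSketch {
    // Assumption: five valid type ids, taken from "[0, 5)" in the log message.
    static final int NUM_TYPES = 5;

    // Java bytes are signed, so raw values 0x80..0xFF become negative ids.
    static int asSignedId(int rawByte) {
        return (byte) rawByte;
    }

    // Same range check as the one implied by the exception message.
    static boolean inRange(int id) {
        return id >= 0 && id < NUM_TYPES;
    }

    public static void main(String[] args) {
        int id = asSignedId(0xD2);
        System.out.println("id=" + id + " valid=" + inRange(id));
    }
}
```

Under these assumptions a raw byte of 0xD2 is read as -46, which fails the
range check.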
If the DN throws an IOException while reading a block, it starts another thread
to scan the block. If the block is indeed bad, it tells the NN that it has a bad
block. But this is an IllegalArgumentException, which is a RuntimeException and
not an IOException, so it is not handled that way.
This is a bug in the error handling code, which should be made more graceful.
Suggestion: catch the IllegalArgumentException in
BlockMetadataHeader.preadHeader() and throw a CorruptMetaHeaderException, so
that the DN catches the exception and performs the regular block scan check.
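The suggested change could look roughly like the sketch below. It is not the
actual patch: ChecksumType and the nested CorruptMetaHeaderException are
self-contained stand-ins defined here for illustration, and readChecksumType()
is a hypothetical helper rather than the real BlockMetadataHeader.preadHeader().
The point is only that the unchecked exception is converted into an IOException
subtype, which the existing IOException handling can then act on.

```java
import java.io.IOException;

// Illustrative only: wrap the unchecked exception in an IOException subtype.
final class MetaHeaderSketch {

    // Stand-in for the checksum type enum; five placeholder values to match
    // the range [0, 5) in the log.
    enum ChecksumType {
        NULL, CRC32, CRC32C, DEFAULT, MIXED;

        // Mirrors the failure in the stack trace: a bad id throws an
        // unchecked IllegalArgumentException.
        static ChecksumType valueOf(int id) {
            if (id < 0 || id >= values().length) {
                throw new IllegalArgumentException(
                    "id=" + id + " out of range [0, " + values().length + ")");
            }
            return values()[id];
        }
    }

    // Hypothetical stand-in; what matters is that it extends IOException.
    static class CorruptMetaHeaderException extends IOException {
        CorruptMetaHeaderException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    // Hypothetical helper: converts the unchecked exception so that callers
    // handling IOException also see a corrupt header.
    static ChecksumType readChecksumType(byte typeByte)
            throws CorruptMetaHeaderException {
        try {
            return ChecksumType.valueOf(typeByte);
        } catch (IllegalArgumentException e) {
            throw new CorruptMetaHeaderException(
                "The block meta file header is corrupt", e);
        }
    }

    public static void main(String[] args) {
        try {
            readChecksumType((byte) -46);
        } catch (IOException e) {
            // A caller that only handles IOException now sees the failure.
            System.out.println("Caught as IOException: " + e.getMessage());
        }
    }
}
```

Keeping the original IllegalArgumentException as the cause preserves the
"id=-46 out of range" detail for debugging.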
was:
One of our users reported this error in the log:
{noformat}
2021-07-30 09:51:27,509 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: an02nphda5777.npa.bfsiplatform.com:1004:DataXceiver error processing READ_BLOCK operation src: /10.30.10.68:35680 dst: /10.30.10.67:1004
java.lang.IllegalArgumentException: id=-46 out of range [0, 5)
	at org.apache.hadoop.util.DataChecksum$Type.valueOf(DataChecksum.java:76)
	at org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:167)
{noformat}
Analysis:
It looks like the first few bytes of the checksum were bad. Those bytes
determine the checksum type (CRC32, CRC32C, etc.).
If the DN throws an IOException while reading a block, it starts another thread
to scan the block. If the block is indeed bad, it tells the NN that it has a bad
block. But this is an IllegalArgumentException, which is a RuntimeException and
not an IOException, so it is not handled that way.
This is a bug in the error handling code, which should be made more graceful.
Suggestion: catch the IllegalArgumentException in
BlockMetadataHeader.preadHeader() and throw a CorruptMetaHeaderException, so
that the DN catches the exception and performs the regular block scan check.
> Corrupt block checksum is not reported to NameNode
> --------------------------------------------------
>
> Key: HDFS-16161
> URL: https://issues.apache.org/jira/browse/HDFS-16161
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Wei-Chiu Chuang
> Priority: Major