[
https://issues.apache.org/jira/browse/HDFS-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Field updated HDFS-10788:
------------------------------
Environment: CDH5.5.2, CentOS 6.7 (was: 2.6.0-cdh5.5.2 is the HDFS version
from the CDH build this cluster is running.)
> fsck NullPointerException when it encounters corrupt replicas
> -------------------------------------------------------------
>
> Key: HDFS-10788
> URL: https://issues.apache.org/jira/browse/HDFS-10788
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.0
> Environment: CDH5.5.2, CentOS 6.7
> Reporter: Jeff Field
>
> Somehow (I haven't found the root cause yet) we ended up with blocks that
> have corrupt replicas where the corrupt-replica count is inconsistent between
> the block map and the corrupt replicas map. If we run hdfs fsck on any parent
> directory that has a child with one of these blocks, fsck exits with
> something like this:
> {code}
> $ hdfs fsck /path/to/parent/dir/ | egrep -v '^\.+$'
> Connecting to namenode via http://mynamenode:50070
> FSCK started by bot-hadoop (auth:KERBEROS_SSL) from /10.97.132.43 for path /path/to/parent/dir/ at Tue Aug 23 20:34:58 UTC 2016
> .........................................................................FSCK ended at Tue Aug 23 20:34:59 UTC 2016 in 1098 milliseconds
> null
> Fsck on path '/path/to/parent/dir/' FAILED
> {code}
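> Since the block map and the corrupt replicas map disagree, my guess at how
> the NPE arises, written as a standalone sketch (to be clear: this is not the
> actual BlockManager/fsck code, just an illustration of the class of bug;
> block and datanode names are made up):
> {code}
> import java.util.HashMap;
> import java.util.Map;
>
> // Two views of the same block disagree about how many corrupt replicas
> // exist; an array sized from one view but filled from the other keeps a
> // null slot that blows up when it is dereferenced later.
> public class InconsistentReplicaMaps {
>   public static void main(String[] args) {
>     String blk = "blk_1335893388_1100036319546";
>
>     // The block map believes this block has no corrupt replicas...
>     Map<String, Integer> blockMapCorruptCount = new HashMap<>();
>     blockMapCorruptCount.put(blk, 0);
>
>     // ...but the corrupt replicas map still holds an entry for it.
>     Map<String, String> corruptReplicasMap = new HashMap<>();
>     corruptReplicasMap.put(blk, "datanode-42");
>
>     int fromBlockMap = blockMapCorruptCount.get(blk);                  // 0
>     int fromCorruptMap = corruptReplicasMap.containsKey(blk) ? 1 : 0;  // 1
>     if (fromBlockMap != fromCorruptMap) {
>       System.err.println("Inconsistent number of corrupt replicas for " + blk
>           + " blockMap has " + fromBlockMap
>           + " but corrupt replicas map has " + fromCorruptMap);
>     }
>
>     // Sized from one view, filled from the other: slot 0 stays null.
>     String[] corruptNodes = new String[fromCorruptMap];
>     for (int i = 0; i < fromBlockMap; i++) {  // loops zero times
>       corruptNodes[i] = "datanode-" + i;
>     }
>     corruptNodes[0].length();  // java.lang.NullPointerException
>   }
> }
> {code}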
> So I start at the top, fsck-ing every subdirectory until I find one or more
> that fails, then repeat with those directories (our top-level directories all
> contain date subdirectories, which in turn contain the files). Once I find a
> directory with files in it, I checksum every file in that directory. That
> doesn't give me the name of the bad file either; instead I get:
> checksum: java.lang.NullPointerException
> But since the files are checksummed in order, I can identify the bad file by
> seeing which name printed before the NPE. Once I get to this point, I see
> the following in the namenode log when I try to checksum the corrupt file:
> 2016-08-23 20:24:59,627 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1335893388_1100036319546 blockMap has 0 but corrupt replicas map has 1
> 2016-08-23 20:24:59,627 WARN org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 192.168.1.100:47785 Call#1 Retry#0
> java.lang.NullPointerException
> At that point I can delete the file, but it is a very tedious process.
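> To make the hunt less manual, something along these lines against the
> FileSystem API should print the offending paths directly (a rough, untested
> sketch; the class name and error handling are mine):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Recursively checksums every file under the given path and prints the
> // paths whose checksum call fails, i.e. the names fsck's "null" hides.
> public class FindCorruptFiles {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     walk(fs, new Path(args[0]));
>   }
>
>   private static void walk(FileSystem fs, Path dir) throws Exception {
>     for (FileStatus stat : fs.listStatus(dir)) {
>       if (stat.isDirectory()) {
>         walk(fs, stat.getPath());
>       } else {
>         try {
>           fs.getFileChecksum(stat.getPath());  // triggers getBlockLocations
>         } catch (Exception e) {
>           // the namenode-side NullPointerException surfaces here
>           System.out.println("checksum failed: " + stat.getPath() + " (" + e + ")");
>         }
>       }
>     }
>   }
> }
> {code}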
> Ideally, shouldn't fsck be able to emit the name of the file that is the
> source of the problem, and (if -delete is specified) get rid of the file,
> instead of exiting without saying why?