[ https://issues.apache.org/jira/browse/HDFS-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yongjun Zhang resolved HDFS-10788.
----------------------------------
    Resolution: Duplicate

Thanks guys, I'm marking it as duplicate of HDFS-9958.

> fsck NullPointerException when it encounters corrupt replicas
> -------------------------------------------------------------
>
>                 Key: HDFS-10788
>                 URL: https://issues.apache.org/jira/browse/HDFS-10788
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: CDH5.5.2, CentOS 6.7
>            Reporter: Jeff Field
>
> Somehow (I haven't found root cause yet) we ended up with blocks that have
> corrupt replicas where the replica count is inconsistent between the blockmap
> and the corrupt replicas map. If we try to hdfs fsck any parent directory
> that has a child with one of these blocks, fsck will exit with something like
> this:
> {code}
> $ hdfs fsck /path/to/parent/dir/ | egrep -v '^\.+$'
> Connecting to namenode via http://mynamenode:50070
> FSCK started by bot-hadoop (auth:KERBEROS_SSL) from /10.97.132.43 for path /path/to/parent/dir/ at Tue Aug 23 20:34:58 UTC 2016
> .........................................................................FSCK ended at Tue Aug 23 20:34:59 UTC 2016 in 1098 milliseconds
> null
> Fsck on path '/path/to/parent/dir/' FAILED
> {code}
> So I start at the top, fscking every subdirectory until I find one or more
> that fails. Then I do the same thing with those directories (our top level
> directories all have subdirectories with date directories in them, which then
> contain the files) and once I find a directory with files in it, I run a
> checksum of the files in that directory. When I do that, I don't get the name
> of the file, rather I get:
> checksum: java.lang.NullPointerException
> but since the files are in order, I can figure it out by seeing which file
> was before the NPE. Once I get to this point, I can see the following in the
> namenode log when I try to checksum the corrupt file:
> 2016-08-23 20:24:59,627 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1335893388_1100036319546 blockMap has 0 but corrupt replicas map has 1
> 2016-08-23 20:24:59,627 WARN org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 192.168.1.100:47785 Call#1 Retry#0
> java.lang.NullPointerException
> At which point I can delete the file, but it is a very tedious process.
> Ideally, shouldn't fsck be able to emit the name of the file that is the
> source of the problem - and (if -delete is specified) get rid of the file,
> instead of exiting without saying why?
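
For anyone hitting this before a fix is deployed, the manual narrowing described in the report (fsck a directory, then fsck each of its subdirectories until the failing one is found) could be scripted roughly as below. This is only a sketch under stated assumptions, not something taken from the issue: the function names are made up for illustration, failure is detected by matching the "Fsck on path ... FAILED" line shown in the report, and subdirectories are read from the usual eight-column `hdfs dfs -ls` listing. It has not been run against a cluster in the state described.

```python
import subprocess
from typing import Callable, List


def is_fsck_failure(output: str) -> bool:
    """True if the output contains the "Fsck on path ... FAILED" line from the report."""
    return any(
        line.startswith("Fsck on path") and line.rstrip().endswith("FAILED")
        for line in output.splitlines()
    )


def parse_ls_dirs(output: str) -> List[str]:
    """Extract directory paths from `hdfs dfs -ls` output.

    Assumes the usual eight columns: permissions, replication, owner,
    group, size, date, time, path.
    """
    dirs = []
    for line in output.splitlines():
        fields = line.split(None, 7)
        # Directory entries have 'd' as the first permission character.
        if len(fields) == 8 and fields[0].startswith("d"):
            dirs.append(fields[7])
    return dirs


def fsck_fails(path: str) -> bool:
    """Run `hdfs fsck PATH` and report whether it printed the FAILED line."""
    result = subprocess.run(
        ["hdfs", "fsck", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return is_fsck_failure(result.stdout)


def list_subdirs(path: str) -> List[str]:
    """List the immediate child directories of PATH."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", path],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_ls_dirs(result.stdout)


def find_failing_dirs(
    path: str,
    fails: Callable[[str], bool] = fsck_fails,
    children: Callable[[str], List[str]] = list_subdirs,
) -> List[str]:
    """Return the deepest directories under PATH for which fsck fails.

    Limitation: if a directory has a failing subdirectory and also holds
    a bad file directly, only the subdirectory is reported.
    """
    if not fails(path):
        return []
    failing = []
    for child in children(path):
        failing.extend(find_failing_dirs(child, fails, children))
    # No failing subdirectory: the problem file sits directly in this one.
    return failing or [path]


if __name__ == "__main__":
    import sys

    for directory in find_failing_dirs(sys.argv[1]):
        print(directory)
```

Run it with the top-level path as the only argument; it prints the directories that still need the per-file checksum pass described in the report. The `fails` and `children` parameters exist only so the traversal can be exercised without a cluster.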