Jeff Field created HDFS-10788:
---------------------------------

             Summary: fsck NullPointerException when it encounters corrupt replicas
                 Key: HDFS-10788
                 URL: https://issues.apache.org/jira/browse/HDFS-10788
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.6.0
         Environment: 2.6.0-cdh5.5.2 is the HDFS version from the CDH build this cluster is running.
            Reporter: Jeff Field


Somehow (I haven't found the root cause yet) we ended up with blocks that have corrupt replicas where the replica count is inconsistent between the block map and the corrupt replicas map. If we run hdfs fsck on any parent directory that has a child with one of these blocks, fsck exits with something like this:

{code}
$ hdfs fsck /path/to/parent/dir/ | egrep -v '^\.+$'
Connecting to namenode via http://mynamenode:50070
FSCK started by bot-hadoop (auth:KERBEROS_SSL) from /10.97.132.43 for path /path/to/parent/dir/ at Tue Aug 23 20:34:58 UTC 2016
.........................................................................FSCK ended at Tue Aug 23 20:34:59 UTC 2016 in 1098 milliseconds
null

Fsck on path '/path/to/parent/dir/' FAILED
{code}

So I start at the top, running fsck on each subdirectory until I find one or more that fail. Then I repeat the process on those directories (our top-level directories all contain subdirectories of date directories, which in turn contain the files). Once I find a directory with files in it, I run a checksum on each file in that directory. When I do that, I don't get the name of the offending file; instead I get:
{code}
checksum: java.lang.NullPointerException
{code}

but since the files are checked in order, I can work out which one it is by looking at the file listed just before the NPE. Once I've identified it, I can see the following in the namenode log when I try to checksum the corrupt file:

{code}
2016-08-23 20:24:59,627 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1335893388_1100036319546 blockMap has 0 but corrupt replicas map has 1
2016-08-23 20:24:59,627 WARN org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 192.168.1.100:47785 Call#1 Retry#0
java.lang.NullPointerException
{code}

At which point I can delete the file, but it is a very tedious process.
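The warning hints at the mechanism: one bookkeeping structure says the block has no corrupt replicas while the other still tracks one, and code that assumes the two agree ends up dereferencing a null. Below is a minimal toy sketch of that failure mode, with invented names (replicaNodes, a plain HashMap for corruptReplicasMap); it is not the actual BlockManager source, just an illustration of how "blockMap has 0 but corrupt replicas map has 1" can turn into a bare NPE:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy illustration only -- not HDFS source. The result array is sized from
// one structure's corrupt count but filled while consulting the other, which
// leaves a null slot that later code dereferences.
public class CorruptReplicaMismatch {

    public static void main(String[] args) {
        long blockId = 1335893388L;

        // Datanodes holding replicas of the block, according to the block map.
        List<String> replicaNodes = new ArrayList<>(List.of("dn-1", "dn-2", "dn-3"));

        // The block map believes none of these replicas are corrupt...
        int corruptPerBlockMap = 0;

        // ...but the corrupt-replicas map still has an entry for dn-2.
        Map<Long, Set<String>> corruptReplicasMap = new HashMap<>();
        corruptReplicasMap.put(blockId, new HashSet<>(Set.of("dn-2")));

        // Size the located-block answer from the block map's view:
        // 3 replicas minus 0 corrupt = 3 slots.
        String[] machines = new String[replicaNodes.size() - corruptPerBlockMap];

        // Fill it while consulting the corrupt-replicas map, which skips dn-2,
        // so machines[2] is left null.
        int i = 0;
        for (String dn : replicaNodes) {
            if (!corruptReplicasMap.get(blockId).contains(dn)) {
                machines[i++] = dn;
            }
        }

        // Downstream code assumes every slot is populated -> NullPointerException.
        for (String dn : machines) {
            System.out.println("replica on " + dn.toUpperCase());
        }
    }
}
{code}

Running it prints the first two replicas and then throws the same bare NullPointerException the RPC handler logs.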

Ideally, shouldn't fsck be able to report the name of the file that is the source of the problem, and (if -delete is specified) get rid of the file, instead of exiting without saying why?
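To make the suggestion concrete, here is a rough sketch of the behaviour being asked for, using a hypothetical PathChecker interface rather than the real NamenodeFsck code: catch the per-file failure, name the path, honour -delete if asked, and keep going instead of failing the whole run with "null":

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustration of the requested behaviour only -- hypothetical helpers, not
// the real NamenodeFsck implementation.
public class FsckSketch {

    interface PathChecker {
        void checkPath(String path) throws Exception;  // may throw NPE on bad block state
    }

    static List<String> fsck(List<String> paths, PathChecker checker, boolean delete) {
        List<String> broken = new ArrayList<>();
        for (String path : paths) {
            try {
                checker.checkPath(path);
            } catch (Exception e) {
                // Name the file that triggered the failure instead of dying with "null".
                System.err.println("fsck failed on " + path + ": " + e);
                broken.add(path);
                if (delete) {
                    System.err.println("would delete " + path + " (-delete specified)");
                }
            }
        }
        return broken;
    }

    public static void main(String[] args) {
        // Simulate the corrupt-replica lookup blowing up on the second file.
        PathChecker npeOnSecond = path -> {
            if (path.endsWith("part-00001")) {
                throw new NullPointerException();
            }
        };
        List<String> bad = fsck(
            List.of("/data/2016-08-23/part-00000", "/data/2016-08-23/part-00001"),
            npeOnSecond, false);
        System.out.println("files that could not be checked: " + bad);
    }
}
{code}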


