[
https://issues.apache.org/jira/browse/HDFS-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Field updated HDFS-10788:
------------------------------
Environment: CDH5.5.2, CentOS 6.7 (was: 2.6.0-cdh5.5.2 is the HDFS version
from the CDH build this cluster is running.)
> fsck NullPointerException when it encounters corrupt replicas
> -------------------------------------------------------------
>
> Key: HDFS-10788
> URL: https://issues.apache.org/jira/browse/HDFS-10788
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.0
> Environment: CDH5.5.2, CentOS 6.7
> Reporter: Jeff Field
>
> Somehow (I haven't found the root cause yet) we ended up with blocks that
> have corrupt replicas where the corrupt-replica count is inconsistent between
> the block map and the corrupt replicas map. If we run hdfs fsck on any parent
> directory that has a child with one of these blocks, fsck exits with
> something like this:
> {code}
> $ hdfs fsck /path/to/parent/dir/ | egrep -v '^\.+$'
> Connecting to namenode via http://mynamenode:50070
> FSCK started by bot-hadoop (auth:KERBEROS_SSL) from /10.97.132.43 for path /path/to/parent/dir/ at Tue Aug 23 20:34:58 UTC 2016
> .........................................................................FSCK ended at Tue Aug 23 20:34:59 UTC 2016 in 1098 milliseconds
> null
> Fsck on path '/path/to/parent/dir/' FAILED
> {code}
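> Since the block map and the corrupt replicas map disagree, my guess at how
> the NPE arises, written as a standalone sketch (to be clear: this is not the
> actual BlockManager/fsck code, just an illustration of the class of bug;
> block and datanode names are made up):
> {code}
> import java.util.HashMap;
> import java.util.Map;
>
> // Two views of the same block disagree about how many corrupt replicas
> // exist; an array sized from one view but filled from the other keeps a
> // null slot that blows up when it is dereferenced later.
> public class InconsistentReplicaMaps {
>   public static void main(String[] args) {
>     String blk = "blk_1335893388_1100036319546";
>
>     // The block map believes this block has no corrupt replicas...
>     Map<String, Integer> blockMapCorruptCount = new HashMap<>();
>     blockMapCorruptCount.put(blk, 0);
>
>     // ...but the corrupt replicas map still holds an entry for it.
>     Map<String, String> corruptReplicasMap = new HashMap<>();
>     corruptReplicasMap.put(blk, "datanode-42");
>
>     int fromBlockMap = blockMapCorruptCount.get(blk);                  // 0
>     int fromCorruptMap = corruptReplicasMap.containsKey(blk) ? 1 : 0;  // 1
>     if (fromBlockMap != fromCorruptMap) {
>       System.err.println("Inconsistent number of corrupt replicas for " + blk
>           + " blockMap has " + fromBlockMap
>           + " but corrupt replicas map has " + fromCorruptMap);
>     }
>
>     // Sized from one view, filled from the other: slot 0 stays null.
>     String[] corruptNodes = new String[fromCorruptMap];
>     for (int i = 0; i < fromBlockMap; i++) {  // loops zero times
>       corruptNodes[i] = "datanode-" + i;
>     }
>     corruptNodes[0].length();  // java.lang.NullPointerException
>   }
> }
> {code}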
> So I start at the top, fsck-ing every subdirectory until I find one or more
> that fails, then repeat with those directories (our top-level directories all
> contain date subdirectories, which in turn contain the files). Once I find a
> directory with files in it, I checksum every file in that directory. That
> doesn't give me the name of the bad file either; instead I get:
> checksum: java.lang.NullPointerException
> But since the files are checksummed in order, I can identify the bad file by
> seeing which name printed before the NPE. Once I get to this point, I see
> the following in the namenode log when I try to checksum the corrupt file:
> 2016-08-23 20:24:59,627 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1335893388_1100036319546 blockMap has 0 but corrupt replicas map has 1
> 2016-08-23 20:24:59,627 WARN org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 192.168.1.100:47785 Call#1 Retry#0
> java.lang.NullPointerException
> At that point I can delete the file, but it is a very tedious process.
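> To make the hunt less manual, something along these lines against the
> FileSystem API should print the offending paths directly (a rough, untested
> sketch; the class name and error handling are mine):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Recursively checksums every file under the given path and prints the
> // paths whose checksum call fails, i.e. the names fsck's "null" hides.
> public class FindCorruptFiles {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     walk(fs, new Path(args[0]));
>   }
>
>   private static void walk(FileSystem fs, Path dir) throws Exception {
>     for (FileStatus stat : fs.listStatus(dir)) {
>       if (stat.isDirectory()) {
>         walk(fs, stat.getPath());
>       } else {
>         try {
>           fs.getFileChecksum(stat.getPath());  // triggers getBlockLocations
>         } catch (Exception e) {
>           // the namenode-side NullPointerException surfaces here
>           System.out.println("checksum failed: " + stat.getPath() + " (" + e + ")");
>         }
>       }
>     }
>   }
> }
> {code}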
> Ideally, shouldn't fsck be able to emit the name of the file that is the
> source of the problem, and (if -delete is specified) get rid of the file,
> instead of exiting without saying why?