[ https://issues.apache.org/jira/browse/HDFS-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905952#action_12905952 ]

Koji Noguchi commented on HDFS-1371:
------------------------------------

(you guys are too fast.  I wanted the description to be short and was going to 
paste the logs afterwards... )

Picking one such file: /myfile/part-00145.gz blk_-1426587446408804113_970819282

Namenode log showing
{noformat}
2010-08-31 10:47:56,258 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_-1426587446408804113 added as corrupt 
on ZZ.YY.XX..220:1004 by /ZZ.YY.XX.246
2010-08-31 10:47:56,290 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_-1426587446408804113 added as corrupt 
on ZZ.YY.XX..252:1004 by /ZZ.YY.XX.246
2010-08-31 10:47:56,489 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_-1426587446408804113 added as corrupt 
on ZZ.YY.XX..107:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:00,508 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: duplicate requested for 
blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:00,554 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: duplicate requested for 
blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:03,934 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: duplicate requested for 
blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.220:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:03,949 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: duplicate requested for 
blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:03,971 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: duplicate requested for 
blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:07,986 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: duplicate requested for 
blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:08,257 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: duplicate requested for 
blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.220:1004 by /ZZ.YY.XX.246
2010-08-31 10:49:08,895 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: duplicate requested for 
blk_-1426587446408804113 to add as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246
{noformat}
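For context, the namenode behavior in the log above can be sketched roughly like this. This is a minimal Python simulation of the idea, not Hadoop's actual FSNamesystem code: the corrupt-replicas map keys on (block, datanode), so a single reporting client can flag every replica of a block, and repeat reports from the same client show up as "duplicate requested".

```python
# Simplified sketch (assumption: illustration only, not Hadoop's Java code)
# of how one reporting client can mark every replica of a block corrupt.

class CorruptReplicasMap:
    def __init__(self):
        # block id -> set of datanodes whose replica is marked corrupt
        self.corrupt = {}

    def add_to_corrupt_replicas_map(self, block, datanode, reporter):
        nodes = self.corrupt.setdefault(block, set())
        if datanode in nodes:
            return "duplicate requested for %s to add as corrupt on %s by /%s" % (
                block, datanode, reporter)
        nodes.add(datanode)
        return "%s added as corrupt on %s by /%s" % (block, datanode, reporter)

    def num_corrupt_replicas(self, block):
        return len(self.corrupt.get(block, set()))


if __name__ == "__main__":
    m = CorruptReplicasMap()
    block = "blk_-1426587446408804113"
    reporter = "ZZ.YY.XX.246"  # the single bad client node
    for dn in ("ZZ.YY.XX.220:1004", "ZZ.YY.XX.252:1004", "ZZ.YY.XX.107:1004"):
        print(m.add_to_corrupt_replicas_map(block, dn, reporter))
    # a retry from the same client yields the "duplicate requested" lines
    print(m.add_to_corrupt_replicas_map(block, "ZZ.YY.XX.252:1004", reporter))
    # all 3 replicas now flagged corrupt, all from one reporter
    print(m.num_corrupt_replicas(block))
```

With replication 3 and all 3 replicas flagged, the block looks fully corrupt to the namenode even though no datanode actually has bad data.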

User Tasklogs on ZZ.YY.XX.246 showing
{noformat}
[[email protected] ~]# find /my/mapred/userlogs/ -type f -exec grep 
1426587446408804113 \{\} \; -print
org.apache.hadoop.fs.ChecksumException: Checksum error: 
/blk_-1426587446408804113:of:/myfile/part-00145.gz at 222720
2010-08-31 10:47:56,256 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum 
error for blk_-1426587446408804113_970819282 from ZZ.YY.XX.220:1004 at 222720
org.apache.hadoop.fs.ChecksumException: Checksum error: 
/blk_-1426587446408804113:of:/myfile/part-00145.gz at 103936
2010-08-31 10:47:56,284 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum 
error for blk_-1426587446408804113_970819282 from ZZ.YY.XX.252:1004 at 103936
org.apache.hadoop.fs.ChecksumException: Checksum error: 
/blk_-1426587446408804113:of:/myfile/part-00145.gz at 250368
2010-08-31 10:47:56,464 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum 
error for blk_-1426587446408804113_970819282 from ZZ.YY.XX.107:1004 at 250368
2010-08-31 10:47:56,490 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_-1426587446408804113_970819282 from any node: java.io.IOException: No 
live nodes contain current block. Will get new block locations from namenode 
and retry...
{noformat}

This was consistent across all 12 files reported as corrupt.  Every report came from 
the same node, ZZ.YY.XX.246.
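A quick way to confirm every report traces back to one client is to tally the reporting node out of the namenode log lines (a hypothetical helper, assuming the lines are shaped like the excerpt above with "... by /<client-ip>" at the end):

```python
# Tally corrupt-replica reports per reporting client from namenode log
# lines shaped like the excerpt above (assumption: the reporter always
# appears as "by /<client>" at the end of the line).

import re
from collections import Counter

def corrupt_reports_by_client(log_lines):
    counts = Counter()
    for line in log_lines:
        m = re.search(r"addToCorruptReplicasMap: .* by /(\S+)", line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "NameSystem.addToCorruptReplicasMap: blk_-1 added as corrupt on ZZ.YY.XX.220:1004 by /ZZ.YY.XX.246",
    "NameSystem.addToCorruptReplicasMap: blk_-1 added as corrupt on ZZ.YY.XX.252:1004 by /ZZ.YY.XX.246",
    "NameSystem.addToCorruptReplicasMap: blk_-1 added as corrupt on ZZ.YY.XX.107:1004 by /ZZ.YY.XX.246",
]
print(corrupt_reports_by_client(sample))  # a single reporting client
```

Running this over the full log for all 12 files shows a single key: the one bad node.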


When I tried to pull this file from another, healthy node, to my surprise it didn't 
fail.

{noformat}
[knogu...@gwgd4003 ~]$ hadoop dfs -ls /myfile/part-00145.gz
Found 1 items
-rw-r--r--   3 user1 users   67771377 2010-08-31 06:46 /myfile/part-00145.gz

[knogu...@gwgd4003 ~]$ hadoop fsck /myfile/part-00145.gz
.
/myfile/part-00145.gz: CORRUPT block blk_-1426587446408804113
Status: CORRUPT
 Total size:    67771377 B
 Total dirs:    0
 Total files:   1
 Total blocks (validated):      1 (avg. block size 67771377 B)
  ********************************
  CORRUPT FILES:        1
  CORRUPT BLOCKS:       1
  ********************************
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                1
 Missing replicas:              0 (0.0 %)


The filesystem under path '/myfile/part-00145.gz' is CORRUPT
[knogu...@gwgd4003 ~]$
[knogu...@gwgd4003 ~]$ hadoop dfs -get /myfile/part-00145.gz /tmp
[knogu...@gwgd4003 ~]$ echo $?
0
[knogu...@gwgd4003 ~]$ ls -l /tmp/part-00145.gz
-rw-r--r-- 1 knoguchi users 67771377 Sep  2 21:04 /tmp/part-00145.gz
[knogu...@gwgd4003 ~]$
{noformat}
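The fsck-vs-get discrepancy makes sense once you note what each one checks (again a simplified sketch of the idea, not the actual HDFS code): fsck only consults the namenode's corrupt-replica metadata, while a reading client verifies checksums against the replica bytes themselves and keeps trying replicas until one validates.

```python
# Simplified model (assumption: illustration only) of why fsck reports
# CORRUPT while a client read of the same block succeeds: fsck trusts
# namenode metadata, the client trusts the actual replica data.

import zlib

def fsck_status(corrupt_replicas, replication):
    # fsck: block is CORRUPT when every known replica is flagged corrupt
    return "CORRUPT" if len(corrupt_replicas) >= replication else "HEALTHY"

def client_read(replicas):
    # client: try replicas in turn, return the first whose checksum matches
    for data, stored_checksum in replicas:
        if zlib.crc32(data) == stored_checksum:
            return data
    raise IOError("No live nodes contain current block")

data = b"part-00145.gz contents"
replicas = [(data, zlib.crc32(data))] * 3     # on-disk replicas are fine

# one bad client node has flagged all 3 replicas in namenode metadata
corrupt_replicas = {"dn1:1004", "dn2:1004", "dn3:1004"}

print(fsck_status(corrupt_replicas, replication=3))  # CORRUPT
print(client_read(replicas) == data)                 # True: -get still works
```

So a single node with bad hardware (or a bad checksum path on the read side) can flip the namenode's view of a block to corrupt without any replica actually being damaged, which is exactly what this issue is about.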




> One bad node can incorrectly flag many files as corrupt
> -------------------------------------------------------
>
>                 Key: HDFS-1371
>                 URL: https://issues.apache.org/jira/browse/HDFS-1371
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20.1
>         Environment: yahoo internal version 
> [knogu...@gwgd4003 ~]$ hadoop version
> Hadoop 0.20.104.3.1007030707
>            Reporter: Koji Noguchi
>
> On our cluster, 12 files were reported as corrupt by fsck even though the 
> replicas on the datanodes were healthy.
> Turns out that all the replicas (12 files x 3 replicas per file) were 
> reported corrupt from one node.
> Surprisingly, these files were still readable/accessible from dfsclient 
> (-get/-cat) without any problems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
