Hey Raghu,
I never heard back from you about whether any of these fixes are ready
to try out. Things are getting kind of bad here.
Even at three replicas, I found one block which has all three replicas
of length=0. Grepping through the logs, I get things like this:
2008-12-18 22:45:04,680 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(172.16.1.121:50010,
storageID=DS-1732140560-172.16.1.121-50010-1228236234012,
infoPort=50075, ipcPort=50020):Got exception while serving
blk_7345861444716855534_7201 to /172.16.1.1:
java.io.IOException: Offset 35307520 and length 10485760 don't match
block blk_7345861444716855534_7201 ( blockLen 0 )
java.io.IOException: Offset 35307520 and length 10485760 don't match
block blk_7345861444716855534_7201 ( blockLen 0 )
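As far as I can tell, that exception is just the datanode's range check tripping over the empty replica. Here's my reading of the condition as a sketch (not the actual HDFS source), plugging in the numbers from the log:

```shell
# Sketch of the sanity check behind that exception (my reading, not the
# real HDFS code): a read is refused when the requested offset + length
# runs past the replica length the datanode sees on disk.
offset=35307520
length=10485760
blockLen=0    # the truncated replica
if [ $((offset + length)) -gt "$blockLen" ]; then
    echo "Offset $offset and length $length don't match block ( blockLen $blockLen )"
fi
```

So any read of that block fails, since every request extends past length 0.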
On the other hand, if I look for the block scanner activity:
2008-12-08 13:59:15,616 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_7345861444716855534_7201
There is indeed a zero-sized file on disk and matching *correct*
metadata:
[r...@node121 ~]# find /hadoop-data/ -name *7345861444716855534* -exec
ls -lh {} \;
-rw-r--r-- 1 root root 7 Dec 3 15:44 /hadoop-data/dfs/data/current/
subdir9/subdir6/blk_7345861444716855534_7201.meta
-rw-r--r-- 1 root root 0 Dec 3 15:44 /hadoop-data/dfs/data/current/
subdir9/subdir6/blk_7345861444716855534
The metadata matches the 0-sized block, not the full one, of course.
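Incidentally, a 7-byte .meta file is exactly what an empty replica should have, if I remember the on-disk format right: just the header, no checksum entries. A rough consistency check, assuming the usual 7-byte header plus one 4-byte CRC per 512-byte chunk (bytesPerChecksum=512; adjust if your config differs):

```shell
# Rough sketch: does a .meta file's size match its block file's length?
# Assumes a 7-byte header plus one 4-byte CRC per 512-byte chunk
# (bytesPerChecksum=512 -- an assumption, check your configuration).
check_meta() {
    blockLen=$(stat -c %s "$1")   # block file
    metaLen=$(stat -c %s "$2")    # its .meta file
    expected=$(( 7 + 4 * ( (blockLen + 511) / 512 ) ))
    if [ "$metaLen" -eq "$expected" ]; then
        echo "consistent"
    else
        echo "MISMATCH: meta=$metaLen expected=$expected for blockLen=$blockLen"
    fi
}
# e.g.: check_meta blk_7345861444716855534 blk_7345861444716855534_7201.meta
```

For a 0-length block the expected meta size is 7 bytes, which is what's on disk here, so the pair is self-consistent even though both are truncated.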
We recently went from 2 replicas to 3 replicas on Dec 11. On Dec 12,
a replica was created on node191:

[r...@node191 ~]# find /hadoop-data/ -name *7345861444716855534* -exec
ls -lh {} \;
-rw-r--r-- 1 root root 7 Dec 12 08:53 /hadoop-data/dfs/data/current/
subdir40/subdir37/subdir42/blk_7345861444716855534_7201.meta
-rw-r--r-- 1 root root 0 Dec 12 08:53 /hadoop-data/dfs/data/current/
subdir40/subdir37/subdir42/blk_7345861444716855534
The corresponding log entries are here:
2008-12-12 08:53:09,014 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_7345861444716855534_7201 src: /172.16.1.121:47799 dest: /
172.16.1.191:50010
2008-12-12 08:53:17,134 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
blk_7345861444716855534_7201 src: /172.16.1.121:47799 dest: /
172.16.1.191:50010 of size 0
So, the incorrectly-sized block had a new copy created, the datanode
reported the incorrect size (!), and the namenode never deleted it
afterward. I unfortunately don't have the namenode logs from this
period.
Brian
On Dec 16, 2008, at 4:10 PM, Raghu Angadi wrote:
Brian Bockelman wrote:
Hey,
I hit a bit of a roadbump in solving the "truncated block issue" at
our site: namely, some of the blocks appear perfectly valid to the
datanode. The block verifies, but it is still the wrong size (it
appears that the metadata is too small too).
What's the best way to proceed? It appears that either (a) the
block scanner needs to report to the datanode the size of the block
it just verified, which is possibly a scaling issue or (b) the
metadata file needs to save the correct block size, which is a
pretty major modification, as it requires a change of the on-disk
format.
This should be detected by the NameNode, i.e., it should detect that
this replica is shorter (either compared to the other replicas or to
the expected size). There are various fixes (recent or in progress)
in this area of the NameNode, and this case is mostly covered by one
of those, or should be soon.
Raghu.
Ideas?
Brian