Actually we do have the namenode logs for the period Brian mentioned.
In Brian's email, he shows the log entries on node191 corresponding to
it storing the third (new) replica of the block in question. The
namenode log from that period shows:
2008-12-12 08:53:02,637 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 172.16.1.121:50010 to replicate blk_7345861444716855534_7201 to datanode(s) 172.16.1.191:50010
2008-12-12 08:53:17,127 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 172.16.1.191:50010 is added to blk_7345861444716855534_7201 size 134217728
As you can see, node191 correctly reported receiving the block with size 0 (matching the datanode log Brian quoted), yet the namenode recorded node191's replica with size 134217728.
Looking through more of the namenode logs, I found another instance where the namenode re-replicated the block after a node went down; just as with node191 here, the new datanode stored the block with size 0 while the namenode seemed to think everything was correct.
- Garhan Attebury
On Dec 19, 2008, at 11:57 AM, Brian Bockelman wrote:
Hey Raghu,
I never heard back from you about whether any of these fixes are
ready to try out. Things are getting kind of bad here.
Even at three replicas, I found one block which has all three
replicas of length=0. Grepping through the logs, I get things like
this:
2008-12-18 22:45:04,680 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(172.16.1.121:50010, storageID=DS-1732140560-172.16.1.121-50010-1228236234012, infoPort=50075, ipcPort=50020):Got exception while serving blk_7345861444716855534_7201 to /172.16.1.1:
java.io.IOException: Offset 35307520 and length 10485760 don't match block blk_7345861444716855534_7201 ( blockLen 0 )
java.io.IOException: Offset 35307520 and length 10485760 don't match block blk_7345861444716855534_7201 ( blockLen 0 )
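For context, that exception is the datanode's range check when serving a read: the requested offset plus length must fit inside the on-disk block length, and with blockLen 0 any non-trivial read fails. A rough sketch of that check (my paraphrase of the BlockSender logic, not the actual Hadoop code):

```python
def check_read_range(offset, length, block_len):
    # Paraphrase of the datanode's sanity check before serving a block read:
    # the requested byte range must lie entirely within the stored replica.
    if offset < 0 or offset > block_len or (offset + length) > block_len:
        raise IOError(
            "Offset %d and length %d don't match block ( blockLen %d )"
            % (offset, length, block_len)
        )

# The failing request from the log above: a 10 MB read at offset ~35 MB
# from a replica whose on-disk length is 0.
try:
    check_read_range(35307520, 10485760, 0)
except IOError as e:
    print(e)
```

So the read failures are exactly what you'd expect once the replica on disk is truncated to zero bytes.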
On the other hand, if I look for the block scanner activity:
2008-12-08 13:59:15,616 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_7345861444716855534_7201
There is indeed a zero-sized file on disk and matching *correct*
metadata:
[r...@node121 ~]# find /hadoop-data/ -name *7345861444716855534* -exec ls -lh {} \;
-rw-r--r-- 1 root root 7 Dec 3 15:44 /hadoop-data/dfs/data/current/subdir9/subdir6/blk_7345861444716855534_7201.meta
-rw-r--r-- 1 root root 0 Dec 3 15:44 /hadoop-data/dfs/data/current/subdir9/subdir6/blk_7345861444716855534
The metadata matches the 0-sized block, not the full one, of course.
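For what it's worth, a 7-byte .meta file is exactly what an empty block should leave behind: as I read the datanode code, the meta file starts with a 2-byte version followed by a checksum header (1-byte type plus a 4-byte bytesPerChecksum), and a zero-length block has no checksum entries after that. A sketch of that assumed layout:

```python
import struct

# Assumed .meta header layout for this HDFS version (my reading of the
# datanode code, so treat the field layout as an assumption):
#   2-byte big-endian version, 1-byte checksum type, 4-byte bytesPerChecksum.
METADATA_VERSION = 1
CHECKSUM_CRC32 = 1
BYTES_PER_CHECKSUM = 512

header = struct.pack(">hbi", METADATA_VERSION, CHECKSUM_CRC32, BYTES_PER_CHECKSUM)

# An empty block writes only this header -- the 7 bytes seen in the listing.
print(len(header))  # 7
```

So the 7-byte meta files are internally consistent with 0-byte replicas; the metadata isn't corrupt, the data is just gone.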
We recently went from 2 replicas to 3 replicas on Dec 11. On Dec 12, a replica was created on node191:
[r...@node191 ~]# find /hadoop-data/ -name *7345861444716855534* -exec ls -lh {} \;
-rw-r--r-- 1 root root 7 Dec 12 08:53 /hadoop-data/dfs/data/current/subdir40/subdir37/subdir42/blk_7345861444716855534_7201.meta
-rw-r--r-- 1 root root 0 Dec 12 08:53 /hadoop-data/dfs/data/current/subdir40/subdir37/subdir42/blk_7345861444716855534
The corresponding log entries are here:
2008-12-12 08:53:09,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_7345861444716855534_7201 src: /172.16.1.121:47799 dest: /172.16.1.191:50010
2008-12-12 08:53:17,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_7345861444716855534_7201 src: /172.16.1.121:47799 dest: /172.16.1.191:50010 of size 0
So, a new copy of the incorrectly-sized block was created, the datanode reported the incorrect size (!), and the namenode never deleted the bad replica afterward. I unfortunately don't have the namenode logs from this period.
Brian