Had a really peculiar thing happen today: a file that a job of mine created on HDFS seems to have disappeared, and I'm scratching my head as to how this could have happened without any errors getting thrown.

I ran an M/R job that created a large batch of files. The job completed without errors, and the file in question appears to have been created correctly. However, when another automated process went to use the file later, it failed because the file wasn't there.

When I Sherlock Holmes my way through various log files, I see the following:

1) M/R job task says that it successfully created the file:

2010-11-10 21:42:28,356 INFO <our package name>.StateFileGeneratorOutputFormat: Data from partition #1 being written to file hdfs://<our name node name>:9000<our root dir>/2010.11.10-21.05.29/output/_temporary/_attempt_201010221550_0418_r_000001_0/shard2/IntentTrait.state

2) Name node also says that it created the file:

[r...@hdmaster hadoop-0.20]# grep 2010.11.10-21.05.29 hadoop-hadoop-namenode-hdmaster.log.2010-11-10 | grep shard2/IntentTrait.state
2010-11-10 21:42:28,442 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root,bin,daemon,sys,adm,disk,wheel ip=/<our data node> cmd=create src=<our root dir>/2010.11.10-21.05.29/output/_temporary/_attempt_201010221550_0418_r_000001_0/shard2/IntentTrait.state dst=null perm=root:supergroup:rw-r--r--
2010-11-10 21:42:29,802 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: <our root dir>/2010.11.10-21.05.29/output/_temporary/_attempt_201010221550_0418_r_000001_0/shard2/IntentTrait.state. blk_-4237880568969698703_13404582
2010-11-10 21:42:30,360 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on file <our root dir>/2010.11.10-21.05.29/output/_temporary/_attempt_201010221550_0418_r_000001_0/shard2/IntentTrait.state from client DFSClient_attempt_201010221550_0418_r_000001_0
2010-11-10 21:42:30,360 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file <our root dir>/2010.11.10-21.05.29/output/_temporary/_attempt_201010221550_0418_r_000001_0/shard2/IntentTrait.state is closed by DFSClient_attempt_201010221550_0418_r_000001_0

2a) Name node does *not* say in the logs anywhere that it deleted the file. (i.e., the above 4 log lines are the only lines in the name node log containing this file name)
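For what it's worth, here's roughly the filter I've been using to double-check that: it should surface any line mentioning the file that isn't part of the normal create/allocate/close sequence, e.g. a cmd=delete or cmd=rename audit entry. (The `find_non_create_ops` helper name is my own; the log file name and path fragment are the same ones shown above.)

```shell
# Print any line mentioning the given path that is NOT part of the normal
# create/allocate/close sequence (so a delete or rename would stand out).
find_non_create_ops() {
  local logfile="$1" path="$2"
  grep "$path" "$logfile" \
    | grep -Ev 'cmd=create|allocateBlock|Removing lease|completeFile'
}

# Usage (in my case this prints nothing, i.e. no delete/rename was logged):
# find_non_create_ops hadoop-hadoop-namenode-hdmaster.log.2010-11-10 shard2/IntentTrait.state
```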

3) Data node says that it created the file ...

[r...@hddata09 hadoop-0.20]# grep 4237880568969698703_13404582 hadoop-hadoop-datanode-hddata09.log.2010-11-10
2010-11-10 21:42:29,827 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-4237880568969698703_13404582 src: /<our data node>:43883 dest: /<our data node>:50010
2010-11-10 21:42:30,354 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /<our data node>:43883, dest: /<our data node>:50010, bytes: 326522, op: HDFS_WRITE, cliID: DFSClient_attempt_201010221550_0418_r_000001_0, offset: 0, srvID: DS-229869750-<our data node>-50010-1286316976632, blockid: blk_-4237880568969698703_13404582, duration: 150566000
2010-11-10 21:42:30,354 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_-4237880568969698703_13404582 terminating

4) ... but then it deletes it a few seconds later:

2010-11-10 21:42:34,831 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_-4237880568969698703_13404582 file /usr/local/hadoop/dfs/data/current/subdir36/subdir10/blk_-4237880568969698703 for deletion
2010-11-10 21:42:34,832 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_-4237880568969698703_13404582 at file /usr/local/hadoop/dfs/data/current/subdir36/subdir10/blk_-4237880568969698703
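
To make the create-to-delete window easy to eyeball across the data node logs, I've been pulling out just the lifecycle lines for the block with a small helper like this (`block_lifecycle` is my own name for it; the grep patterns match the log lines quoted above):

```shell
# Show only the "Receiving block", "Scheduling block", and "Deleted block"
# lines for one block id, so the gap between creation and deletion is
# obvious at a glance.
block_lifecycle() {
  local logfile="$1" blockid="$2"
  grep -E "(Receiving block|Scheduling block|Deleted block).*${blockid}" "$logfile"
}

# Usage:
# block_lifecycle hadoop-hadoop-datanode-hddata09.log.2010-11-10 blk_-4237880568969698703_13404582
```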


Note that a) no errors were thrown anywhere about this file or block, and b) there's nothing in the name node log indicating that the file was explicitly deleted.

I'm at a loss as to how this could have happened. Anyone have any ideas or further debugging suggestions?

Thanks,

DR
