Had a really peculiar thing happen today: a file that a job of mine
created on HDFS seems to have disappeared, and I'm scratching my head as
to how this could have happened without any errors getting thrown.
I ran an M/R job that created a whole bunch of files. The job completed
without errors, and the file in question appeared to have been created
correctly. However, when another automated process went to use the file
later, it failed because the file wasn't there.
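(To rule out the obvious: the file really is gone from the final output
location too. I'm assuming here that the committed file should have
ended up at the attempt path minus the _temporary/_attempt_* portion,
i.e. something like

hadoop fs -ls <our root dir>/2010.11.10-21.05.29/output/shard2/IntentTrait.state

comes back empty.)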
Playing Sherlock Holmes through the various log files, here's what I
see:
1) The M/R reduce task says that it successfully created the file:
2010-11-10 21:42:28,356 INFO <our package
name>.StateFileGeneratorOutputFormat: Data from partition #1 being
written to file hdfs://<our name node name>:9000<our root
dir>/2010.11.10-21.05.29/output/_temporary/_attempt_201010221550_0418_r_000001_0/shard2/IntentTrait.state
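Note that this is the attempt's _temporary scratch path; as I
understand the commit protocol, the file only gets promoted to the
final output directory when the task attempt is committed. One thing I
still want to check is what the JobTracker thought happened to this
attempt (committed? killed, e.g. as a losing speculative duplicate?),
with something like the following (the log file name is my guess,
patterned after the name node / data node log names below):

grep attempt_201010221550_0418_r_000001_0 hadoop-hadoop-jobtracker-hdmaster.log.2010-11-10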
2) The name node also says that the file was created:
[r...@hdmaster hadoop-0.20]# grep 2010.11.10-21.05.29
hadoop-hadoop-namenode-hdmaster.log.2010-11-10 | grep
shard2/IntentTrait.state
2010-11-10 21:42:28,442 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=root,root,bin,daemon,sys,adm,disk,wheel ip=/<our data node>
cmd=create src=<our root
dir>/2010.11.10-21.05.29/output/_temporary/_attempt_201010221550_0418_r_000001_0/shard2/IntentTrait.state
dst=null perm=root:supergroup:rw-r--r--
2010-11-10 21:42:29,802 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock: <our root
dir>/2010.11.10-21.05.29/output/_temporary/_attempt_201010221550_0418_r_000001_0/shard2/IntentTrait.state.
blk_-4237880568969698703_13404582
2010-11-10 21:42:30,360 INFO org.apache.hadoop.hdfs.StateChange:
Removing lease on file <our root
dir>/2010.11.10-21.05.29/output/_temporary/_attempt_201010221550_0418_r_000001_0/shard2/IntentTrait.state
from client DFSClient_attempt_201010221550_0418_r_000001_0
2010-11-10 21:42:30,360 INFO org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.completeFile: file <our root
dir>/2010.11.10-21.05.29/output/_temporary/_attempt_201010221550_0418_r_000001_0/shard2/IntentTrait.state
is closed by DFSClient_attempt_201010221550_0418_r_000001_0
2a) The name node does *not* say anywhere in its logs that it deleted
the file. (I.e., the above four log entries are the only lines in the
name node log containing this file name.)
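One caveat on my own grep, though: if the output committer cleaned up
by deleting the whole _temporary attempt directory in a single
operation, the audit log would record the directory path, not the
individual file name. So a broader search might still turn something
up, e.g.:

grep 'cmd=delete' hadoop-hadoop-namenode-hdmaster.log.2010-11-10 | grep attempt_201010221550_0418_r_000001_0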
3) The data node says that it received the file's block ...
[r...@hddata09 hadoop-0.20]# grep 4237880568969698703_13404582
hadoop-hadoop-datanode-hddata09.log.2010-11-10
2010-11-10 21:42:29,827 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-4237880568969698703_13404582 src: /<our data node>:43883 dest:
/<our data node>:50010
2010-11-10 21:42:30,354 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /<our
data node>:43883, dest: /<our data node>:50010, bytes: 326522, op:
HDFS_WRITE, cliID: DFSClient_attempt_201010221550_0418_r_000001_0,
offset: 0, srvID: DS-229869750-<our data node>-50010-1286316976632,
blockid: blk_-4237880568969698703_13404582, duration: 150566000
2010-11-10 21:42:30,354 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
block blk_-4237880568969698703_13404582 terminating
4) ... but then it deletes it a few seconds later:
2010-11-10 21:42:34,831 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
blk_-4237880568969698703_13404582 file
/usr/local/hadoop/dfs/data/current/subdir36/subdir10/blk_-4237880568969698703
for deletion
2010-11-10 21:42:34,832 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
blk_-4237880568969698703_13404582 at file
/usr/local/hadoop/dfs/data/current/subdir36/subdir10/blk_-4237880568969698703
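As far as I understand it, a data node only deletes a block when the
name node tells it to (via an invalidation in a heartbeat response), so
even though the file name never shows up next to a delete, the block ID
might. Assuming 0.20 logs those invalidations at INFO, something like
this on the name node side might catch the corresponding message:

grep 4237880568969698703 hadoop-hadoop-namenode-hdmaster.log.2010-11-10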
Note that a) no errors were thrown anywhere about this file or block,
and b) there's nothing in the name node log indicating that the file was
explicitly deleted.
I'm stumped as to how this could have happened. Does anyone have any
ideas or further debugging suggestions?
Thanks,
DR