On Wed, 24 Nov 2010 10:30:09 +0100 Erik Forsberg <forsb...@opera.com> wrote:
> Hi!
>
> I'm having some trouble with Map/Reduce jobs failing due to HDFS
> errors. I've been digging around the logs trying to figure out what's
> happening, and I see the following in the datanode logs:
>
> 2010-11-19 10:27:01,059 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in
> BlockReceiver.lastNodeRun: java.io.IOException: No temporary
> file /opera/log4/hadoop/dfs/data/tmp/blk_-8143694940938019938 for
> block blk_-8143694940938019938_6144372 at
> <snip>
>
> What would be the possible causes of such exceptions?

It seems the cause of this was my puppetd not being able to detect that
the datanode was already running, which made it try to start a second
datanode. That in turn seems to cause the tmp directories to be cleaned
before the second datanode finds out that the storage directories are
locked. Some kind of race condition, I would guess, since it only
happens on systems under high load.

More details here:
https://groups.google.com/a/cloudera.org/group/cdh-user/browse_frm/thread/d4572d2d1191be91#

\EF
-- 
Erik Forsberg <forsb...@opera.com>
Developer, Opera Software - http://www.opera.com/
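P.S. In case it helps anyone else hitting this: the check puppet was
missing is essentially "is there already a live datanode pid?". Below
is a rough sketch of that guard, nothing more -- the pid file path and
init script name are assumptions and differ between Hadoop packagings,
so adjust for your layout:

#!/usr/bin/env python
# Sketch of a "don't start a second datanode" guard.
# PID_FILE and INIT_SCRIPT are assumed paths -- adjust for your setup.
import errno
import os
import subprocess

PID_FILE = "/var/run/hadoop/hadoop-hadoop-datanode.pid"  # assumed location
INIT_SCRIPT = "/etc/init.d/hadoop-0.20-datanode"         # assumed name

def datanode_running():
    """Return True if PID_FILE names a process that is still alive."""
    try:
        with open(PID_FILE) as f:
            pid = int(f.read().strip())
    except (IOError, ValueError):
        return False          # no pid file, or garbage in it
    try:
        os.kill(pid, 0)       # signal 0: existence check, sends nothing
    except OSError as e:
        # EPERM means the process exists but belongs to another user
        return e.errno == errno.EPERM
    return True

if __name__ == "__main__":
    if datanode_running():
        print("datanode already running, not starting a second one")
    else:
        subprocess.check_call([INIT_SCRIPT, "start"])

In puppet itself the equivalent is making sure the service resource's
status check (hasstatus/status/pattern) really matches the running
datanode, but how to do that depends on your manifests.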