Just in case this can help somebody else, and because I just spent a couple of hours debugging this, I thought I would share an insight. This only affects locally running jobs, not the DFS, and should only affect Windows users.

On Windows with Hadoop 0.14 and below, you used to be able to do something like this:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop</value>
  <description>A base for other temporary directories.</description>
</property>

Essentially, you could ignore the C: drive prefix. In Hadoop 0.15 and above, Hadoop won't complain when the job starts, but you will start getting errors like the following while running jobs such as the Nutch Injector:

java.io.IOException: Target file:/C:/nutch/hadoop/mapred/temp/inject-temp-241790994/_reduce_bcubf6/part-00000 already exists
        at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)

What is happening here is that the file system code in Hadoop has changed, so some Path objects are getting resolved to / and some are getting resolved to C:/. (See the RawLocalFileStatus(File f) constructor in RawLocalFileSystem if you are interested; it happens in the f.toURI().toString() constructor parameter.)
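
A quick way to see the drive-letter behavior is plain java.io.File, outside of Hadoop entirely (a standalone sketch; ToUriDemo is just a made-up name, run it from a working directory on C:):

import java.io.File;

public class ToUriDemo {
    public static void main(String[] args) {
        // A drive-relative path: java.io.File resolves it against the
        // current drive of the working directory.
        System.out.println(new File("/tmp/hadoop").toURI());
        // From a working directory on C:, prints something like
        // file:/C:/tmp/hadoop

        // A fully qualified path keeps the drive you wrote:
        System.out.println(new File("C:/tmp/hadoop").toURI());
        // Prints something like file:/C:/tmp/hadoop
    }
}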

Hadoop sometimes creates relative paths to move files around, so a path in the C:/ form resolved against a base of / becomes /C:/..., which is an absolute path, and the job fails because it is effectively trying to copy a file onto itself.
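
To make the mismatch concrete, here is a contrived sketch (a naive string join standing in for Hadoop's actual relative-path resolution, which is more involved):

public class MixedRootDemo {
    public static void main(String[] args) {
        String slashRoot = "/";                       // base that resolved to /
        String driveForm = "C:/nutch/hadoop/mapred";  // path that resolved to C:/
        String joined = slashRoot + driveForm;
        System.out.println(joined); // /C:/nutch/hadoop/mapred
        // On Windows this names the same directory as C:/nutch/hadoop/mapred,
        // but as a string it matches neither original form, so the copy's
        // already-exists check in FileUtil.checkDest fires.
    }
}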

So, long story short: on Windows, when running local jobs with Hadoop >= 0.15, always use the C:/ notation to avoid problems.
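
For example, the hadoop.tmp.dir property from above becomes:

<property>
  <name>hadoop.tmp.dir</name>
  <value>C:/tmp/hadoop</value>
  <description>A base for other temporary directories.</description>
</property>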

Dennis Kubes
