Just in case this can help somebody else, and because I just spent a
couple of hours debugging this, I thought I would share an insight. This
only affects locally running jobs, not the DFS, and should only affect
Windows users.
On Windows with Hadoop 0.14 and below, you used to be able to do
something like this:
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop</value>
<description>A base for other temporary directories.</description>
</property>
Essentially ignoring the C: drive prefix. Well, in Hadoop 0.15 and
above, while Hadoop won't complain when the job starts, you will start
getting errors like the following when running jobs such as the Nutch
injector:
java.io.IOException: Target
file:/C:/nutch/hadoop/mapred/temp/inject-temp-241790994/_reduce_bcubf6/part-00000
already exists
at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
What is happening here is that the file system code in Hadoop has
changed, so some Path objects are getting resolved to / and some are
getting resolved to C:/. (See the RawLocalFileStatus(File f) constructor
in RawLocalFileSystem if you are interested; it happens in the
f.toURI().toString() constructor parameter.)
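If you want to see the resolution for yourself, here is a minimal
sketch (plain java.io.File, no Hadoop classes needed; run it on
Windows):

import java.io.File;

public class DriveResolutionDemo {
    public static void main(String[] args) {
        // The configured value, with no drive letter.
        String tmpDir = "/tmp/hadoop";

        // As a plain path string it stays drive-less: \tmp\hadoop
        System.out.println(new File(tmpDir).getPath());

        // toURI() first makes the file absolute, which on Windows pulls
        // in the current drive: something like file:/C:/tmp/hadoop
        System.out.println(new File(tmpDir).toURI().toString());
    }
}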
Hadoop sometimes creates relative paths to move files around, so a path
of C:/ resolved relative to a path of / becomes /C:/..., which is an
absolute path, and the job fails because it ends up trying to copy a
file onto itself.
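In string terms, the mixing looks roughly like this (purely
illustrative, not actual Hadoop code; the path is the one from the
stack trace):

public class PathMixDemo {
    public static void main(String[] args) {
        // One path resolved without the drive, one resolved with it.
        String base = "/";
        String child = "C:/nutch/hadoop/mapred/temp/...";
        // Treating child as relative to base yields /C:/..., which Hadoop
        // then treats as a distinct absolute path:
        System.out.println(base + child); // prints /C:/nutch/hadoop/mapred/temp/...
    }
}

FileUtil.checkDest then finds that the copy target already exists,
which is the IOException shown above.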
So, long story short: on Windows, when running local jobs with Hadoop >=
0.15, always use the C:/ notation to avoid problems.
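For example, the property above becomes:
<property>
<name>hadoop.tmp.dir</name>
<value>C:/tmp/hadoop</value>
<description>A base for other temporary directories.</description>
</property>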
Dennis Kubes