The fs.default.name Hadoop config variable sets the default filesystem, which determines whether you are writing to the DFS or to the local filesystem.

For local, the hadoop.tmp.dir config variable sets where intermediate files (those produced during map-reduce processing) are written. The default for that (on Linux) is /tmp/hadoop-${user.name}. Local output files are written to either:

1) A location relative to the directory in which Nutch was started, if using a relative path. So if running bin/nutch, it will be the Nutch base directory.
2) The exact local filesystem location, if using a fully qualified path and you have permission to write to that location.

If writing out to DFS the process is the same, although intermediate files will be on the local disk of each node as it processes its tasks. Final output will be written:

1) Relative to the user directory on the DFS if relative path.
2) The exact location on the dfs if fully qualified pathname.
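
As a rough illustration of the relative vs. fully qualified rule above, here is a minimal sketch in plain Java. It is not Hadoop's or Nutch's own code. The class name OutputPathSketch, the method resolve, and the example directories (/opt/nutch, /user/nutch, /data/crawl) are all hypothetical, and it only handles slash-rooted paths, not full URIs.

```java
/** Illustrative only: models the path rule described above, not Hadoop's own resolution code. */
final class OutputPathSketch {

    /**
     * @param baseDir directory that relative paths resolve against: the directory
     *                Nutch was started in (local) or the user directory (DFS)
     * @param output  the output path as given to the job
     */
    static String resolve(String baseDir, String output) {
        // A path starting with "/" is treated as fully qualified and used as-is.
        if (output.startsWith("/")) {
            return output;
        }
        // Otherwise it is appended to the base directory.
        return baseDir.endsWith("/") ? baseDir + output : baseDir + "/" + output;
    }

    public static void main(String[] args) {
        // Local, relative path: lands under the (hypothetical) Nutch start directory.
        System.out.println(resolve("/opt/nutch", "crawl/segments")); // prints /opt/nutch/crawl/segments
        // DFS, relative path: lands under the (hypothetical) user directory.
        System.out.println(resolve("/user/nutch", "crawl/segments")); // prints /user/nutch/crawl/segments
        // Fully qualified path: used exactly as given.
        System.out.println(resolve("/opt/nutch", "/data/crawl")); // prints /data/crawl
    }
}
```

The practical point for a read-only install is the third case: passing a fully qualified output path to a writable location avoids any dependence on the directory Nutch was started in.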

DFS files are stored in blocks, and the blocks are stored on different nodes, per the replication level, under the directory specified by the dfs.data.dir Hadoop config variable, which is ${hadoop.tmp.dir}/dfs/data by default.
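
To make the ${hadoop.tmp.dir} reference concrete, here is a small sketch of how such a default expands. It is a simplified stand-in written in plain Java, not Hadoop's Configuration class. The class name ConfigExpansionSketch, the method expand, and the user name "alice" are assumptions for illustration. The property names and default values are the ones mentioned above, with the user variable written as ${user.name} as in Hadoop's shipped defaults.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a simplified model of ${var} expansion in config values. */
final class ConfigExpansionSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    static String expand(String value, Map<String, String> props) {
        String current = value;
        // Bounded number of passes so a self-referencing property cannot loop forever.
        for (int pass = 0; pass < 20; pass++) {
            Matcher m = VAR.matcher(current);
            if (!m.find()) {
                return current; // nothing left to expand
            }
            String replacement = props.get(m.group(1));
            if (replacement == null) {
                return current; // stop at the first unknown variable and leave it as-is
            }
            current = current.substring(0, m.start()) + replacement + current.substring(m.end());
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("user.name", "alice"); // hypothetical user
        props.put("hadoop.tmp.dir", "/tmp/hadoop-${user.name}");
        props.put("dfs.data.dir", "${hadoop.tmp.dir}/dfs/data");
        System.out.println(expand(props.get("dfs.data.dir"), props)); // prints /tmp/hadoop-alice/dfs/data
    }
}
```

So overriding hadoop.tmp.dir alone moves both the intermediate files and, unless dfs.data.dir is set separately, the DFS block storage.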

The directory that a job outputs to is set by the mapred.output.dir config variable of the JobConf and is usually set programmatically using the FileOutputFormat.setOutputPath method. When not set, the directory defaults to the current working directory, which for local means the directory that Nutch was started in.

Dennis

Sunnyvale Fl wrote:
Hi all,

I have a question about where Nutch and Hadoop write intermediate files to.
I installed Nutch in a location where it does not have write permission.
I'd like the intermediate files (e.g. tmp index files during merge or
invertlink etc) to be written to a different location where Nutch and Hadoop
have write access.  I know some Hadoop intermediate files are written to
/tmp/ which is fine, but Nutch isn't fully configurable as far as the
working directory is concerned.  It seems to always write to the base dir of
the installation, except for a few commands such as "merge" where you can
specify the working dir.

Any insight?  Am I missing some major configuration xml files?

Thanks!
GL
