Phantom wrote:
(1) Set my fs.default.name to hdfs://<host>:<port> and also specify it in the
JobConf configuration. Copy my sample input file into HDFS using
"bin/hadoop fs -put" from my local file system. I then need to specify this
file to my WordCount sample as input. Should I specify this file with an
hdfs:// prefix?

(2) Set my fs.default.name to file://<host>:<port> and also specify it in the
JobConf configuration. Just specify the input path to the WordCount sample,
and everything should work if the path is available to all machines in the
cluster?

Which way should I go?

Either should work. So should a third option, which is to have your job input in a non-default filesystem, but there is currently a bug that prevents that from working. The above two should work, though. The second assumes that the input is available at the same path in the native filesystem on every node.
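
As a minimal sketch of the second option with the old mapred JobConf API (the shared mount path below is hypothetical, e.g. an NFS export mounted at the same location on every node):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class LocalFsJobSetup {
  public static void main(String[] args) {
    JobConf conf = new JobConf(LocalFsJobSetup.class);

    // Option (2): use the native (local) filesystem as the default.
    conf.set("fs.default.name", "file:///");

    // Hypothetical shared mount; the same path must exist on every node,
    // e.g. an NFS export mounted identically across the cluster.
    FileInputFormat.setInputPaths(conf, new Path("/mnt/shared/wordcount/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/mnt/shared/wordcount/output"));

    // ... set the WordCount mapper/reducer and run with JobClient.runJob(conf)
  }
}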

When naming files in the default filesystem you do not need to specify their filesystem, since it is the default, but it is not an error to specify it.
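
For example, a small sketch (hypothetical namenode address and paths, and it assumes a running HDFS at that address) showing that a bare path and a fully qualified path name the same file once resolved against the default filesystem:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsPaths {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical namenode address; normally picked up from hadoop-site.xml.
    conf.set("fs.default.name", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    // Both forms name the same file once qualified against the default filesystem.
    Path bare = new Path("/user/phantom/input.txt");
    Path explicit = new Path("hdfs://namenode:9000/user/phantom/input.txt");

    System.out.println(bare.makeQualified(fs));     // hdfs://namenode:9000/user/phantom/input.txt
    System.out.println(explicit.makeQualified(fs)); // same as above
  }
}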

The most common mode of distributed operation is (1): use an HDFS filesystem as your fs.default.name, copy your initial input into that filesystem with 'bin/hadoop fs -put localPath hdfsPath', then specify 'hdfsPath' as your job's input. The "hdfs://host:port" prefix is not required at this point, since it is the default.
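
For illustration, a minimal driver sketch of this mode using the old mapred API; the namenode address, paths, and class name are hypothetical, and the actual WordCount mapper and reducer classes would be set where noted:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // Option (1): HDFS is the default filesystem (hypothetical namenode address);
    // normally this comes from hadoop-site.xml rather than being set in code.
    conf.set("fs.default.name", "hdfs://namenode:9000");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // ... set the WordCount mapper and reducer classes here ...

    // Input was copied in beforehand with: bin/hadoop fs -put localPath /user/phantom/input
    // No hdfs:// prefix is needed, since HDFS is the default filesystem.
    FileInputFormat.setInputPaths(conf, new Path("/user/phantom/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/phantom/output"));

    JobClient.runJob(conf);
  }
}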

Doug


