Phantom wrote:
(1) Set my fs.default.name to hdfs://<host>:<port> and also specify it
in the JobConf configuration. Copy my sample input file into HDFS using
"bin/hadoop fs -put" from my local file system. I then need to specify
this file to my WordCount sample as input. Should I specify this file
with the hdfs:// prefix?
(2) Set my fs.default.name to file://<host>:<port> and also specify it
in the JobConf configuration. Just specify the input path to the
WordCount sample, and everything should work as long as the path is
available to all machines in the cluster?
Which way should I go?
Either should work. So should a third option, which is to keep your job
input in a non-default filesystem, but there's currently a bug that
prevents that from working. The second option assumes that the input is
available at the same path in the native filesystem on all nodes.
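A minimal sketch of option (2), assuming the classic
org.apache.hadoop.mapred (JobConf) API; the paths are placeholders on a
filesystem mounted identically on every node, and the mapper/reducer
setup is omitted since it's the same as in the stock WordCount example:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class LocalFsWordCount {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(LocalFsWordCount.class);
      conf.setJobName("wordcount-local");

      // Option (2): make the local filesystem the default filesystem.
      // "file:///" is the usual URI form for the local filesystem.
      conf.set("fs.default.name", "file:///");

      // Mapper/reducer/output-type setup omitted (same as stock WordCount).

      // These paths must be visible at the same location on every node,
      // e.g. an NFS mount.  Placeholder paths.
      FileInputFormat.setInputPaths(conf, new Path("/shared/wordcount/input"));
      FileOutputFormat.setOutputPath(conf, new Path("/shared/wordcount/output"));

      JobClient.runJob(conf);
    }
  }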
When naming files in the default filesystem you do not need to specify
their filesystem, since it is the default, but it is not an error to
specify it.
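To illustrate, a small snippet (hypothetical namenode address
hdfs://host:9000, and it assumes a running HDFS to talk to): with HDFS
as the default filesystem, a path given with or without the hdfs://
prefix qualifies to the same location.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class DefaultFsPaths {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.set("fs.default.name", "hdfs://host:9000");  // hypothetical namenode

      FileSystem fs = FileSystem.get(conf);

      // Both names refer to the same file when HDFS is the default filesystem.
      Path withScheme    = new Path("hdfs://host:9000/user/me/input.txt");
      Path withoutScheme = new Path("/user/me/input.txt");

      // Both should print hdfs://host:9000/user/me/input.txt
      System.out.println(fs.makeQualified(withScheme));
      System.out.println(fs.makeQualified(withoutScheme));
    }
  }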
The most common mode of distributed operation is (1): use an HDFS
filesystem as your fs.default.name, copy your initial input into that
filesystem with 'bin/hadoop fs -put localPath hdfsPath', then specify
'hdfsPath' as your job's input. The "hdfs://host:port" is not required
at this point, since it is the default.
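Putting mode (1) together, a hedged sketch with the same JobConf-style
API: 'hdfsPath' and the output path are placeholders, the namenode
address is hypothetical, and fs.default.name would normally already be
set in your site configuration rather than in code.

  // From the shell, first copy the input in:
  //   bin/hadoop fs -put localPath hdfsPath

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class HdfsWordCount {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(HdfsWordCount.class);
      conf.setJobName("wordcount-hdfs");

      // Usually picked up from the site configuration; shown here for clarity.
      conf.set("fs.default.name", "hdfs://host:9000");  // hypothetical namenode

      // Mapper/reducer/output-type setup omitted (same as stock WordCount).

      // No "hdfs://host:port" prefix needed: HDFS is the default filesystem.
      FileInputFormat.setInputPaths(conf, new Path("hdfsPath"));
      FileOutputFormat.setOutputPath(conf, new Path("hdfsOutput"));

      JobClient.runJob(conf);
    }
  }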
Doug