Thanks Doug!
Once my application is ready and working, I will mostly shift to HDFS.
But that could be a big change, as the program that creates those files
would have to be modified. One good thing for me is that the data is
currently classified and split into separate files, which are then
available from different machines via NFS mounts.
So here is another question from me:

I read in the Hadoop docs that the task scheduler tries to execute tasks
close to the data. Can this functionality be used without HDFS? If so, how?

~ Neeraj

On 6/11/07, Doug Cutting <[EMAIL PROTECTED]> wrote:

Neeraj Mahajan wrote:
> But I do not want to create an HDFS as I already have the data available
> on all the machines, and I do not want to transfer the data again to the
> new file system. Is it possible to skip HDFS but use the MapReduce
> functionality? Any idea what would have to be done?

Hadoop requires that input paths are universal across nodes.  So if you
have data that is accessible from all nodes through the local filesystem
(either by copying it there or via NFS mounts), and it is accessible
through the same path on every node, Hadoop should work fine: the data
named by file:///my_data/foo/bar should be the same on all hosts.
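
For example, here is a minimal sketch of a job driver that reads its input
from a file:// path, using the org.apache.hadoop.mapred API (the paths and
class name below are illustrative, not taken from your setup):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class LocalFsJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LocalFsJob.class);
        conf.setJobName("local-fs-input");

        // Input and output live on the local filesystem / NFS mount; the
        // input path must resolve to the same data on every node.
        FileInputFormat.setInputPaths(conf, new Path("file:///my_data/foo/bar"));
        FileOutputFormat.setOutputPath(conf, new Path("file:///my_data/output"));

        // Identity map/reduce by default; TextInputFormat supplies
        // LongWritable byte offsets as keys and Text lines as values.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        JobClient.runJob(conf);
      }
    }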

That said, accessing data over NFS will probably be slower than over
HDFS.  If the data resides on only a small subset of your nodes then
these nodes could become overloaded.  As a general rule, if you're going
to touch the data more than once, and have room, it would probably be a
good idea to copy it into an HDFS filesystem.
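
If you do copy, something along the lines of the FileSystem API call below
would do it (the destination path is just an example, and this assumes
fs.default.name in your configuration points at the HDFS namenode):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyIntoHdfs {
      public static void main(String[] args) throws Exception {
        // FileSystem.get() returns HDFS when fs.default.name points at it.
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        // One-time copy from the local/NFS path into HDFS; afterwards jobs
        // read replicated blocks instead of going through the NFS server.
        hdfs.copyFromLocalFile(new Path("/my_data/foo/bar"),
                               new Path("/user/neeraj/foo/bar"));
      }
    }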

Doug
