Thanks Doug! Once my application is ready and working, I will most likely shift to HDFS. But that could be a big change, as the program that creates those files would have to be modified. One good thing for me is that the data is already classified and split into separate files, which are then available on different machines via NFS mounts. So another question from me is as follows:
I read in the Hadoop docs that the task scheduler tries to execute each task close to its data. Can this functionality be used without HDFS? How?

~ Neeraj

On 6/11/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
Neeraj Mahajan wrote:
> But I do not want to create HDFS as I already have the data available on
> all the machine and I do not want to again transfer the data to the new
> file system. Is it possible to skip HDFS but use the MapReduce
> functionality? Any idea what would have to be done?

Hadoop requires that input paths are universal across nodes. So if you
have data that is accessible from all nodes through the local filesystem
(either by copying it there or via nfs mounts) then, so long as it is
accessible through the same path on all nodes, Hadoop should work fine:
the data named by file:///my_data/foo/bar should be the same on all hosts.

That said, accessing data over NFS will probably be slower than over HDFS.
If the data resides on only a small subset of your nodes then these nodes
could become overloaded. As a general rule, if you're going to touch the
data more than once, and have room, it would probably be a good idea to
copy it into an HDFS filesystem.

Doug
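As a minimal sketch of what Doug describes, a job can be pointed at a locally mounted path simply by giving its input and output as file:/// URIs instead of HDFS URIs. The class name LocalPathJob and the paths below are hypothetical and use the classic org.apache.hadoop.mapred API; only the URI scheme is the point here.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LocalPathJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LocalPathJob.class);
    conf.setJobName("local-path-example");

    // file:/// URIs bypass HDFS; the same path must exist (e.g. via the
    // same NFS mount) on every node that may run a task.
    FileInputFormat.setInputPaths(conf, new Path("file:///my_data/foo/bar"));
    FileOutputFormat.setOutputPath(conf, new Path("file:///my_data/output"));

    // No mapper or reducer is set, so the identity mapper and reducer are
    // used and the job just passes input records through to the output.
    JobClient.runJob(conf);
  }
}

A job configured this way reads and writes entirely through the shared mount; how well it performs depends on how many nodes hit the same NFS server at once, as Doug notes above.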
