Re: Using Hadoop as a shared file system

Fernando Padilla Mon, 11 Feb 2008 14:45:13 -0800

Have you put any thought into WebDAV?  Or did you write that off as well?



Albert Strasheim wrote:

Hello all
I have a slightly unusual use case that I hope I can use Hadoop for (maybe after writing a bit of code).
My setup looks as follows:
I have one machine containing somewhere from 750 GB to 2 TB of data on a disk or some kind of RAID-0 array. I am in full control of this machine. I have a backup of the part of the data that I haven't generated (maybe 30% of the data, the rest is calculated from this original data).
I have another disparate collection of machines running any number of operating systems (Windows, Linux, Solaris, etc.). I might not have administrative access on these machines. I can probably run a JVM on them. These machines should not be considered reliable in any sense of the word. Hosting an HDFS on them would be a bad idea -- on a bad day all of them might be down, they might get reformatted, you name it.
I need to process this data using legacy applications (e.g., C++ programs, MATLAB scripts, Python scripts) and Java applications.
The legacy applications typically perform operations that would be hard to parallelise (e.g. a program that can't easily be compiled on all the different machines, MATLAB licenses not being available for all the machines, etc.), and as such I would like to run these legacy apps only on the machine that directly has access to the data. I would prefer if I didn't have to teach these legacy apps about getting data out of HDFS.
Through the magic of the JVM, I can consider parallelising the processing of the data done with Java programs.
These Java programs (parallelised with Hadoop's MapReduce, or GridGain, or whatever) need an easy way to access the data on this single machine.
Setting up a traditional shared file system (NFS and Samba come to mind) would be a pain for various reasons. A shared file system probably wouldn't be a bottleneck, since the amount of processing time required typically far outweighs the time it takes to access the data over the network (at least for the number of nodes I'm dealing with).
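As a rough sanity check of the bottleneck claim, here is a back-of-the-envelope sketch. The 1 Gbit/s link speed is an assumption for illustration, not a measurement, and GB/TB are taken as decimal units; only the 750 GB and 2 TB figures come from the setup described above.

```python
# Back-of-the-envelope estimate: how long does one full read of the data
# set take when it is pulled off the single data machine over the network?
# ASSUMPTION: a 1 Gbit/s link running at full line rate (hypothetical).

LINK_BITS_PER_SECOND = 1e9  # assumed, not measured


def transfer_hours(data_bytes, link_bits_per_second=LINK_BITS_PER_SECOND):
    """Return the hours needed to move data_bytes once over the link."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / 3600


# Data set sizes from the setup above, in decimal bytes.
for label, size_bytes in [("750 GB", 750e9), ("2 TB", 2e12)]:
    print(f"{label}: about {transfer_hours(size_bytes):.1f} hours per full pass")
```

If one processing pass over the data takes days, a network read of a few hours is a small fraction of the total. If a pass takes minutes, the single machine's link becomes the limit instead.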
What I'm hoping to do is use Hadoop as my shared file system.
I would imagine that I would have to run the equivalent of a namenode+datanode+etc. on the machine that has direct access to the data so that it appears as an HDFS to the other machines. Making a copy of the data into an HDFS instance isn't really an option, due to the size of the data I'm dealing with, so I'm thinking along the lines of exposing a LocalFileSystem on this machine as a DistributedFileSystem to the other machines.
This has the added advantage that if I ever do get a stable HDFS setup going, my programs will be ready to deal with it.
Is something like this doable already? If not, where would one need to start filling in code to make it possible?
Thanks for reading.

Regards,

Albert
