Thank you for the explanations, Bobby. That helps significantly. I also read
the article below, which gave me a better understanding of the relative merits
of MapReduce/Hadoop vs MPI. Alberto, you might find it useful too.

http://grids.ucs.indiana.edu/ptliupages/publications/CloudsandMR.pdf

There is even a MapReduce API built on top of MPI (MR-MPI) developed at
Sandia.
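From skimming its documentation, the skeleton of an MR-MPI program seems to be
roughly the following. I haven't built or run this, so treat the callback
signatures, the task count, and the key/value handling as my best reading of
the docs rather than tested code:

#include <mpi.h>
#include "mapreduce.h"
#include "keyvalue.h"
using namespace MAPREDUCE_NS;

// each map task emits a single ("count", 1) pair
void mymap(int itask, KeyValue *kv, void *ptr)
{
  int one = 1;
  kv->add((char *) "count", 6, (char *) &one, sizeof(int));
}

// sum all the values that were collated under one key
void myreduce(char *key, int keybytes, char *multivalue,
              int nvalues, int *valuebytes, KeyValue *kv, void *ptr)
{
  int sum = 0;
  char *value = multivalue;
  for (int i = 0; i < nvalues; i++) {
    sum += *(int *) value;
    value += valuebytes[i];
  }
  kv->add(key, keybytes, (char *) &sum, sizeof(int));
}

int main(int narg, char **args)
{
  MPI_Init(&narg, &args);
  MapReduce *mr = new MapReduce(MPI_COMM_WORLD);
  mr->map(100, &mymap, NULL);   // 100 map tasks spread across the MPI ranks
  mr->collate(NULL);            // group values by key across processors
  mr->reduce(&myreduce, NULL);  // one reduce call per unique key
  delete mr;
  MPI_Finalize();
  return 0;
}

You would build it against the mrmpi library and launch it with mpirun like
any other MPI program, as far as I can tell.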
So many options to choose from :-)

Cheers,
Parker

> From: [email protected]
> To: [email protected]
> Date: Mon, 12 Sep 2011 14:02:44 -0700
> Subject: Re: Is Hadoop the right platform for my HPC application?
>
> Parker,
>
> The hadoop command itself is just a shell script that sets up your classpath
> and some environment variables for a JVM. Hadoop provides a Java API, and
> you should be able to use it to write your application without dealing with
> the command line. That being said, there is no Map/Reduce C/C++ API. There
> is libhdfs.so, which will allow you to read/write HDFS files from a C/C++
> program, but it actually launches a JVM behind the scenes to handle the
> actual requests.
>
> As for a way to avoid writing your input data into files: the data has to be
> distributed to the compute nodes somehow. You could write a custom input
> format that does not use any input files, and then have it load the data a
> different way. I believe that some people do this to load data from MySQL or
> some other DB for processing. Similarly, you could do something with the
> output format to put the data someplace else.
>
> It is hard to say if Hadoop is the right platform without more information
> about what you are doing. Hadoop has been used for lots of embarrassingly
> parallel problems. The processing is easy; the real question is where your
> data is coming from and where the results are going. Map/Reduce is fast in
> part because it tries to reduce data movement and move the computation to
> the data, not the other way round. Without knowing the expected size of your
> data or the amount of processing that it will do, it is hard to say.
>
> --Bobby Evans
>
> On 9/12/11 5:09 AM, "Parker Jones" <[email protected]> wrote:
>
> Hello all,
>
> I have Hadoop up and running and an embarrassingly parallel problem, but I
> can't figure out how to arrange the problem. My apologies in advance if this
> is obvious and I'm not getting it.
>
> My HPC application isn't a batch program, but runs in a continuous loop
> (like a server) *outside* of the Hadoop machines, and it should occasionally
> farm out a large computation to Hadoop and use the results. However, all the
> examples I have come across interact with Hadoop via files and the command
> line. (Perhaps I am looking in the wrong places?)
>
> So,
> * is Hadoop the right platform for this kind of problem?
> * is it possible to use Hadoop without going through the command line and
>   writing all input data to files?
>
> If so, could someone point me to some examples and documentation? I am
> coding in C/C++ in case that is relevant, but examples in any language
> should be helpful.
>
> Thanks for any suggestions,
> Parker
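P.S. For anyone else following this in C/C++: the libhdfs route Bobby
describes above looks roughly like the sketch below. This is untested on my
side, the path is made up, and as Bobby notes it launches a JVM behind the
scenes, so the Hadoop jars need to be on the CLASSPATH and libjvm available
when you link and run.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include "hdfs.h"

int main(void)
{
  /* "default" picks the namenode up from the Hadoop config on the classpath */
  hdfsFS fs = hdfsConnect("default", 0);
  if (!fs) { fprintf(stderr, "failed to connect to HDFS\n"); return 1; }

  const char *path = "/tmp/example.txt";   /* hypothetical path */

  /* write a small file */
  hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY | O_CREAT, 0, 0, 0);
  const char *msg = "hello from libhdfs\n";
  hdfsWrite(fs, out, (void *) msg, strlen(msg));
  hdfsFlush(fs, out);
  hdfsCloseFile(fs, out);

  /* read it back */
  char buf[64];
  hdfsFile in = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
  tSize n = hdfsRead(fs, in, buf, sizeof(buf) - 1);
  buf[n > 0 ? n : 0] = '\0';
  printf("read back: %s", buf);
  hdfsCloseFile(fs, in);

  hdfsDisconnect(fs);
  return 0;
}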
