Thank you for the explanations, Bobby.  That helps significantly.

I also read the article below, which gave me a better understanding of the 
relative merits of MapReduce/Hadoop vs. MPI.  Alberto, you might find it 
useful too.
http://grids.ucs.indiana.edu/ptliupages/publications/CloudsandMR.pdf

There is even a MapReduce API built on top of MPI, developed at Sandia.

So many options to choose from :-)

Cheers,
Parker

> From: [email protected]
> To: [email protected]
> Date: Mon, 12 Sep 2011 14:02:44 -0700
> Subject: Re: Is Hadoop the right platform for my HPC application?
> 
> Parker,
> 
> The hadoop command itself is just a shell script that sets up your classpath 
> and some environment variables for a JVM.  Hadoop provides a Java API that 
> you should be able to use to write your application without dealing with the 
> command line.  That being said, there is no Map/Reduce C/C++ API.  There is 
> libhdfs.so, which will allow you to read/write HDFS files from a C/C++ 
> program, but it launches a JVM behind the scenes to handle the actual 
> requests.
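> 
> For what it's worth, here is a minimal sketch of what driving a job from 
> Java looks like, with no command line involved.  MyMapper, MyReducer, and 
> the paths are placeholders I made up, not anything that ships with Hadoop:
> 
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.IntWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Job;
>     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> 
>     public class Driver {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         Job job = new Job(conf, "my-hpc-job");
>         job.setJarByClass(Driver.class);
>         job.setMapperClass(MyMapper.class);    // your map logic goes here
>         job.setReducerClass(MyReducer.class);  // your reduce logic goes here
>         job.setOutputKeyClass(Text.class);
>         job.setOutputValueClass(IntWritable.class);
>         FileInputFormat.addInputPath(job, new Path("/user/parker/in"));
>         FileOutputFormat.setOutputPath(job, new Path("/user/parker/out"));
>         // Blocks until the job finishes; returns true on success.
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>       }
>     }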
> 
> As for a way to avoid writing your input data into files: the data has to be 
> distributed to the compute nodes somehow.  You could write a custom 
> InputFormat that does not use any input files, and then have it load the 
> data a different way.  I believe some people do this to load data from MySQL 
> or some other DB for processing.  Similarly, you could do something with the 
> OutputFormat to put the results someplace else.
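> 
> A bare-bones sketch of that idea, with the class names and counts made up 
> purely for illustration.  It hands each map task a range of synthetic keys 
> instead of reading any files:
> 
>     import java.io.DataInput;
>     import java.io.DataOutput;
>     import java.io.IOException;
>     import java.util.ArrayList;
>     import java.util.List;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.NullWritable;
>     import org.apache.hadoop.io.Writable;
>     import org.apache.hadoop.mapreduce.InputFormat;
>     import org.apache.hadoop.mapreduce.InputSplit;
>     import org.apache.hadoop.mapreduce.JobContext;
>     import org.apache.hadoop.mapreduce.RecordReader;
>     import org.apache.hadoop.mapreduce.TaskAttemptContext;
> 
>     /** Feeds each map task a range of synthetic keys; no input files. */
>     public class GeneratedInputFormat
>         extends InputFormat<LongWritable, NullWritable> {
> 
>       /** A split carrying just a record count (splits must be Writable). */
>       public static class CountSplit extends InputSplit implements Writable {
>         long count;
>         public CountSplit() {}                   // needed for reflection
>         CountSplit(long count) { this.count = count; }
>         public long getLength() { return count; }
>         public String[] getLocations() { return new String[0]; }
>         public void write(DataOutput out) throws IOException {
>           out.writeLong(count);
>         }
>         public void readFields(DataInput in) throws IOException {
>           count = in.readLong();
>         }
>       }
> 
>       public List<InputSplit> getSplits(JobContext context) {
>         // One split per desired map task; 4 tasks of 1000 records each
>         // here, purely as an example.
>         List<InputSplit> splits = new ArrayList<InputSplit>();
>         for (int i = 0; i < 4; i++) {
>           splits.add(new CountSplit(1000));
>         }
>         return splits;
>       }
> 
>       public RecordReader<LongWritable, NullWritable> createRecordReader(
>           InputSplit split, TaskAttemptContext context) {
>         return new RecordReader<LongWritable, NullWritable>() {
>           private long total, next;
>           private final LongWritable key = new LongWritable();
>           public void initialize(InputSplit s, TaskAttemptContext c) {
>             total = ((CountSplit) s).count;
>           }
>           public boolean nextKeyValue() {
>             if (next >= total) return false;
>             key.set(next++);
>             return true;
>           }
>           public LongWritable getCurrentKey() { return key; }
>           public NullWritable getCurrentValue() {
>             return NullWritable.get();
>           }
>           public float getProgress() {
>             return total == 0 ? 1.0f : (float) next / total;
>           }
>           public void close() {}
>         };
>       }
>     }
> 
> Your driver would then call job.setInputFormatClass(GeneratedInputFormat.class) 
> rather than pointing FileInputFormat at a path.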
> 
> It is hard to say whether Hadoop is the right platform without more 
> information about what you are doing.  Hadoop has been used for lots of 
> embarrassingly parallel problems.  The processing is the easy part; the real 
> question is where your data is coming from and where the results are going.  
> Map/Reduce is fast in part because it tries to reduce data movement by 
> moving the computation to the data, not the other way around.  Without 
> knowing the expected size of your data or the amount of processing involved, 
> it is hard to give a definitive answer.
> 
> --Bobby Evans
> 
> On 9/12/11 5:09 AM, "Parker Jones" <[email protected]> wrote:
> 
> Hello all,
> 
> I have Hadoop up and running and an embarrassingly parallel problem, but I 
> can't figure out how to structure the problem for Hadoop.  My apologies in 
> advance if this is obvious and I'm just not seeing it.
> 
> My HPC application isn't a batch program; it runs in a continuous loop (like 
> a server) *outside* of the Hadoop machines, and it should occasionally farm 
> out a large computation to Hadoop and use the results.  However, all the 
> examples I have come across interact with Hadoop via files and the command 
> line.  (Perhaps I am looking in the wrong places?)
> 
> So,
> * is Hadoop the right platform for this kind of problem?
> * is it possible to use Hadoop without going through the command line and 
> writing all input data to files?
> 
> If so, could someone point me to some examples and documentation?  I am 
> coding in C/C++, in case that is relevant, but examples in any language 
> should be helpful.
> 
> Thanks for any suggestions,
> Parker
> 
> 
> 