I am also interested in this option, since I will probably be hacking on something like this in the next few weeks.

I am also curious whether you can run MR jobs in-process rather than launching a new process each time. The scenario is when initialization takes far too long for a map-reduce shard to be executed in this model. For example, say you are trying to compute the top n terms within a set of documents, where the top n are the rarest terms according to some model corpus. Perhaps you have a df index, or perhaps you have a huge NLP engine that is used for entity extraction. Either of these needs a chunk of memory and a chunk of time to initialize on each pass.

Here, of course, you would need not only to specify the job but also to somehow constrain the candidate nodes to those that are able to run it.
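
To make the initialization-cost concern concrete, here is a minimal sketch in plain Java of what paying the setup cost once per JVM could look like, assuming tasks really do share a process. ExpensiveModel and TaskSketch are hypothetical names used only for illustration, not Hadoop classes, and score() is a placeholder for the real lookup.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive resource such as a df index or an NLP engine.
final class ExpensiveModel {
    // Counts how many times the costly constructor has actually run.
    static final AtomicInteger INIT_COUNT = new AtomicInteger();

    private ExpensiveModel() {
        INIT_COUNT.incrementAndGet();
        // Real code would load the index or model into memory here.
    }

    // Initialization-on-demand holder: the JVM initializes this nested class
    // once, thread-safely, the first time get() is called.
    private static final class Holder {
        static final ExpensiveModel INSTANCE = new ExpensiveModel();
    }

    static ExpensiveModel get() {
        return Holder.INSTANCE;
    }

    // Placeholder for the per-record work; here it just returns the term length.
    int score(String term) {
        return term.length();
    }
}

// Hypothetical task body (an assumption for illustration, not a Hadoop Mapper).
final class TaskSketch {
    static int runTask(String[] terms) {
        // Every task asks for the model, but only the first call pays for setup.
        ExpensiveModel model = ExpensiveModel.get();
        int total = 0;
        for (String term : terms) {
            total += model.score(term);
        }
        return total;
    }
}
```

Calling TaskSketch.runTask several times in the same JVM constructs the model only once. This only helps if successive tasks actually land in the same process, which is exactly the open question above.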

C

On Jun 12, 2008, at 2:02 AM, Robert Krüger wrote:


Hi,

For our developers I would like to write a few lines of Java code that, given a base directory, set up an HDFS filesystem, initialize it if it is not there yet, and then start the service(s) in-process. This is to run on each developer's machine, probably within a Tomcat instance. I would rather not do this in a bunch of shell scripts if I don't have to.

Could anyone point to code samples that do similar things, or give any other hints that would make this easier than looking at what the command-line tools do and reverse engineering it from there?

Thanks in advance,

Robert
