I am also interested in this option, since I will probably be
hacking at such a thing in the next few weeks.
I am also curious whether you can run MR jobs in process rather than
launching them each time. The scenario is one where initialization takes
far too long for a MapReduce shard to be executed in this model. For
example, say you are trying to compute the top n terms within a set of
documents, where the top n are the rarest terms according to some model
corpus. Perhaps you have a df (document frequency) index, or perhaps you
have a huge NLP engine that is used for entity extraction. Either of
these needs a chunk of memory and a chunk of time to initialize on each
pass.
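
To make the scenario concrete, here is a minimal sketch in plain Java (no Hadoop classes) of the pattern I have in mind: the expensive resource is loaded once per JVM and reused by every later call. The class name RareTerms, the tiny hard-coded df table, and the loadCount counter are all hypothetical stand-ins for illustration. A real df index or NLP engine would be loaded from disk, and whether successive tasks actually share a JVM depends on how the jobs are launched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RareTerms {
    // Counts how often the expensive load ran (illustration only).
    static int loadCount = 0;

    // Holder idiom: the JVM initializes DF once, on first access,
    // so every later call in the same process reuses it.
    private static final class DfIndexHolder {
        static final Map<String, Integer> DF = loadDfIndex();
    }

    // Hypothetical stand-in for the slow initialization step.
    private static Map<String, Integer> loadDfIndex() {
        loadCount++;
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 1000);
        df.put("hadoop", 40);
        df.put("shard", 12);
        df.put("krill", 2);
        return df;
    }

    // Returns up to n terms of the document that are rarest in the
    // model corpus (lowest df first); unknown terms are skipped.
    static List<String> topRarest(Set<String> docTerms, int n) {
        Map<String, Integer> df = DfIndexHolder.DF;
        List<String> known = new ArrayList<>();
        for (String term : docTerms) {
            if (df.containsKey(term)) {
                known.add(term);
            }
        }
        // Sort by df ascending, breaking ties alphabetically.
        known.sort(Comparator.comparingInt((String t) -> df.get(t))
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(known.subList(0, Math.min(n, known.size())));
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("the", "hadoop", "krill"));
        System.out.println(topRarest(doc, 2));
        System.out.println("index loads: " + loadCount);
    }
}
```

The point of the holder class is that the cost is paid on the first call only. If every shard starts in a fresh process, that benefit disappears, which is exactly the problem described above.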
Here, of course, you would need not only to specify the job but also to
somehow constrain the candidate nodes it can run on, based on their
ability to run it.
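
As a rough illustration of what such a constraint could look like, the sketch below filters a list of nodes down to those that either already hold the required resource in memory or have enough free memory to load it. Everything here (NodeFilter, Node, eligibleHosts, the resource name "df-index", and the memory figures) is hypothetical and is not part of any Hadoop scheduler API. It only shows the shape of the check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NodeFilter {
    // Hypothetical description of what a node advertises about itself.
    static final class Node {
        final String host;
        final long freeMemoryMb;
        final Set<String> loadedResources;

        Node(String host, long freeMemoryMb, String... loadedResources) {
            this.host = host;
            this.freeMemoryMb = freeMemoryMb;
            this.loadedResources = new HashSet<>(Arrays.asList(loadedResources));
        }
    }

    // A node is eligible if the resource is already loaded there,
    // or if it has enough free memory to load it.
    static List<String> eligibleHosts(List<Node> nodes,
                                      String requiredResource,
                                      long minFreeMemoryMb) {
        List<String> hosts = new ArrayList<>();
        for (Node node : nodes) {
            boolean alreadyLoaded = node.loadedResources.contains(requiredResource);
            if (alreadyLoaded || node.freeMemoryMb >= minFreeMemoryMb) {
                hosts.add(node.host);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new Node("node-a", 512, "df-index"),
                new Node("node-b", 8192),
                new Node("node-c", 256));
        System.out.println(eligibleHosts(nodes, "df-index", 4096));
    }
}
```

Preferring nodes that already have the resource loaded would be a natural next step, since those avoid the initialization cost entirely.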
C
On Jun 12, 2008, at 2:02 AM, Robert Krüger wrote:
Hi,
for our developers I would like to write a few lines of Java code
that, given a base directory, sets up an HDFS filesystem,
initializes it if it is not there yet, and then starts the
service(s) in process. This is to run on each developer's machine,
probably within a tomcat instance. I don't want to do this (if I
don't have to) in a bunch of shell scripts.
Could anyone point to code samples that do similar things, or give
any other hints that would make this easier than looking at what the
command-line tools do and reverse engineering it from there?
Thanks in advance,
Robert