Pedro Guedes wrote:
For this I need to be able to register new steps in my chain and pass
them to Hadoop to execute as a MapReduce job. I see two choices here:
1 - build a .job archive (main class: mycrawler, submitting jobs through
JobClient) with my new steps and their dependencies in the 'lib/'
directory, include my 'crawling-chain.xml' in the .job (to pass the
chain configuration to my crawler nodes), and then run it with the
RunJar utility in a new thread (so that I have a clean classpath, right?).
2 - in a new thread, configure my classpath to include the classes
needed by the crawling chain, write my crawling-chain to HDFS so that the
nodes can read it at job execution, and then submit jobs through
JobClient. When the MapReduce job starts, the nodes would first read the
crawling-chain from HDFS and then execute it in map or reduce.

Have I got it right? Which one sounds better?

If I understand your question, I think (1) is preferable. The MapReduce system copies the job jar into HDFS, and the nodes then read it from there when running tasks. This is optimized in several ways (file replication, caching, etc.) and is thus probably superior to implementing something similar yourself, as described in (2).
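
For what it's worth, here is a rough sketch of (1) against the old mapred API (roughly 0.17-era); the 'crawler.job' name and the input/output paths are placeholders, and IdentityMapper just stands in for your crawl-step mapper:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MyCrawler {
  public static void main(String[] args) throws Exception {
    // new JobConf(MyCrawler.class) finds the jar containing this class;
    // setJar() names the archive explicitly instead.
    JobConf conf = new JobConf(MyCrawler.class);
    conf.setJobName("crawl-step");
    conf.setJar("crawler.job");

    // IdentityMapper is only a stand-in for your crawl-step mapper.
    conf.setMapperClass(IdentityMapper.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submits the job and blocks until it finishes; the framework copies
    // crawler.job into HDFS and unpacks it on each node for the tasks.
    JobClient.runJob(conf);
  }
}

Running the archive with 'bin/hadoop jar crawler.job <in> <out>' goes through the same RunJar utility you mention, so invoking RunJar from your own thread amounts to the same thing.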

Another question: if I use setJar on the JobConf, will Hadoop
include 'lib/*.jar' in the job.jar it sends to the nodes?

Hadoop sends the entire job jar file. Tasks are run from a directory where the job jar has been unpacked. The classpath contains two directories from that jar, the top-level directory and the 'classes/' directory, plus all jar files contained in the 'lib/' directory.
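
So if 'crawling-chain.xml' sits at the top level of the .job, your tasks can read it straight off the classpath rather than fetching it from HDFS themselves. A rough sketch, again assuming the 0.17-era mapred API (CrawlMapper and the parsing step are yours, not Hadoop's):

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CrawlMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private InputStream chainConfig;

  public void configure(JobConf job) {
    // The top level of the unpacked job jar is on the task's classpath,
    // so the chain definition can be read as an ordinary resource.
    chainConfig = getClass().getClassLoader()
        .getResourceAsStream("crawling-chain.xml");
    // ... parse the chain definition here (left out) ...
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... run the configured chain step on each input record ...
  }
}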

Doug
