Pedro Guedes wrote:
For this I need to be able to register new steps in my chain and pass
them to Hadoop to execute as a MapReduce job. I see two choices here:
1 - build a .job archive (main class: mycrawler, submitting jobs through
JobClient) with my new steps and their dependencies in the 'lib/'
directory, include my 'crawling-chain.xml' in the .job (to pass the
chain configuration to my crawler nodes), and then run it with the
RunJar utility in a new thread (so that I have a clean classpath, right?).
2 - in a new thread, configure my classpath to include the classes
needed by the crawling chain, write my crawling-chain to HDFS so that the
nodes can read it at job execution, and then submit jobs through
JobClient. When the MapReduce job starts, the nodes would first read the
crawling-chain from HDFS and then execute it in map or reduce.

Have I got it right? Which one sounds better?

If I understand your question, I think (1) is preferable. The MapReduce system copies the job jar into HDFS, and the nodes then read it from there when running tasks. This is optimized in several ways (file replication, caching, etc.) and is thus probably superior to implementing something similar yourself, as described in (2).
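
For what it's worth, here is a rough sketch of (1) against the old mapred API (roughly 0.17-era); the 'crawler.job' name and the input/output paths are placeholders, and IdentityMapper just stands in for your crawl-step mapper:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MyCrawler {
  public static void main(String[] args) throws Exception {
    // new JobConf(MyCrawler.class) finds the jar containing this class;
    // setJar() names the archive explicitly instead.
    JobConf conf = new JobConf(MyCrawler.class);
    conf.setJobName("crawl-step");
    conf.setJar("crawler.job");

    // IdentityMapper is only a stand-in for your crawl-step mapper.
    conf.setMapperClass(IdentityMapper.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submits the job and blocks until it finishes; the framework copies
    // crawler.job into HDFS and unpacks it on each node for the tasks.
    JobClient.runJob(conf);
  }
}

Running the archive with 'bin/hadoop jar crawler.job <in> <out>' goes through the same RunJar utility you mention, so invoking RunJar from your own thread amounts to the same thing.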

Another question: if I use setJar on the JobConf, will Hadoop
include 'lib/*.jar' in the job.jar it sends to the nodes?

Hadoop sends the entire job jar file. Tasks are run from a directory where the job jar has been unpacked. The classpath contains two directories from that jar, the top-level directory and the 'classes/' directory, plus all jar files contained in the 'lib/' directory.
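
So if 'crawling-chain.xml' sits at the top level of the .job, your tasks can read it straight off the classpath rather than fetching it from HDFS themselves. A rough sketch, again assuming the 0.17-era mapred API (CrawlMapper and the parsing step are yours, not Hadoop's):

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CrawlMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private InputStream chainConfig;

  public void configure(JobConf job) {
    // The top level of the unpacked job jar is on the task's classpath,
    // so the chain definition can be read as an ordinary resource.
    chainConfig = getClass().getClassLoader()
        .getResourceAsStream("crawling-chain.xml");
    // ... parse the chain definition here (left out) ...
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... run the configured chain step on each input record ...
  }
}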

Doug
