I'm trying to pass a "crawling chain" (the list of steps to execute while crawling a resource) as configuration, and then execute that chain for each resource I find in my crawl database (much like Nutch's crawldb).
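
To make it concrete, this is roughly the shape of a step and of the chain I have in mind (all names here are made up for this mail, not actual code from our tree):

    import java.util.ArrayList;
    import java.util.List;

    // Placeholder for whatever record we keep per URL in the crawl db.
    class CrawlResource {
        String url;
        byte[] content;
    }

    // One configurable step of the crawling chain.
    interface CrawlStep {
        void execute(CrawlResource resource) throws Exception;
    }

    // The chain itself: runs every registered step, in order, on one resource.
    class CrawlingChain {
        private final List<CrawlStep> steps = new ArrayList<CrawlStep>();

        public void addStep(CrawlStep step) {
            steps.add(step);
        }

        public void execute(CrawlResource resource) throws Exception {
            for (CrawlStep step : steps) {
                step.execute(resource);
            }
        }
    }

The 'crawling-chain.xml' would simply list the step implementation classes (and their settings) in order.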
For this I need to be able to register new steps in my chain and pass them to Hadoop to execute as a MapReduce job. I see two choices here:

1 - Build a .job archive (main-class: mycrawler, submits jobs through JobClient) with my new steps and their dependencies in the 'lib/' directory, include my 'crawling-chain.xml' in the .job (to pass the configuration of the chain to my crawler nodes), and then run it with the RunJar utility in a new thread (so that I get a clean classpath, right?).

2 - In a new thread, configure my classpath to include the classes needed by the crawling chain, write my crawling-chain to HDFS so that the nodes can read it on job execution, and then submit jobs through JobClient. When the MapReduce starts, the nodes would first read the crawling-chain from HDFS and then execute it in map or reduce (rough sketches of the map side and of the job submission are at the bottom of this mail).

Have I got it right? Which one sounds better?

Another question: if I use setJar on the JobConf, will Hadoop include 'lib/*.jar' in the job.jar it sends to the nodes?

Pedro

Michael Bieniosek wrote:
> I'm not sure exactly what you're trying to do, but you can specify command
> line parameters to hadoop -jar which you can interpret in your code. Your
> code can then write arbitrary config parameters before starting the
> mapreduce. Based on these configs, you can load specific jars in your
> mapreduce tasks. But I'm not sure why you'd need to do this, since you
> should be able to include any new code in the job jar you submit to hadoop.
>
> -Michael
>
> On 4/18/07 11:23 AM, "Pedro Guedes" <[EMAIL PROTECTED]> wrote:
>
>> I keep talking to myself... hope it doesn't annoy you too much!
>>
>> We thought of a solution to our problem in which we build a new .job
>> file, in accordance with our crawl configuration, and then pass it to
>> hadoop for execution... Is there somewhere I can look for the
>> specification of the .job format?
>>
>> Thanks again,
>>
>> Pedro
>>
>> I wrote:
>>
>>> Hi hadoopers,
>>>
>>> I'm working on an enterprise search engine that runs on a Hadoop
>>> cluster but is controlled from the outside. I managed to implement a
>>> simple crawler much like Nutch's...
>>> Now I have a new system requirement: the crawl process must be
>>> configurable outside Hadoop. This means that I should be able to add
>>> steps to the crawling process that the cluster would execute without
>>> knowing beforehand what they are... Since serialization is not
>>> possible, is there another way to achieve the same effect?
>>>
>>> Using Writable means I need the implementations to be on each node so
>>> they can read the object data from HDFS... but then I just get the same
>>> object and not a new implementation, right?
>>>
>>> Any thoughts will be appreciated,
>>>
>>> Pedro
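
Just to make option 2 clearer, here is more or less what I picture on the map side, reusing the CrawlingChain sketch from the top of this mail. The property name 'crawler.chain.path' and the buildChain() parsing are invented for illustration, and I'm writing against the pre-generics org.apache.hadoop.mapred API we run here (0.12):

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Option 2: each map task rebuilds the crawling chain from a file in HDFS.
    class CrawlMapper implements Mapper {

        private CrawlingChain chain;

        public void configure(JobConf job) {
            try {
                // The job submitter sets this (invented) property to wherever
                // it wrote crawling-chain.xml in HDFS before submitting the job.
                Path chainPath = new Path(job.get("crawler.chain.path"));
                FileSystem fs = FileSystem.get(job);
                InputStream in = fs.open(chainPath);
                try {
                    chain = buildChain(in);
                } finally {
                    in.close();
                }
            } catch (IOException e) {
                throw new RuntimeException("Could not load crawling chain", e);
            }
        }

        public void map(WritableComparable key, Writable value,
                        OutputCollector output, Reporter reporter) throws IOException {
            try {
                CrawlResource resource = new CrawlResource();
                resource.url = key.toString();
                chain.execute(resource);      // run every configured step
                output.collect(key, value);   // pass the record through unchanged
            } catch (Exception e) {
                throw new RuntimeException("Chain failed for " + key, e);
            }
        }

        public void close() throws IOException {
            // nothing to release in this sketch
        }

        // This is where crawling-chain.xml would be parsed and each step class
        // instantiated by reflection (Class.forName + newInstance) from the
        // task's classpath -- details left out here.
        private CrawlingChain buildChain(InputStream in) {
            return new CrawlingChain();
        }
    }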

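And on the submission side, again just a sketch: input/output paths and formats are left out, and 'crawler.job' plus the chain path are whatever my controller actually produces. It uses the CrawlMapper from the previous sketch:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CrawlerDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CrawlerDriver.class);
            conf.setJobName("crawl-step-execution");

            // Tell the tasks where the chain description lives in HDFS
            // (the same invented property the mapper reads in configure()).
            conf.set("crawler.chain.path", "/crawler/conf/crawling-chain.xml");

            // This is the setJar question: does Hadoop also ship lib/*.jar
            // from inside this archive to the task nodes, or only the classes?
            conf.setJar("crawler.job");

            conf.setMapperClass(CrawlMapper.class);
            // input/output paths and formats left out of this sketch

            JobClient.runJob(conf);
        }
    }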