Pedro Guedes wrote:
Hi hadoopers,

I'm working on an enterprise search engine that runs on a Hadoop cluster but is controlled from the outside. I managed to implement a simple crawler much like Nutch's...

Now I have a new system requirement: the crawl process must be configurable outside Hadoop. This means I'm trying to pass a crawling chain (the steps to execute while crawling a resource) as configuration, and then execute that chain for each resource I find in my crawl database (like Nutch's crawldb).

For this I need to be able to register new steps in my chain and pass them to Hadoop to execute as a mapreduce job. I see two choices here:
1 - build a .job archive (main-class: mycrawler, submits jobs through JobClient) with my new steps and dependencies in the 'lib/'
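A rough sketch of what choice 1 could look like (the class names below, such as FetchMapper and ParseMapper, are made up for illustration): the .job archive's main-class reads the configured chain and submits one MapReduce job per step through JobClient, with each step consuming the previous step's output.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;

public class MyCrawler {

  public static void main(String[] args) throws Exception {
    // the chain definition comes from outside hadoop, e.g.
    // "com.example.FetchMapper,com.example.ParseMapper"
    String[] stepClasses = args[0].split(",");
    Path input = new Path(args[1]);    // e.g. the crawldb
    Path workDir = new Path(args[2]);  // where each step writes its output

    for (int i = 0; i < stepClasses.length; i++) {
      Path output = new Path(workDir, "step-" + i);

      JobConf job = new JobConf(MyCrawler.class);
      job.setJobName("crawl-step-" + i + ": " + stepClasses[i]);
      job.setMapperClass(Class.forName(stepClasses[i]).asSubclass(Mapper.class));
      // a real step would also set its own reducer, output types, formats, ...
      FileInputFormat.setInputPaths(job, input);
      FileOutputFormat.setOutputPath(job, output);

      JobClient.runJob(job);  // blocks until this step finishes
      input = output;         // the next step consumes this step's output
    }
  }
}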
I keep talking to myself... hope it doesn't annoy you too much!
We thought of a solution to our problem in which we build a new .job
file, in accordance with our crawl configuration, and then pass it to
Hadoop for execution... Is there somewhere I can look for the
specification of the .job format?
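For what it's worth, there doesn't seem to be a separate spec: as far as I can tell a .job file is just an ordinary jar with the compiled job classes at its root, dependency jars under lib/, and optionally a Main-Class entry in the manifest; Hadoop unpacks it and puts the classes and lib/*.jar on the job's classpath. So building one "in accordance with our crawl configuration" amounts to writing such a jar, roughly along these lines (the class and method names here are made up):

import java.io.*;
import java.util.jar.*;

public class JobArchiveBuilder {

  public static void build(File outputJob, File classesDir,
                           File[] dependencyJars, String mainClass) throws IOException {
    Manifest manifest = new Manifest();
    manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
    manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);

    JarOutputStream jar = new JarOutputStream(new FileOutputStream(outputJob), manifest);
    try {
      addTree(jar, classesDir, "");                  // compiled job classes at the root
      for (File dep : dependencyJars) {
        addFile(jar, dep, "lib/" + dep.getName());   // dependencies under lib/
      }
    } finally {
      jar.close();
    }
  }

  private static void addTree(JarOutputStream jar, File dir, String prefix) throws IOException {
    File[] children = dir.listFiles();
    if (children == null) return;
    for (File f : children) {
      String name = prefix + f.getName();
      if (f.isDirectory()) {
        addTree(jar, f, name + "/");
      } else {
        addFile(jar, f, name);
      }
    }
  }

  private static void addFile(JarOutputStream jar, File src, String entryName) throws IOException {
    jar.putNextEntry(new JarEntry(entryName));
    InputStream in = new FileInputStream(src);
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) > 0) {
      jar.write(buf, 0, n);
    }
    in.close();
    jar.closeEntry();
  }
}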
I'm not sure exactly what you're trying to do, but you can specify command-line
parameters to hadoop jar which you can interpret in your code. Your
code can then write arbitrary config parameters before starting the
mapreduce job. Based on these configs, you can load specific jars in your
mapreduce job.
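Something like the sketch below shows that approach with the old mapred API (the property name "crawler.steps" and the StepMapper class are placeholders, not anything from the original posts): the driver reads its command-line arguments, writes them into the JobConf before submitting, and the tasks read them back in configure().

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CrawlDriver {

  public static class StepMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String[] steps;

    public void configure(JobConf conf) {
      // the tasks read back whatever the driver wrote into the config
      steps = conf.get("crawler.steps", "").split(",");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      // ... apply each configured step to the record ...
      out.collect(new Text("steps=" + steps.length), value);
    }
  }

  public static void main(String[] args) throws IOException {
    // invoked e.g. as: hadoop jar crawler.job CrawlDriver fetch,parse,index <in> <out>
    JobConf conf = new JobConf(CrawlDriver.class);
    conf.setJobName("configurable-crawl");

    conf.set("crawler.steps", args[0]);   // arbitrary config written by the driver
    conf.setMapperClass(StepMapper.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[1]));
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    // to pull an extra jar (already sitting in HDFS) onto the task classpath:
    // org.apache.hadoop.filecache.DistributedCache.addFileToClassPath(
    //     new Path("/libs/extra-steps.jar"), conf);

    JobClient.runJob(conf);
  }
}

This keeps the chain definition entirely outside Hadoop: only the driver's arguments change between runs, not the jar.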