How parallel is parallel in your case?
Don't forget that Hadoop in distributed mode will serialize your jobs anyhow.

For the rest, why don't you create two Nutch directories and run things
totally independently?
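
A minimal sketch of that two-directory setup (all paths and names below are illustrative, not from your install): each crawl gets its own complete Nutch tree, so each run reads only its own conf/crawl-urlfilter.txt and nothing gets overridden.

```shell
# Illustrative layout: one self-contained Nutch tree per crawl.
base=$(mktemp -d)
for site in abc xyz; do
  mkdir -p "$base/nutch-$site/conf"
  # each copy carries its own domain filter, so the two never clash
  printf '+^http://([a-z0-9]*\\.)*%s.com/\n' "$site" \
    > "$base/nutch-$site/conf/crawl-urlfilter.txt"
done
# you would then launch each crawl from inside its own tree, e.g.:
#   cd "$base/nutch-abc" && bin/nutch crawl urls -dir test1 -depth 1
#   cd "$base/nutch-xyz" && bin/nutch crawl urls -dir test2 -depth 1
cat "$base"/nutch-*/conf/crawl-urlfilter.txt
```

Since each process resolves its configuration relative to its own directory, there is no shared conf state to override.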


2010/3/8, Pravin Karne <pravin_ka...@persistent.co.in>:
> Hi guys, any pointers on the following?
> Your help will be highly appreciated.
>
> Thanks
> -Pravin
>
> -----Original Message-----
> From: Pravin Karne
> Sent: Friday, March 05, 2010 12:57 PM
> To: nutch-user@lucene.apache.org
> Subject: Two Nutch parallel crawl with two conf folder.
>
> Hi,
>
> I want to do two Nutch parallel crawl with two conf folder.
>
> I am using the crawl command to do this. I have two separate conf folders; all
> files in conf are the same except crawl-urlfilter.txt. In this file we have
> different filters (domain filters).
>
>  e.g. the 1st conf has -
>              +^http://([a-z0-9]*\.)*abc.com/
>
>        the 2nd conf has -
>               +^http://([a-z0-9]*\.)*xyz.com/
>
>
> I am starting two crawls with the above configuration, on separate
> consoles (one followed by the other).
>
> I am using the following crawl commands -
>
>       bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1
>
>       bin/nutch --nutch_conf_dir=/home/conf2 crawl urls -dir test2 -depth 1
>
> [Note: We have modified nutch.sh for '--nutch_conf_dir']
>
> The urls file has the following entries -
>
>     http://www.abc.com
>     http://www.xyz.com
>     http://www.pqr.com
>
>
> Expected Result:
>
>      CrawlDB test1 should contain abc.com's data and CrawlDB test2 should
> contain xyz.com's data.
>
> Actual Results:
>
>   The url filter of the first run is overridden by the url filter of the
> second run.
>
>   So both CrawlDBs end up with xyz.com's data.
>
>
> Please provide pointers regarding this.
>
> Thanks in advance.
>
> -Pravin
>
>


-- 
-MilleBii-
