Hi, I want to run two parallel Nutch crawls with two conf folders.
I am using the crawl command to do this. I have two separate conf folders; all files in them are identical except crawl-urlfilter.txt, which contains different domain filters:

    conf1: +^http://([a-z0-9]*\.)*abc.com/
    conf2: +^http://([a-z0-9]*\.)*xyz.com/

I am starting the two crawls with the above configuration in separate consoles, one after the other, using the following commands:

    bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1
    bin/nutch --nutch_conf_dir=/home/conf2 crawl urls -dir test2 -depth 1

[Note: we have modified nutch.sh to accept '--nutch_conf_dir'.]

The urls file has the following entries:

    http://www.abc.com
    http://www.xyz.com
    http://www.pqr.com

Expected result: CrawlDB test1 should contain abc.com's data and CrawlDB test2 should contain xyz.com's data.

Actual result: the URL filter of the first run is overridden by the URL filter of the second run, so both CrawlDBs contain xyz.com's data.

Please provide a pointer regarding this. Thanks in advance.

-Pravin
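P.S. One variant I am considering, in case our nutch.sh modification is the culprit: if I understand the stock bin/nutch script correctly, it already honors a NUTCH_CONF_DIR environment variable for the configuration directory, so each crawl could be given its own conf folder per command without patching the script (same paths as above; just a sketch):

    NUTCH_CONF_DIR=/home/conf1 bin/nutch crawl urls -dir test1 -depth 1
    NUTCH_CONF_DIR=/home/conf2 bin/nutch crawl urls -dir test2 -depth 1

Setting the variable on the command line scopes it to that single invocation, so the two runs should not be able to see each other's conf directory.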