Sorry for the noise... I mixed up emails.


----- Original Message ----
> From: eks dev <eks...@yahoo.co.uk>
> To: nutch-user@lucene.apache.org
> Sent: Tue, 9 March, 2010 18:07:47
> Subject: Re: Two Nutch parallel crawl with two conf folder.
> 
> Cool answer!
> 
> 
> 
> ----- Original Message ----
> > From: MilleBii 
> > To: nutch-user@lucene.apache.org
> > Sent: Tue, 9 March, 2010 8:35:42
> > Subject: Re: Two Nutch parallel crawl with two conf folder.
> > 
> > Yes, it should work. I personally run some test crawls on the same
> > hardware, even in the same Nutch directory, so I share the conf
> > directory.
> > But if you don't want that, I would use two Nutch directories and, of
> > course, two different crawl directories, because with Hadoop they will
> > end up on the same hdfs: (assuming you run in distributed or
> > pseudo-distributed mode)
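> > 
> > For illustration, a minimal sketch of that layout (the /opt paths and
> > the -depth value are my assumptions, not from this thread): each
> > instance is launched from its own Nutch directory and writes to its
> > own crawl directory, so nothing collides on the shared hdfs:
> > 
> >     # console 1, launched from /opt/nutch-abc
> >     bin/nutch crawl urls -dir crawl-abc -depth 1
> > 
> >     # console 2, launched from /opt/nutch-xyz
> >     bin/nutch crawl urls -dir crawl-xyz -depth 1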
> > 
> > 2010/3/9, Pravin Karne :
> > >
> > > Can we share a Hadoop cluster between two Nutch instances?
> > > So there would be two Nutch instances, both pointing to the same
> > > Hadoop cluster.
> > >
> > > This way I am able to share my hardware bandwidth. I know that Hadoop
> > > in distributed mode serializes jobs, but that will not affect my
> > > flow. I just want to share my hardware resources.
> > >
> > > I tried with two Nutch setups, but somehow the second instance
> > > overrides the first one's configuration.
> > >
> > >
> > > Any pointers?
> > >
> > > Thanks
> > > -Pravin
> > >
> > >
> > > -----Original Message-----
> > > From: MilleBii [mailto:mille...@gmail.com]
> > > Sent: Monday, March 08, 2010 8:02 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: Re: Two Nutch parallel crawl with two conf folder.
> > >
> > > How parallel is "parallel" in your case?
> > > Don't forget that Hadoop in distributed mode will serialize your jobs
> > > anyhow.
> > >
> > > For the rest, why don't you create two Nutch directories and run
> > > things totally independently?
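> > >
> > > A minimal sketch of that setup (the paths and the nutch-1.0 release
> > > name are assumptions):
> > >
> > >     cp -r nutch-1.0 /opt/nutch-abc
> > >     cp -r nutch-1.0 /opt/nutch-xyz
> > >     # then give each copy its own conf/crawl-urlfilter.txt (one with
> > >     # the abc.com filter, one with xyz.com) and launch each crawl
> > >     # from its own directory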
> > >
> > >
> > > 2010/3/8, Pravin Karne :
> > >> Hi guys, any pointers on the following?
> > >> Your help will be highly appreciated.
> > >>
> > >> Thanks
> > >> -Pravin
> > >>
> > >> -----Original Message-----
> > >> From: Pravin Karne
> > >> Sent: Friday, March 05, 2010 12:57 PM
> > >> To: nutch-user@lucene.apache.org
> > >> Subject: Two Nutch parallel crawl with two conf folder.
> > >>
> > >> Hi,
> > >>
> > >> I want to run two parallel Nutch crawls with two conf folders.
> > >>
> > >> I am using the crawl command to do this. I have two separate conf
> > >> folders; all files in them are the same except crawl-urlfilter.txt,
> > >> which has a different (domain) filter in each.
> > >>
> > >> e.g. the 1st conf has:
> > >>     +^http://([a-z0-9]*\.)*abc.com/
> > >>
> > >> and the 2nd conf has:
> > >>     +^http://([a-z0-9]*\.)*xyz.com/
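> > >>
> > >> (Assuming each file is based on the stock Nutch crawl-urlfilter.txt,
> > >> it should also keep the closing catch-all rule, so that URLs outside
> > >> the chosen domain, e.g. pqr.com, are skipped. For the first conf:)
> > >>
> > >>     # accept hosts in abc.com
> > >>     +^http://([a-z0-9]*\.)*abc.com/
> > >>     # skip everything else
> > >>     -.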
> > >>
> > >>
> > >> I am starting the two crawls with the above configurations, on
> > >> separate consoles (one after the other).
> > >>
> > >> I am using the following crawl commands:
> > >>
> > >>     bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1
> > >>
> > >>     bin/nutch --nutch_conf_dir=/home/conf2 crawl urls -dir test2 -depth 1
> > >>
> > >> [Note: we have modified nutch.sh to support '--nutch_conf_dir'.]
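> > >>
> > >> (A possible simplification, assuming a stock Nutch install: the
> > >> unmodified bin/nutch script already honors the NUTCH_CONF_DIR
> > >> environment variable, so a per-console conf override may not need a
> > >> patched script at all. A sketch with the same two paths:)
> > >>
> > >>     # console 1
> > >>     NUTCH_CONF_DIR=/home/conf1 bin/nutch crawl urls -dir test1 -depth 1
> > >>     # console 2
> > >>     NUTCH_CONF_DIR=/home/conf2 bin/nutch crawl urls -dir test2 -depth 1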
> > >>
> > >> The urls file has the following entries:
> > >>
> > >>    http://www.abc.com
> > >>    http://www.xyz.com
> > >>    http://www.pqr.com
> > >>
> > >>
> > >> Expected result:
> > >>
> > >>     CrawlDB test1 should contain abc.com's data, and CrawlDB test2
> > >>     should contain xyz.com's data.
> > >>
> > >> Actual result:
> > >>
> > >>     The URL filter of the first run is overridden by the URL filter
> > >>     of the second run, so both CrawlDBs contain xyz.com's data.
> > >>
> > >>
> > >> Please provide pointers regarding this.
> > >>
> > >> Thanks in advance.
> > >>
> > >> -Pravin
> > >>
> > >>
> > >
> > >
> > > --
> > > -MilleBii-
> > >
> > >
> > 
> > 
> > -- 
> > -MilleBii-



