Cool, thanks for the answer.
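In case it helps others hitting the same problem, here is a minimal sketch of the two-directory setup suggested below (the paths are examples, not from the thread). Also worth checking: if your `bin/nutch` honors the `NUTCH_CONF_DIR` environment variable (the stock script in recent 1.x releases does), you may not need to patch nutch.sh at all.

```shell
#!/bin/sh
# Sketch (example paths): two fully separate Nutch checkouts, so that
# nothing under conf/ is shared between the two crawls.
NUTCH1=/opt/nutch-abc   # its conf/crawl-urlfilter.txt holds the abc.com rule
NUTCH2=/opt/nutch-xyz   # its conf/crawl-urlfilter.txt holds the xyz.com rule

# Distinct -dir targets matter too: with Hadoop in (pseudo-)distributed
# mode, both crawls write into the same HDFS.
NUTCH_CONF_DIR="$NUTCH1/conf" "$NUTCH1/bin/nutch" crawl urls -dir crawl-abc -depth 1
NUTCH_CONF_DIR="$NUTCH2/conf" "$NUTCH2/bin/nutch" crawl urls -dir crawl-xyz -depth 1
```

Run each line from its own console if you want them concurrent; Hadoop will still serialize the MapReduce jobs, as noted below.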
----- Original Message -----
> From: MilleBii <mille...@gmail.com>
> To: nutch-user@lucene.apache.org
> Sent: Tue, 9 March, 2010 8:35:42
> Subject: Re: Two Nutch parallel crawl with two conf folder.
>
> Yes, it should work. I personally run some test crawls on the same
> hardware, even in the same Nutch directory, so I share the conf
> directory. But if you don't want that, I would use two Nutch
> directories and of course two different crawl directories, because
> with Hadoop they will end up on the same HDFS (assuming you run in
> distributed or pseudo-distributed mode).
>
> 2010/3/9, Pravin Karne:
> >
> > Can we share a Hadoop cluster between two Nutch instances?
> > There would be two Nutch instances, and they would point to the same
> > Hadoop cluster.
> >
> > This way I am able to share my hardware bandwidth. I know that
> > Hadoop in distributed mode serializes jobs, but that will not affect
> > my flow. I just want to share my hardware resources.
> >
> > I tried with two Nutch setups, but somehow the second instance is
> > overriding the first one's configuration.
> >
> > Any pointers?
> >
> > Thanks
> > -Pravin
> >
> > -----Original Message-----
> > From: MilleBii [mailto:mille...@gmail.com]
> > Sent: Monday, March 08, 2010 8:02 PM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: Two Nutch parallel crawl with two conf folder.
> >
> > How parallel is parallel in your case?
> > Don't forget that Hadoop in distributed mode will serialize your
> > jobs anyhow.
> >
> > For the rest, why don't you create two Nutch directories and run
> > things totally independently?
> >
> > 2010/3/8, Pravin Karne:
> >> Hi guys, any pointers on the following?
> >> Your help will be highly appreciated.
> >>
> >> Thanks
> >> -Pravin
> >>
> >> -----Original Message-----
> >> From: Pravin Karne
> >> Sent: Friday, March 05, 2010 12:57 PM
> >> To: nutch-user@lucene.apache.org
> >> Subject: Two Nutch parallel crawl with two conf folder.
> >>
> >> Hi,
> >>
> >> I want to do two parallel Nutch crawls with two conf folders.
> >>
> >> I am using the crawl command to do this. I have two separate conf
> >> folders; all files in conf are the same except crawl-urlfilter.txt,
> >> which has different (domain) filters in each.
> >>
> >> E.g., the 1st conf has:
> >> +.^http://([a-z0-9]*\.)*abc.com/
> >>
> >> The 2nd conf has:
> >> +.^http://([a-z0-9]*\.)*xyz.com/
> >>
> >> I am starting two crawls with the above configuration on separate
> >> consoles (one right after the other).
> >>
> >> I am using the following crawl commands:
> >>
> >> bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1
> >>
> >> bin/nutch --nutch_conf_dir=/home/conf2 crawl urls -dir test2 -depth 1
> >>
> >> [Note: we have modified nutch.sh for '--nutch_conf_dir'.]
> >>
> >> The urls file has the following entries:
> >>
> >> http://www.abc.com
> >> http://www.xyz.com
> >> http://www.pqr.com
> >>
> >> Expected result:
> >>
> >> CrawlDB test1 should contain abc.com's data and CrawlDB test2
> >> should contain xyz.com's data.
> >>
> >> Actual result:
> >>
> >> The URL filter of the first run is overridden by the URL filter of
> >> the second run, so both CrawlDBs end up with xyz.com's data.
> >>
> >> Please provide pointers regarding this.
> >>
> >> Thanks in advance.
> >>
> >> -Pravin
> >>
> >> DISCLAIMER
> >> ==========
> >> This e-mail may contain privileged and confidential information
> >> which is the property of Persistent Systems Ltd. It is intended
> >> only for the use of the individual or entity to which it is
> >> addressed. If you are not the intended recipient, you are not
> >> authorized to read, retain, copy, print, distribute or use this
> >> message. If you have received this communication in error, please
> >> notify the sender and delete all copies of this message. Persistent
> >> Systems Ltd. does not accept any liability for virus infected mails.
> >>
> >
> > --
> > -MilleBii-
>
> --
> -MilleBii-
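For reference, a quick sketch of how each conf's crawl-urlfilter.txt rule should partition the seed list once the two runs are truly isolated. One caveat: the quoted rules start with `+.`, which looks like a transcription typo. In crawl-urlfilter.txt the first character is the +/- flag and the rest is the regex, so `+.^http://...` yields a pattern (`.^http://...`) that can never match; the rules are assumed here to be `+^http://...`, as in the stock crawl-urlfilter.txt. Seeds are shown with the trailing slash that Nutch's URL normalization adds.

```shell
#!/bin/sh
# Sketch: how each conf's accept rule partitions the seeds.
# (The "+" prefix in crawl-urlfilter.txt means "accept"; the regex follows.)
ABC_RULE='^http://([a-z0-9]*\.)*abc.com/'
XYZ_RULE='^http://([a-z0-9]*\.)*xyz.com/'

SEEDS='http://www.abc.com/
http://www.xyz.com/
http://www.pqr.com/'

# Each filter should accept exactly one seed; pqr.com is rejected by both.
echo "$SEEDS" | grep -E "$ABC_RULE"
echo "$SEEDS" | grep -E "$XYZ_RULE"
```

If both crawl dbs nonetheless end up with the same domain's data, the two runs are reading the same filter file, which matches the override symptom described above.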