Sorry for the noise... I've mixed up emails.
----- Original Message -----
> From: eks dev <eks...@yahoo.co.uk>
> To: nutch-user@lucene.apache.org
> Sent: Tue, 9 March, 2010 18:07:47
> Subject: Re: Two Nutch parallel crawl with two conf folder.
>
> coool answer
>
> > ----- Original Message -----
> > From: MilleBii
> > To: nutch-user@lucene.apache.org
> > Sent: Tue, 9 March, 2010 8:35:42
> > Subject: Re: Two Nutch parallel crawl with two conf folder.
> >
> > Yes, it should work. I personally run some test crawls on the same
> > hardware, even in the same Nutch directory, so I share the conf
> > directory. But if you don't want that, I would use two Nutch
> > directories and, of course, two different crawl directories, because
> > with Hadoop they will end up on the same HDFS (assuming you run in
> > distributed or pseudo-distributed mode).
> >
> > 2010/3/9, Pravin Karne:
> > >
> > > Can we share a Hadoop cluster between two Nutch instances? There
> > > would be two Nutch instances, both pointing to the same Hadoop
> > > cluster.
> > >
> > > This way I can share my hardware bandwidth. I know that Hadoop in
> > > distributed mode serializes jobs, but that will not affect my
> > > flow; I just want to share my hardware resources.
> > >
> > > I tried with two Nutch setups, but somehow the second instance is
> > > overriding the first one's configuration.
> > >
> > > Any pointers?
> > >
> > > Thanks
> > > -Pravin
> > >
> > > -----Original Message-----
> > > From: MilleBii [mailto:mille...@gmail.com]
> > > Sent: Monday, March 08, 2010 8:02 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: Re: Two Nutch parallel crawl with two conf folder.
> > >
> > > How parallel is parallel in your case? Don't forget that Hadoop in
> > > distributed mode will serialize your jobs anyhow.
> > >
> > > For the rest, why don't you create two Nutch directories and run
> > > things totally independently?
> > >
> > > 2010/3/8, Pravin Karne:
> > > > Hi guys, any pointers on the following? Your help will be
> > > > highly appreciated.
> > > >
> > > > Thanks
> > > > -Pravin
> > > >
> > > > -----Original Message-----
> > > > From: Pravin Karne
> > > > Sent: Friday, March 05, 2010 12:57 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: Two Nutch parallel crawl with two conf folder.
> > > >
> > > > Hi,
> > > >
> > > > I want to run two parallel Nutch crawls with two conf folders.
> > > >
> > > > I am using the crawl command to do this. I have two separate
> > > > conf folders; all files in them are identical except
> > > > crawl-urlfilter.txt, which holds a different domain filter in
> > > > each. E.g. the 1st conf has
> > > >
> > > >   +^http://([a-z0-9]*\.)*abc.com/
> > > >
> > > > and the 2nd conf has
> > > >
> > > >   +^http://([a-z0-9]*\.)*xyz.com/
> > > >
> > > > I am starting the two crawls with the above configuration on
> > > > separate consoles, one right after the other, using the
> > > > following crawl commands:
> > > >
> > > >   bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1
> > > >   bin/nutch --nutch_conf_dir=/home/conf2 crawl urls -dir test2 -depth 1
> > > >
> > > > [Note: we have modified nutch.sh to accept '--nutch_conf_dir'.]
> > > >
> > > > The urls file has the following entries:
> > > >
> > > >   http://www.abc.com
> > > >   http://www.xyz.com
> > > >   http://www.pqr.com
> > > >
> > > > Expected result:
> > > >
> > > > CrawlDB test1 should contain abc.com's data, and CrawlDB test2
> > > > should contain xyz.com's data.
> > > >
> > > > Actual result:
> > > >
> > > > The URL filter of the first run is overridden by the URL filter
> > > > of the second run, so both CrawlDBs contain xyz.com's data.
> > > >
> > > > Please provide pointers regarding this.
> > > >
> > > > Thanks in advance.
> > > >
> > > > -Pravin
> > >
> > > --
> > > -MilleBii-
> >
> > --
> > -MilleBii-
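
For reference, a minimal sketch of the per-instance invocation, assuming a stock bin/nutch that reads the NUTCH_CONF_DIR environment variable (the '--nutch_conf_dir' flag in the thread is a local modification to the script; the /home/conf1 and /home/conf2 paths are the thread's own):

  # Point each crawl at its own conf directory; the two directories
  # differ only in conf/crawl-urlfilter.txt, as described above.
  NUTCH_CONF_DIR=/home/conf1 bin/nutch crawl urls -dir test1 -depth 1
  NUTCH_CONF_DIR=/home/conf2 bin/nutch crawl urls -dir test2 -depth 1

Even then, if both runs are launched from one shared Nutch directory, one plausible cause of the behaviour reported above is that the instances share state under that directory (for example, the conf that gets bundled into the running job), which is why the fully separate layout below is the safer bet.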
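
And a sketch of the fully isolated setup MilleBii describes, with two Nutch directories and two crawl directories (the /opt/nutch-abc and /opt/nutch-xyz paths and the apache-nutch source directory are illustrative, not from the thread):

  # Two independent Nutch installs, one per domain filter.
  cp -r apache-nutch /opt/nutch-abc
  cp -r apache-nutch /opt/nutch-xyz

  # Give each install its own URL filter in conf/crawl-urlfilter.txt:
  #   /opt/nutch-abc: +^http://([a-z0-9]*\.)*abc.com/
  #   /opt/nutch-xyz: +^http://([a-z0-9]*\.)*xyz.com/

  # Run each crawl from its own install. The -dir values must differ,
  # because in (pseudo-)distributed mode both outputs land on the same HDFS.
  (cd /opt/nutch-abc && bin/nutch crawl urls -dir crawl-abc -depth 1)
  (cd /opt/nutch-xyz && bin/nutch crawl urls -dir crawl-xyz -depth 1)

Since nothing is shared except the Hadoop cluster itself, neither instance can override the other's crawl-urlfilter.txt; the only remaining coupling is that Hadoop serializes the jobs, as noted above.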