Hi,

Doug Cutting has written a wiki page about the hardware requirements of
Nutch; you can check it out:
http://wiki.apache.org/nutch/HardwareRequirements

Good luck,
yanky

2009/3/4 John Martyniak <[email protected]>

> Regarding the machine, you could run it on anything; it all depends on
> what kind of performance you want. So yes, you could run it on the
> machine that you have, or something like the Linux machine that I have.
> And the DSL connection should be fine; you just need to make sure that it
> stays up the whole time, because if not it will start erroring out, and
> you will have to re-fetch that whole segment, as there is no way to pick
> up from where you left off.
>
> The only reason that I merged the segments was that I had many of them,
> and I wanted to build a big one before I started creating new ones.
> Another advantage of merging is that you can use it to clear out unwanted
> URLs. For example, I had a bunch of .js files in there that I didn't want
> to have as part of the index, so I cleared them out.
>
> I used "bin/nutch mergesegs".
>
> Regarding merging the other parts, I have never used that, but I don't
> think it is necessary unless you have multiple linkdbs, etc.; in my case
> I do not.
>
> -John
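For reference, here is one possible form of the merge John mentions. The
output directory name and the -dir/-filter arguments are assumptions based
on the SegmentMerger tool rather than details from the thread, so run
"bin/nutch mergesegs" with no arguments to confirm the exact usage for your
Nutch version:

    # Merge every segment under crawl/segments into one new merged segment,
    # applying the configured URL filters so that unwanted URLs (e.g. *.js,
    # if a "-\.js$" rule is added to regex-urlfilter.txt) are dropped
    # during the merge.
    bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter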
> On Mar 3, 2009, at 7:14 PM, [email protected] wrote:
>
>> Hi,
>>
>> I will need to index all links in the domains then. Do you think a Linux
>> box like yours with a DSL connection is OK to index the domains I have?
>>
>> Why only segments? I thought we needed to merge all subfolders under the
>> crawl folder. What did you use for merging them?
>>
>> Thanks.
>> A.
>>
>> -----Original Message-----
>> From: John Martyniak <[email protected]>
>> To: [email protected]
>> Sent: Tue, 3 Mar 2009 3:21 pm
>> Subject: Re: what is needed to index for about 10000 domains
>>
>> Well, the way that Nutch works is that you would inject your list of
>> domains into the DB, and that would be the starting point. Since Nutch
>> uses a crawler, it would grab those pages, determine if there are any
>> links on those pages, and then add them to the DB. So the next time that
>> you generated your URLs to fetch, it would take your original list, plus
>> the ones that it found, to generate the new segment.
>>
>> If you wanted to limit it to only pages contained on your 10000 domains,
>> you could use the regex-urlfilter.txt file in the conf directory to
>> limit it to your list. But you would have to create a regular expression
>> for each one.
>>
>> I am not familiar with the merge script on the wiki, but I have merged
>> segments before and it did work. But that was on Linux; I don't think
>> that should make a difference, though.
>>
>> -John
>>
>> On Mar 3, 2009, at 5:10 PM, [email protected] wrote:
>>
>>> Hi,
>>>
>>> Thanks for the reply. I have a list of those domains only; I am not
>>> sure how many pages they have. Is a DSL connection sufficient to run
>>> Nutch in my case? Did you run Nutch for all of your pages at once, or
>>> separately for a given subset of them? By the way, yesterday I tried to
>>> use the merge shell script that we have on the wiki. It gave a lot of
>>> errors. I ran it on Cygwin, though.
>>>
>>> Thanks.
>>> A.
>>>
>>> -----Original Message-----
>>> From: John Martyniak <[email protected]>
>>> To: [email protected]
>>> Sent: Tue, 3 Mar 2009 1:44 pm
>>> Subject: Re: what is needed to index for about 10000 domains
>>>
>>> I think that in order to answer that question, it is necessary to know
>>> how many total pages are being indexed.
>>>
>>> I currently have ~3.5 million pages indexed, and the segment
>>> directories are around 45GB. The response time is relatively fast.
>>>
>>> In the test site it is running on a dual-processor Dell 1850 with 3GB
>>> of RAM.
>>>
>>> -John
>>>
>>> On Mar 3, 2009, at 3:44 PM, [email protected] wrote:
>>>
>>>> Hello,
>>>>
>>>> I use nutch-0.9 and need to index about 10000 domains. I want to know
>>>> the minimum requirements for hardware and memory.
>>>>
>>>> Thanks in advance.
>>>> Alex.
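For readers following the thread, here is a minimal sketch of the
inject/generate/fetch cycle John describes above, assuming a Nutch 0.9
installation in the current directory and seed lists under urls/ (the
directory names follow the standard tutorial layout and are illustrative,
not taken from this thread):

    # 1. Inject the list of 10000 domain URLs into the crawl DB; this is
    #    the starting point John refers to.
    bin/nutch inject crawl/crawldb urls

    # 2. Generate a fetch list; this creates a new segment under
    #    crawl/segments.
    bin/nutch generate crawl/crawldb crawl/segments

    # 3. Fetch the newest segment, then fold the links it discovered back
    #    into the crawl DB so the next generate round includes them.
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment

    # Repeat steps 2-3: each round fetches the original list plus whatever
    # links earlier rounds discovered, which is the behaviour described in
    # the thread.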

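And a sketch of one way to build the per-domain rules for
conf/regex-urlfilter.txt that John suggests, assuming the 10000 domains sit
one per line in a file called domains.txt (the file name and the exact
shape of the regex are illustrative, not from the thread):

    # Escape the dots in each domain, wrap it in an accept rule that also
    # matches subdomains, and append the rules to the URL filter file.
    sed -e 's/\./\\./g' \
        -e 's|.*|+^http://([a-z0-9-]*\\.)*&/|' domains.txt >> conf/regex-urlfilter.txt

    # If the default catch-all accept rule ("+.") is still at the bottom of
    # the file, it would need to become a reject-everything-else rule for
    # the crawl to stay inside the 10000 domains:
    #   -.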