Hi, put this

<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>If true, fetcher will store content.</description>
</property>

in your config file.

Alex.

-----Original Message-----
From: Mayank Kamthan <[email protected]>
To: [email protected]
Sent: Thu, 5 Mar 2009 1:24 pm
Subject: Re: what is needed to index for about 10000 domains

Hi Alxsss,

How can we disable storing of the contents of pages?

Regards,
Mayank.

On Wed, Mar 4, 2009 at 9:57 AM, <[email protected]> wrote:
> Hi,
>
> I also noticed that we can disable storing the content of pages, which is
> what I use. I wonder why someone would need to store content. Also, in the
> case of files, is there a way to tell Nutch not to download the whole file
> but only, say, the first 1000 bytes, and parse and index the information
> in that part of the file?
>
> Thanks.
> A.
>
> -----Original Message-----
> From: yanky young <[email protected]>
> To: [email protected]
> Sent: Tue, 3 Mar 2009 6:41 pm
> Subject: Re: what is needed to index for about 10000 domains
>
> Hi:
>
> Doug Cutting once wrote a wiki page about the hardware requirements of
> Nutch; you can check it out:
>
> http://wiki.apache.org/nutch/HardwareRequirements
>
> good luck
>
> yanky
>
> 2009/3/4 John Martyniak <[email protected]>
>
> > Regarding the machine, you could run it on anything; it all depends what
> > kind of performance you want. So yes, you could run it on the machine
> > that you have, or something like the Linux machine that I have. And the
> > DSL connection should be fine; you just need to make sure that it stays
> > up the whole time, because if not it will start erroring out, and you
> > will have to re-fetch that whole segment, as there is no way to pick up
> > from where you left off.
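For context, the property Alex refers to belongs in conf/nutch-site.xml, which overrides the defaults in conf/nutch-default.xml. A minimal sketch of such a file, assuming an otherwise default configuration:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
    <description>If true, fetcher will store content.</description>
  </property>
</configuration>
```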
> > The only reason that I merged the segments was that I had many of them,
> > and I wanted to build a big one before I started creating new ones.
> > Another advantage of merging is that you can use it to clear out
> > unwanted URLs. For example, I had a bunch of .js files in there that I
> > didn't want to have as part of the index, so I cleared them out.
> >
> > I used "bin/nutch mergesegs".
> >
> > Regarding merging the other parts, I have never used that, but I don't
> > think it is necessary unless you have multiple linkdbs, etc.; in my
> > case I do not.
> >
> > -John
> >
> > On Mar 3, 2009, at 7:14 PM, [email protected] wrote:
> >
> >> Hi,
> >>
> >> I will need to index all links in the domains, then. Do you think a
> >> Linux box like yours with a DSL connection is OK to index the domains
> >> I have?
> >>
> >> Why only segments? I thought we needed to merge all sub-folders under
> >> the crawl folder. What did you use for merging them?
> >>
> >> Thanks.
> >> A.
> >>
> >> -----Original Message-----
> >> From: John Martyniak <[email protected]>
> >> To: [email protected]
> >> Sent: Tue, 3 Mar 2009 3:21 pm
> >> Subject: Re: what is needed to index for about 10000 domains
> >>
> >> Well, the way that Nutch works is that you would inject your list of
> >> domains into the DB, and that would be the starting point. Since Nutch
> >> uses a crawler, it would grab those pages, determine if there are any
> >> links on those pages, and then add them to the DB. So the next time
> >> that you generated your URLs to fetch, it would take your original
> >> list, plus the ones that it found, to generate the new segment.
> >>
> >> If you wanted to limit it to only pages contained on your 10000
> >> domains, you could use the regex-urlfilter.txt file in the conf
> >> directory to limit it to your list.
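The per-domain rules John describes for conf/regex-urlfilter.txt might look like the following sketch. The domains are placeholders, and the `+`/`-` accept/reject lines are Nutch's standard urlfilter syntax, one Java regular expression per line:

```
# conf/regex-urlfilter.txt (sketch): accept only URLs on the listed domains
+^http://([a-z0-9-]+\.)*example\.com/
+^http://([a-z0-9-]+\.)*example\.org/
# reject everything else
-.
```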
> >> But you would have to create a regular expression for each one.
> >>
> >> I am not familiar with the merge script on the wiki, but I have merged
> >> segments before and it did work. That was on Linux, but I don't think
> >> that should make a difference.
> >>
> >> -John
> >>
> >> On Mar 3, 2009, at 5:10 PM, [email protected] wrote:
> >>
> >>> Hi,
> >>>
> >>> Thanks for the reply. I have the list of those domains only. I am not
> >>> sure how many pages they have. Is a DSL connection sufficient to run
> >>> Nutch in my case? Did you run Nutch for all of your pages at once, or
> >>> separately for a given subset of them? Btw, yesterday I tried to use
> >>> the merge shell script that we have on the wiki. It gave a lot of
> >>> errors. I ran it on Cygwin, though.
> >>>
> >>> Thanks.
> >>> A.
> >>>
> >>> -----Original Message-----
> >>> From: John Martyniak <[email protected]>
> >>> To: [email protected]
> >>> Sent: Tue, 3 Mar 2009 1:44 pm
> >>> Subject: Re: what is needed to index for about 10000 domains
> >>>
> >>> I think that in order to answer that question, it is necessary to
> >>> know how many total pages are being indexed.
> >>>
> >>> I currently have ~3.5 million pages indexed, and the segment
> >>> directories are around 45GB. The response time is relatively fast.
> >>>
> >>> In the test site it is running on a dual-processor Dell 1850 with
> >>> 3GB of RAM.
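As a rough back-of-envelope on John's figures (~3.5 million pages in ~45GB of segments), one can estimate per-page storage and scale it to the 10000-domain case. The 100-pages-per-domain figure below is purely an assumption for illustration, not anything stated in the thread:

```python
# Back-of-envelope disk estimate from John's figures:
# ~3.5M pages indexed -> ~45GB of segment directories.
pages_indexed = 3_500_000
segment_bytes = 45 * 1024**3  # 45 GB

bytes_per_page = segment_bytes / pages_indexed

# Hypothetical scenario: 10000 domains at an ASSUMED 100 pages each.
domains = 10_000
assumed_pages_per_domain = 100
est_pages = domains * assumed_pages_per_domain
est_gb = est_pages * bytes_per_page / 1024**3

print(round(bytes_per_page))  # 13805 bytes/page
print(round(est_gb, 1))       # 12.9 GB for a million pages
```

So under that assumption, a million fetched pages would need on the order of 13GB of segment storage, well within a single commodity box; the real driver is how many pages the 10000 domains actually contain.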
> >>> -John
> >>>
> >>> On Mar 3, 2009, at 3:44 PM, [email protected] wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I use nutch-0.9 and need to index about 10000 domains. I want to
> >>>> know the minimum hardware and memory requirements.
> >>>>
> >>>> Thanks in advance.
> >>>> Alex.

--
Mayank Kamthan
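For readers following the segment-merge step John mentions earlier in the thread, the invocation pattern is along these lines; the paths are examples, and the `-filter` flag (which applies the configured URL filters while merging, e.g. to drop .js URLs) is assumed from the stock SegmentMerger options:

```shell
# Merge all existing segments into one new segment (paths are examples):
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments

# Optionally apply the URL filters while merging, to clear out unwanted URLs:
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter
```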
