Hi,

Put this

<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>If true, fetcher will store content.</description>
</property>

in your config file.
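For anyone joining the thread: overrides like this normally go in conf/nutch-site.xml, inside the top-level <configuration> element. A minimal sketch of the stock Hadoop-style config format (the file contents around the property are illustrative):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides values from conf/nutch-default.xml -->
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
    <description>If true, fetcher will store content.</description>
  </property>
</configuration>
```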

Alex.
-----Original Message-----
From: Mayank Kamthan <[email protected]>
To: [email protected]
Sent: Thu, 5 Mar 2009 1:24 pm
Subject: Re: what is needed to index for about 10000 domains

Hi Alxsss,

How can we disable storing of contents of pages?

Regards,
Mayank.

On Wed, Mar 4, 2009 at 9:57 AM, <[email protected]> wrote:

>
>  Hi,
>
> I also noticed that we can disable storing the content of pages, which I
> use. I wonder why someone needs to store content? Also, in the case of
> files, is there a way to tell nutch not to download the whole file but,
> say, only the first 1000 bytes, and parse and index only the information
> in that part of the file?
>
> Thanks.
> A.
>
> -----Original Message-----
> From: yanky young <[email protected]>
> To: [email protected]
> Sent: Tue, 3 Mar 2009 6:41 pm
> Subject: Re: what is needed to index for about 10000 domains
>
> Hi:
>
> Doug Cutting wrote a wiki page about the hardware requirements of Nutch;
> you can check it out:
>
> http://wiki.apache.org/nutch/HardwareRequirements
>
> good luck
>
> yanky
>
>
> 2009/3/4 John Martyniak <[email protected]>
>
> > Regarding the machine, you could run it on anything; it all depends on
> > what kind of performance you want. So yes, you could run it on the
> > machine that you have, or something like the linux machine that I have.
> > The DSL connection should be fine; you just need to make sure that it
> > stays up the whole time, because if not it will start erroring out, and
> > you will have to re-fetch that whole segment, as there is no way to
> > pick up from where you left off.
> >
> > The only reason that I merged the segments was that I had many of them,
> > and I wanted to build a big one before I started creating new ones.
> > Another advantage of merging is that you can use it to clear out
> > unwanted urls. For example, I had a bunch of .js files in there that I
> > didn't want to have as part of the index, so I cleared them out.
> >
> > I used "bin/nutch mergesegs".
> >
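For reference, the merge John mentions might be invoked roughly like this (a sketch only; the crawl directory layout is an assumption, and -filter is the SegmentMerger option that applies the configured URL filters during the merge, which is one way unwanted urls such as .js files can be dropped):

```shell
# Merge every segment under crawl/segments into one new segment,
# applying the URL filters (e.g. regex-urlfilter.txt) as it goes.
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter
```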
> > Regarding merging the other parts, I have never used that, but I don't
> > think it is necessary unless you have multiple linkdbs, etc. In my case
> > I do not.
> >
> > -John
> >
> >
> > On Mar 3, 2009, at 7:14 PM, [email protected] wrote:
> >
> >
> >> Hi,
> >>
> >> I will need to index all links in the domains then. Do you think a
> >> linux box like yours with a DSL connection is OK to index the domains
> >> I have?
> >>
> >> Why only segments? I thought we needed to merge all sub-folders under
> >> the crawl folder. What did you use for merging them?
> >>
> >> Thanks.
> >> A.
> >>
> >>
> >> -----Original Message-----
> >> From: John Martyniak <[email protected]>
> >> To: [email protected]
> >> Sent: Tue, 3 Mar 2009 3:21 pm
> >> Subject: Re: what is needed to index for about 10000 domains
> >>
> >>
> >> Well, the way that nutch works is that you would inject your list of
> >> domains into the DB, and that would be the starting point. Since nutch
> >> uses a crawler, it would grab those pages, determine if there are any
> >> links on those pages, and then add them to the DB. So the next time
> >> that you generated your urls to fetch, it would take your original
> >> list, plus the ones that it found, to generate the new segment.
> >>
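The cycle described above can be sketched with the standard Nutch 0.9 commands (the urls/ seed directory and the crawl/ layout are assumptions, not fixed paths):

```shell
bin/nutch inject crawl/crawldb urls               # seed the db with the domain list
bin/nutch generate crawl/crawldb crawl/segments   # select urls due for fetching
s=`ls -d crawl/segments/* | tail -1`              # the segment just generated
bin/nutch fetch $s                                # fetch the pages
bin/nutch updatedb crawl/crawldb $s               # add newly found links to the db
```

Repeating generate/fetch/updatedb gives the growing crawl frontier John describes.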
> >> If you wanted to limit it to only pages contained on your 10000
> >> domains, you could use the regex-urlfilter.txt file in the conf
> >> directory to limit it to your list. But you would have to create a
> >> regular expression for each one.
> >>
> >> I am not familiar with the merge script on the wiki, but I have merged
> >> segments before and it did work. That was on Linux; I don't think that
> >> should make a difference though.
> >>
> >> -John
> >>
> >> On Mar 3, 2009, at 5:10 PM, [email protected] wrote:
> >>
> >>> Hi,
> >>>
> >>> Thanks for the reply. I have a list of those domains only. I am not
> >>> sure how many pages they have. Is a DSL connection sufficient to run
> >>> nutch in my case? Did you run nutch for all of your pages at once or
> >>> separately for a given subset of them? Btw, yesterday I tried to use
> >>> the merge shell script that we have on the wiki. It gave a lot of
> >>> errors. I ran it on cygwin though.
> >>>
> >>> Thanks.
> >>> A.
> >>>
> >>> -----Original Message-----
> >>> From: John Martyniak <[email protected]>
> >>> To: [email protected]
> >>> Sent: Tue, 3 Mar 2009 1:44 pm
> >>> Subject: Re: what is needed to index for about 10000 domains
> >>>
> >>> I think that in order to answer that question, it is necessary to
> >>> know how many total pages are being indexed.
> >>>
> >>> I currently have ~3.5 million pages indexed, and the segment
> >>> directories are around 45GB. The response time is relatively fast.
> >>>
> >>> On the test site it is running on a dual-processor Dell 1850 with
> >>> 3GB of RAM.
> >>>
> >>> -John
> >>>
> >>> On Mar 3, 2009, at 3:44 PM, [email protected] wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I use nutch-0.9 and need to index about 10000 domains. I want to
> >>>> know the minimum requirements for hardware and memory.
> >>>>
> >>>> Thanks in advance.
> >>>> Alex.


-- 
Mayank Kamthan
