Re: [Nutch-general] Re: Memory usage2

ogjunk-nutch Tue, 02 Aug 2005 13:12:40 -0700

Wow, a pile of questions. :)
Is this for a web-wide search engine?

Otis



--- Jay Pound <[EMAIL PROTECTED]> wrote:

> What's the bottleneck for the slow searching? I'm monitoring it and it's
> at about 57% CPU load when I'm searching. It takes about 50 secs to bring
> up the results page the first time; then if I search for the same thing
> again it's much faster.
> Doug, can I trash my segments after they are indexed? I don't want to have
> cached access to the pages; do the segments still need to be there? My
> 30mil page index/segment is using over 300GB. I have the space, but when I
> get to the hundreds of millions of pages I will run out of room on my RAID
> controllers for HD expansion. I'm planning on moving to Lustre if NDFS is
> not stable by then. I plan on having a multi-billion page index if the
> memory requirements for that can be below 16GB per search node. Right now
> I'm getting pretty crappy results from my 30 million pages. I read the
> whitepaper on "Authoritative Sources in a Hyperlinked Environment" because
> someone said that's how the Nutch algorithm worked, so I'm assuming as my
> index grows the pages that deserve top placement will receive top
> placement, but I don't know if I should re-fetch a new set of segments with
> root URLs just ending in US extensions (.com, .edu, etc.). I made a small
> set testing this theory (100,000 pages) and its results were much better
> than my results from the 30mil page index. What's your thought on this? Am
> I right in thinking that the pages with the most pages linking to them will
> show up first? So if I index 500 million pages, my results should be on par
> with the rest of the "big dogs"?
> 
> One last important question: if I merge my indexes, will it search faster
> than if I don't merge them? I currently have 20 directories of 1-1.7mil
> pages each.
> And if I split up these indexes across multiple machines, will the
> searching be faster? I couldn't get the nutch-server to work, but I'm using
> 0.6.
> I have a very fast server; I didn't know if the searching would take
> advantage of SMP. Fetching will, and I can run multiple indexes at the same
> time. My HD array is 200MB a sec I/O. I have the new dual-core Opteron 275
> (Italy core) with 4GB RAM, working my way to 16GB when I need it and a
> second processor when I need it, 1.28TB of HD space for Nutch currently
> with expansion up to 5.12TB. I'm currently running Windows 2000 on it as
> they haven't made a driver yet for SUSE 9.3 for my RAID cards (HighPoint
> 2220), so my scalability will be to 960MB a sec with all the drives in the
> system and 4x2.2 GHz processor cores. Until I need to cluster, that's what
> I have to play with for Nutch.
> In case you guys needed to know what hardware I'm running.
> Thank you
> -Jay Pound
> Fromped.com
> BTW Windows 2000 is not 100% stable with dual-core processors. Nutch is OK
> but can't do too many things at once or I'll get a kernel inpage error
> (guess it's time to migrate to 2003.net server-damn)
> ----- Original Message ----- 
> From: "Doug Cutting" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Tuesday, August 02, 2005 1:53 PM
> Subject: Re: Memory usage
> 
> 
> > Try the following settings in your nutch-site.xml:
> >
> > <property>
> >    <name>io.map.index.skip</name>
> >    <value>7</value>
> > </property>
> >
> > <property>
> >    <name>indexer.termIndexInterval</name>
> >    <value>1024</value>
> > </property>
> >
> > The first causes data files to use considerably less memory.
> >
> > The second affects index creation, so must be done before you create the
> > index you search.  It's okay if your segment indexes were created
> > without this, you can just (re-)merge indexes and the merged index will
> > get the setting and use less memory when searching.
> >
> > Combining these two I have searched a 40+M page index on a machine using
> > about 500MB of RAM.  That said, search times with such a large index are
> > not good.  At some point, as your collection grows, you will want to
> > merge multiple indexes containing different subsets of segments and put
> > each on a separate box and search them with distributed search.
> >
> > Doug
> >
> > Jay Pound wrote:
> > > I'm testing an index of 30 million pages, it requires 1.5gb of ram to
> > > search using tomcat 5, I plan on having an index with multiple billion
> > > pages, but if this is to scale then even with 16GB of ram I wont be able
> > > to have an index larger than 320million pages? how can I distribute the
> > > memory requirements across multiple machines, or is there another
> > > servlet program (like resin) that will require less memory to operate,
> > > has anyone else run into this?
> > > Thanks,
> > > -Jay Pound
> > >
> > >
> >
> >
> 
> 
> 
> 
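For reference, the memory arithmetic in the quoted thread can be checked with
a rough sketch. It assumes search memory grows linearly with page count, which
is the assumption behind the 320 million figure in the original question but
is not something the thread confirms. The function name pages_per_node is
hypothetical, and the 40+M and 500MB figures are treated as exactly 40 million
pages and 0.5 GB.

```python
def pages_per_node(ram_gb, observed_pages_millions, observed_ram_gb):
    """Linear extrapolation (an assumption): millions of pages that fit in ram_gb."""
    return ram_gb * observed_pages_millions / observed_ram_gb


# Observation from the original question: 30 million pages needed about 1.5 GB.
default_capacity = pages_per_node(16, 30, 1.5)

# Observation from the reply with both settings applied:
# a 40+ million page index searched in about 500 MB (taken here as 0.5 GB).
tuned_capacity = pages_per_node(16, 40, 0.5)

print(f"default settings: ~{default_capacity:.0f} million pages per 16 GB node")
print(f"tuned settings:   ~{tuned_capacity:.0f} million pages per 16 GB node")
```

Under that linear assumption a 16GB node goes from roughly 320 million pages
to roughly 1.28 billion. As the reply notes, search times on a single index
that large are not good, so memory is not the only limit.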
