This is going to be a web-wide search engine; I just want to be able to set
it up per language. Right now it returns results for all languages, so the
results are not very good.
I'm trying to get pruning to work but don't know how; then I'll build a
smaller index for each language out of a larger index containing all
languages.
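A toy sketch of the per-language split I have in mind (plain Java, not the Nutch API; it assumes each page already carries a detected language value, e.g. from a language-identifier plugin): group the pages by language, then index each group separately as its own smaller index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of splitting an all-languages page set into per-language
// groups, each of which would then be indexed on its own. The "lang"
// value per URL is assumed to come from a language-identification step.
public class LanguageSplit {

    // Input: URL -> detected language. Output: language -> URLs.
    public static Map<String, List<String>> splitByLanguage(Map<String, String> pageLang) {
        Map<String, List<String>> byLang = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : pageLang.entrySet()) {
            byLang.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                  .add(e.getKey());
        }
        return byLang;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "en");
        pages.put("http://example.fr/b", "fr");
        pages.put("http://example.com/c", "en");
        // Each group would become its own smaller index.
        System.out.println(splitByLanguage(pages));
    }
}
```

This is only the grouping logic; the actual pruning/re-indexing would still go through Nutch/Lucene tooling.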
-J
----- Original Message ----- 
From: "Sébastien LE CALLONNEC" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, August 02, 2005 4:34 PM
Subject: Re: [Nutch-general] Re: Memory usage2


> Obviously not:  it must be for « [urls] just ending in US
> extensions (.com, .edu, etc.) ». :))
>
> Anyway, it all sounds very impressive!  Good luck with your
> investigations and please keep us posted.
>
>
> Regards,
> Sébastien.
>
>
> --- [EMAIL PROTECTED] a écrit :
>
> > Wow, a pile of questions. :)
> > Is this for a web-wide search engine?
> >
> > Otis
> >
> >
> > --- Jay Pound <[EMAIL PROTECTED]> wrote:
> >
> > > What's the bottleneck for the slow searching? I'm monitoring it, and
> > > it's doing about 57% CPU load when I'm searching. It takes about 50
> > > seconds to bring up the results page the first time; if I search for
> > > the same thing again, it's much faster.
> > > Doug, can I trash my segments after they are indexed? I don't want to
> > > have cached access to the pages; do the segments still need to be
> > > there? My 30-million-page index/segment is using over 300 GB. I have
> > > the space, but when I get to the hundreds of millions of pages I will
> > > run out of room on my RAID controllers for HD expansion; I'm planning
> > > on moving to Lustre if NDFS is not stable by then. I plan on having a
> > > multi-billion-page index if the memory requirements for that can be
> > > below 16 GB per search node. Right now I'm getting pretty crappy
> > > results from my 30 million pages. I read the whitepaper on
> > > Authoritative Sources in a Hyperlinked Environment because someone
> > > said that's how the Nutch algorithm worked, so I'm assuming that as
> > > my index grows, the pages that deserve top placement will receive top
> > > placement. But I don't know if I should re-fetch a new set of
> > > segments with root URLs just ending in US extensions (.com, .edu,
> > > etc.). I made a small set testing this theory (100,000 pages), and
> > > its results were much better than my results from the 30-million-page
> > > index. What's your thought on this? Am I right in thinking that the
> > > pages with the most pages linking to them will show up first? So if I
> > > index 500 million pages, my results should be on par with the rest of
> > > the "big dogs"?
> > >
> > > One last important question: if I merge my indexes, will searching
> > > be faster than if I don't? I currently have 20 directories of 1-1.7
> > > million pages each. And if I split these indexes across multiple
> > > machines, will the searching be faster? I couldn't get the
> > > nutch-server to work, but I'm using 0.6.
> > >
> > > I have a very fast server; I didn't know if searching would take
> > > advantage of SMP. Fetching will, and I can run multiple indexing jobs
> > > at the same time. My HD array does 200 MB/sec I/O. I have the new
> > > dual-core Opteron 275 Italy core with 4 GB of RAM (working my way to
> > > 16 GB and a second processor when I need them), and currently 1.28 TB
> > > of HD space for Nutch, with expansion up to 5.12 TB. I'm currently
> > > running Windows 2000 on it, as they haven't made a SuSE 9.3 driver
> > > yet for my RAID cards (HighPoint 2220), so my scalability will be to
> > > 960 MB/sec with all the drives in the system and 4x2.2 GHz processor
> > > cores. Until I need to cluster, that's what I have to play with for
> > > Nutch, in case you guys needed to know what hardware I'm running.
> > > Thank you,
> > > -Jay Pound
> > > Fromped.com
> > > BTW, Windows 2000 is not 100% stable with dual-core processors. Nutch
> > > is OK, but it can't do too many things at once or I'll get a kernel
> > > inpage error (guess it's time to migrate to Windows Server 2003,
> > > damn).
> > > ----- Original Message ----- 
> > > From: "Doug Cutting" <[EMAIL PROTECTED]>
> > > To: <[email protected]>
> > > Sent: Tuesday, August 02, 2005 1:53 PM
> > > Subject: Re: Memory usage
> > >
> > >
> > > > Try the following settings in your nutch-site.xml:
> > > >
> > > > <property>
> > > >    <name>io.map.index.skip</name>
> > > >    <value>7</value>
> > > > </property>
> > > >
> > > > <property>
> > > >    <name>indexer.termIndexInterval</name>
> > > >    <value>1024</value>
> > > > </property>
> > > >
> > > > The first causes data files to use considerably less memory.
> > > >
> > > > The second affects index creation, so it must be set before you
> > > > create the index you search. It's okay if your segment indexes were
> > > > created without it: you can just (re-)merge the indexes, and the
> > > > merged index will get the setting and use less memory when
> > > > searching.
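A back-of-the-envelope sketch of why the interval setting helps (my numbers, not from the thread): Lucene keeps roughly one out of every `termIndexInterval` terms of the term index in RAM, so raising it from the default 128 to 1024 shrinks the in-memory term index about 8x.

```java
// Back-of-the-envelope estimate of the in-memory term index (assumed
// numbers, not measurements). Lucene holds about one out of every
// `interval` terms in RAM when searching.
public class TermIndexEstimate {

    public static long termsInRam(long totalTerms, int interval) {
        return totalTerms / interval;
    }

    public static void main(String[] args) {
        long totalTerms = 100_000_000L; // hypothetical total term count
        System.out.println(termsInRam(totalTerms, 128));  // default interval: 781250
        System.out.println(termsInRam(totalTerms, 1024)); // suggested value: 97656
    }
}
```

The per-term memory cost multiplies those counts, which is why the saving is significant on large indexes.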
> > > >
> > > > Combining these two, I have searched a 40+M-page index on a machine
> > > > using about 500 MB of RAM. That said, search times with such a
> > > > large index are not good. At some point, as your collection grows,
> > > > you will want to merge multiple indexes containing different
> > > > subsets of segments, put each on a separate box, and search them
> > > > with distributed search.
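For the distributed-search step mentioned above, later Nutch releases read a `search-servers.txt` file from the conf directory listing the search nodes, each of which serves its own index slice. A sketch (hostnames and port are hypothetical; check the documentation for your version):

```
# conf/search-servers.txt -- one "host port" line per search node
node1.example.com 9999
node2.example.com 9999
```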
> > > >
> > > > Doug
> > > >
> > > > Jay Pound wrote:
> > > > > I'm testing an index of 30 million pages; it requires 1.5 GB of
> > > > > RAM to search using Tomcat 5. I plan on having an index with
> > > > > multiple billions of pages, but if this is to scale, then even
> > > > > with 16 GB of RAM I won't be able to have an index larger than
> > > > > 320 million pages. How can I distribute the memory requirements
> > > > > across multiple machines? Or is there another servlet container
> > > > > (like Resin) that will require less memory to operate? Has anyone
> > > > > else run into this?
> > > > > Thanks,
> > > > > -Jay Pound
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > > -------------------------------------------------------
> > > SF.Net email is sponsored by: Discover Easy Linux Migration
> > > Strategies
> > > from IBM. Find simple to follow Roadmaps, straightforward articles,
> > > informative Webcasts and more! Get everything you need to get up to
> > > speed, fast.
> > http://ads.osdn.com/?ad_id=7477&alloc_id=16492&op=click
> > > _______________________________________________
> > > Nutch-general mailing list
> > > [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/nutch-general
> > >
> >
> >
>

