I have found that merging indexes does help performance significantly.
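
I believe Nutch ships an IndexMerger tool for this, but under the hood
it boils down to Lucene's IndexWriter.addIndexes().  A minimal
standalone sketch (Lucene 1.4-era API; the class name is mine, not
Nutch's):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeIndexes {
      // args[0] = output index dir, args[1..] = indexes to merge
      public static void main(String[] args) throws Exception {
        IndexWriter writer =
          new IndexWriter(args[0], new StandardAnalyzer(), true);
        Directory[] dirs = new Directory[args.length - 1];
        for (int i = 1; i < args.length; i++) {
          dirs[i - 1] = FSDirectory.getDirectory(args[i], false);
        }
        writer.addIndexes(dirs);  // merge everything into one index
        writer.optimize();
        writer.close();
      }
    }

One big optimized index means fewer open files and fewer term
dictionaries in memory than twenty small ones.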

If you're not using the cached pages for anything, I believe you can
delete the /content directory for each segment and the engine should
work fine (test it before you try it for real!).  However, if you ever
have to reindex the segments for whatever reason, you'll run into
problems without the /content dirs.
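
If you do go that route, something like this throwaway sketch (not
Nutch code; back your data up first) would strip the /content dir from
every segment:

    import java.io.File;

    public class DeleteContentDirs {
      public static void main(String[] args) {
        // args[0] = the segments dir, e.g. crawl/segments
        File[] segments = new File(args[0]).listFiles();
        for (int i = 0; i < segments.length; i++) {
          File content = new File(segments[i], "content");
          if (content.isDirectory()) delete(content);
        }
      }

      // recursively delete a directory tree
      private static void delete(File f) {
        File[] children = f.listFiles();
        if (children != null) {
          for (int i = 0; i < children.length; i++) delete(children[i]);
        }
        f.delete();
      }
    }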

Nutch doesn't use the HITS algorithm.  Nutch's analyze phase was based
on PageRank, but it's no longer supported.  By default, Nutch boosts
documents based on the number of incoming links, which works well in
small document collections but is not a robust method in a whole-web
environment.  In terms of search quality, Nutch can't hang with the
"big dogs" of search just yet.  There's still much work to be done in
the areas of search quality and spam.
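
The boosting is roughly this idea (illustrative only, not Nutch's
actual code; Nutch exposes a similar knob via indexer.score.power):

    public class LinkBoost {
      // Sub-linear boost from the incoming-link count, so a handful
      // of link farms can't completely dominate the ranking.
      public static float boost(int incomingLinks, float scorePower) {
        return (float) Math.pow(incomingLinks + 1, scorePower);
      }

      public static void main(String[] args) {
        System.out.println(boost(0, 0.5f));    // 1.0
        System.out.println(boost(100, 0.5f));  // ~10.0
      }
    }

The trouble in a whole-web setting is that incoming links are trivially
easy to manufacture, which is exactly the spam problem above.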

Andy

On 8/2/05, Fredrik Andersson <[EMAIL PROTECTED]> wrote:
> Hi Jay!
> 
> Why not use the "Google approach" and buy lots of cheap
> workstations/servers to distribute the search on? You can really get
> away cheap these days, compared to high-end servers. Even if NDFS
> isn't fully up to par in 0.7-dev yet, you can still move your indices
> around to separate computers and distribute them that way.  Writing a
> small client/server for this purpose can be done in a matter of hours.
> Gathering as much data as you have on one server sounds like a bad
> idea to me, no matter how monstrous that server is.
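>
> To illustrate the client half of the idea, here's a toy fan-out
> client (all names and the line protocol are made up, no error
> handling; I believe newer Nutch has a DistributedSearch class that
> does this for real):
>
>     import java.io.BufferedReader;
>     import java.io.InputStreamReader;
>     import java.io.PrintWriter;
>     import java.net.Socket;
>     import java.util.ArrayList;
>     import java.util.Collections;
>     import java.util.Comparator;
>     import java.util.List;
>
>     public class ToyFanOutClient {
>       // args[0] = query, args[1..] = host:port of each index slice.
>       // Toy protocol: send one query line, read "score<TAB>url" lines.
>       public static void main(String[] args) throws Exception {
>         List hits = new ArrayList();
>         for (int i = 1; i < args.length; i++) {
>           String[] hp = args[i].split(":");
>           Socket s = new Socket(hp[0], Integer.parseInt(hp[1]));
>           PrintWriter out = new PrintWriter(s.getOutputStream(), true);
>           BufferedReader in = new BufferedReader(
>               new InputStreamReader(s.getInputStream()));
>           out.println(args[0]);
>           String line;
>           while ((line = in.readLine()) != null) hits.add(line);
>           s.close();
>         }
>         // merge the per-machine results: sort by score, descending
>         Collections.sort(hits, new Comparator() {
>           public int compare(Object a, Object b) {
>             float sa = Float.parseFloat(((String) a).split("\t")[0]);
>             float sb = Float.parseFloat(((String) b).split("\t")[0]);
>             return sa < sb ? 1 : (sa > sb ? -1 : 0);
>           }
>         });
>         for (int i = 0; i < Math.min(10, hits.size()); i++) {
>           System.out.println(hits.get(i));
>         }
>       }
>     }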
> 
> Regarding the HITS algorithm - check out the example on the Nutch
> website for the whole-Internet crawl, where you select the top scorers
> after you've finished a segment (of arbitrary size) and continue
> crawling from those high-ranking sites. That way you get the most
> authoritative sites into your index first, which is good.
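>
> The selection step itself is nothing fancy, just a top-N by score over
> the pages you know about; conceptually something like this (toy code
> with made-up data; I believe "generate -topN" boils down to the same
> thing at scale):
>
>     public class TopScorers {
>       // Pick the n highest-scoring pages to fetch in the next round.
>       public static void main(String[] args) {
>         String[] urls   = { "a.com", "b.com", "c.com", "d.com" };
>         float[]  scores = { 0.2f,    1.5f,    0.7f,    1.1f   };
>         int n = 2;
>         boolean[] taken = new boolean[urls.length];
>         for (int k = 0; k < n; k++) {
>           int best = -1;
>           for (int i = 0; i < urls.length; i++) {
>             if (!taken[i] && (best < 0 || scores[i] > scores[best])) best = i;
>           }
>           taken[best] = true;  // repeated max-scan; fine for a sketch
>           System.out.println(urls[best] + "\t" + scores[best]);
>         }
>       }
>     }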
> 
> Good night,
> Fredrik
> 
> On 8/2/05, Jay Pound <[EMAIL PROTECTED]> wrote:
> > ....
> > One last important question: if I merge my indexes, will searching be
> > faster than if I don't merge them? I currently have 20 directories of
> > 1-1.7 million pages each.
> > And if I split these indexes up across multiple machines, will searching
> > be faster? I couldn't get the nutch-server to work, but I'm using 0.6.
> > ...
> > Thank you
> > -Jay Pound
> > Fromped.com
> > BTW, Windows 2000 is not 100% stable with dual-core processors. Nutch is
> > OK, but it can't do too many things at once or I'll get a kernel inpage
> > error (guess it's time to migrate to 2003 .NET server - damn)
> > ----- Original Message -----
> > From: "Doug Cutting" <[EMAIL PROTECTED]>
> > To: <[email protected]>
> > Sent: Tuesday, August 02, 2005 1:53 PM
> > Subject: Re: Memory usage
> >
> >
> > > Try the following settings in your nutch-site.xml:
> > >
> > > <property>
> > >    <name>io.map.index.skip</name>
> > >    <value>7</value>
> > > </property>
> > >
> > > <property>
> > >    <name>indexer.termIndexInterval</name>
> > >    <value>1024</value>
> > > </property>
> > >
> > > The first causes data files to use considerably less memory.
> > >
> > > The second affects index creation, so must be done before you create the
> > > index you search.  It's okay if your segment indexes were created
> > > without this; you can just (re-)merge the indexes, and the merged
> > > index will pick up the setting and use less memory when searching.
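> > >
> > > Rough arithmetic, assuming Lucene's default termIndexInterval of 128:
> > > with, say, 100M unique terms, the default keeps 100M/128 = ~780K terms
> > > in RAM, while 1024 keeps 100M/1024 = ~98K, i.e. the in-memory term
> > > dictionary shrinks by 8x, at the cost of slightly slower term lookups.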
> > >
> > > Combining these two I have searched a 40+M page index on a machine using
> > > about 500MB of RAM.  That said, search times with such a large index are
> > > not good.  At some point, as your collection grows, you will want to
> > > merge multiple indexes containing different subsets of segments and put
> > > each on a separate box and search them with distributed search.
> > >
> > > Doug
> > >
> > > Jay Pound wrote:
> > > > I'm testing an index of 30 million pages; it requires 1.5GB of RAM to
> > > > search using Tomcat 5.  I plan on having an index with multiple billion
> > > > pages, but if this is to scale, then even with 16GB of RAM I won't be
> > > > able to have an index larger than 320 million pages?  How can I
> > > > distribute the memory requirements across multiple machines, or is
> > > > there another servlet program (like Resin) that will require less
> > > > memory to operate?  Has anyone else run into this?
> > > > Thanks,
> > > > -Jay Pound
> > > >
> > > >
> > >
> > >
> >
> >
> >
>

