Matthias,

minMergeDocs controls how many Documents are buffered in memory before
being flushed to disk as a new segment, and mergeFactor controls how
often segments are merged.  Both values default to 10 in Lucene, I
believe.
If you have Lucene in Action, this is described there in more detail.
If you have the Lucene in Action code, you can play with these
parameters on a small index to see how things behave, without having
to wait for your large 7M-document index.  You can also look for my
name on onjava.com, where one of the Lucene articles I wrote for the
O'Reilly Network describes various Lucene indexing parameters and
includes a small demo class for them.
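
Not that demo class, but here is a minimal sketch of the same idea,
assuming the Lucene 1.4-era API, where these knobs are public fields
on IndexWriter (newer versions use setters); the index path and the
values are just placeholders to experiment with:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class MergeParamsDemo {
      public static void main(String[] args) throws Exception {
          IndexWriter writer =
              new IndexWriter("/tmp/demo-index", new StandardAnalyzer(), true);
          writer.minMergeDocs = 1000; // docs buffered in RAM per new segment
          writer.mergeFactor = 50;    // how many segments accumulate per merge
          for (int i = 0; i < 100000; i++) {
              Document doc = new Document();
              doc.add(Field.Text("body", "document number " + i));
              writer.addDocument(doc);
          }
          writer.optimize();  // time this step to see the settings' effect
          writer.close();
      }
  }

Raising minMergeDocs trades RAM for fewer small segment flushes, and
raising mergeFactor defers merges at the cost of more files on disk.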

Otis


--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Matthias Jaekle wrote:
> >> 050721 071234 * Optimizing index...
> >> ... this takes a long time ...
> > 
> > 
> > Hello,
> > 
> > Optimizing the index takes extremely long.
> > I have the feeling this was much faster in earlier versions.
> > 
> > I am trying to index a segment of 7,000,000 pages.
> > It has been running for 10 days now.
> > Processing started at around 280 pages/second and fell to 4
> > pages/second over the next 3 days.
> > 
> > It has now been optimizing, or maybe writing the index, for 6 days,
> > and I have no idea how long it will take.
> > 
> > In the index dir I now have around 1,000,000 files, together around
> > 15 GB.
> > 
> > Any ideas how to figure out how long this process will run?
> > 
> > How could I speed up this process?
> > 
> > Is the system normally that slow, or do I have something
> > misconfigured?
> 
> This is related to how Lucene manages its indexes during incremental
> updates. You can change the default parameters in nutch-default.xml
> (or rather, in nutch-site.xml) - look for indexer.mergeFactor,
> minMergeDocs and maxMergeDocs. If you know Lucene well, you could
> hack a bit and test a scenario where updates are batched in a
> RAMDirectory and merged with the on-disk index only once they exceed
> a certain threshold - see the sketch below.
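> 
> Roughly, that batching idea looks like the following - an untested
> sketch against the plain Lucene 1.4-era API, outside of Nutch; the
> index path, the document Iterator and the batch size are all
> placeholders:
> 
>   import java.util.Iterator;
>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.index.IndexWriter;
>   import org.apache.lucene.store.Directory;
>   import org.apache.lucene.store.FSDirectory;
>   import org.apache.lucene.store.RAMDirectory;
> 
>   public class BatchedIndexer {
>       public static void index(Iterator docs, int batchSize)
>               throws Exception {
>           Directory disk = FSDirectory.getDirectory("/index/main", false);
>           IndexWriter diskWriter =
>               new IndexWriter(disk, new StandardAnalyzer(), false);
>           RAMDirectory ram = new RAMDirectory();
>           IndexWriter ramWriter =
>               new IndexWriter(ram, new StandardAnalyzer(), true);
>           int buffered = 0;
>           while (docs.hasNext()) {
>               ramWriter.addDocument((Document) docs.next());
>               if (++buffered >= batchSize) {
>                   ramWriter.close();  // seal the in-RAM batch
>                   diskWriter.addIndexes(new Directory[] { ram });
>                   ram = new RAMDirectory();  // start a fresh batch
>                   ramWriter =
>                       new IndexWriter(ram, new StandardAnalyzer(), true);
>                   buffered = 0;
>               }
>           }
>           ramWriter.close();
>           diskWriter.addIndexes(new Directory[] { ram });  // last batch
>           diskWriter.optimize();
>           diskWriter.close();
>       }
>   }
> 
> Note that addIndexes() also optimizes the target index, so pick a
> batch size large enough that merges into the on-disk index are rare.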
> 
> A bit of background: the exact performance depends on a lot of
> factors, but the key ones are related to how well your disk subsystem
> copes with managing many small files. This depends on the filesystem
> and the raw disk I/O. Some filesystems don't like directories with
> millions of files and incur a significant performance penalty. Some
> disk subsystems are good with burstable traffic (because of a large
> cache) but quite bad with sustained traffic.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
