Matthias Jaekle wrote:
050721 071234 * Optimizing index...
... this takes a long time ...


Hello,

optimizing the index takes extremely long.
I have the feeling that in earlier versions this was much faster.

I am trying to index a segment of 7,000,000 pages.
It has been running for 10 days now.
Processing started at around 280 pages/second, then fell over the next 3 days to 4 pages/second.

Now it has been optimizing (or maybe writing the index) for 6 days, and I have no idea how long it will take.

In the index directory I now have around 1,000,000 files, totaling around 15 GB.

Any ideas on how to figure out how long this process will run?

How could I speed up this process?

Is the system normally this slow, or have I misconfigured something?

This is related to how Lucene manages its indexes during incremental updates. You can change the default parameters from nutch-default.xml (or rather, override them in nutch-site.xml): look for indexer.mergeFactor, indexer.minMergeDocs and indexer.maxMergeDocs. If you know Lucene well, you could hack a bit and test a scenario where updates are batched in a RAMDirectory, and only above a certain threshold merged with the on-disk index.
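For example, the overrides in nutch-site.xml might look like the fragment below. The property names are the ones mentioned above; the values shown are placeholders to experiment with, not tuned recommendations:

```xml
<!-- nutch-site.xml: illustrative values only -->
<property>
  <name>indexer.mergeFactor</name>
  <value>50</value>
</property>
<property>
  <name>indexer.minMergeDocs</name>
  <value>500</value>
</property>
<property>
  <name>indexer.maxMergeDocs</name>
  <value>2147483647</value>
</property>
```

Raising minMergeDocs buffers more documents in RAM before a segment is flushed to disk, which reduces the number of small files at the cost of memory.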

A bit of background: the exact performance depends on a lot of factors, but the key ones are related to how well your disk subsystem copes with many small files. This depends on the filesystem and the raw disk I/O. Some filesystems don't like directories with millions of files and incur a significant performance penalty. Some disk subsystems are good with burstable traffic (because of a large cache) but quite bad with sustained traffic.
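To get an intuition for how these parameters drive the file count, here is a toy model (my own simplification, not Nutch code) of a logarithmic merge policy: documents are flushed in batches of minMergeDocs, and once mergeFactor segments pile up at one level they are merged into a single segment at the next level. In steady state the segment count is the sum of the base-mergeFactor digits of the batch count, and each segment is itself several files on disk:

```python
def segment_count(num_docs, merge_factor=50, min_merge_docs=50):
    """Toy model of a logarithmic merge policy.

    Documents are flushed in batches of min_merge_docs; whenever
    merge_factor segments accumulate at one level, they are merged
    into one segment at the next level.  The resulting segment count
    equals the digit sum of the batch count in base merge_factor.
    """
    batches = num_docs // min_merge_docs
    segments = 0
    while batches > 0:
        segments += batches % merge_factor
        batches //= merge_factor
    return segments

# 7,000,000 docs with the (assumed) defaults of 50/50:
print(segment_count(7_000_000, 50, 50))     # few segments, lots of merging
# A huge mergeFactor defers merges but leaves many segments open:
print(segment_count(7_000_000, 1000, 50))
```

The model shows the trade-off: a small mergeFactor keeps few segments (and files) on disk but pays for it with constant merging I/O, which is where a multi-day "Optimizing index..." phase can come from.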

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
