Matthias Jaekle wrote:
050721 071234 * Optimizing index...
... this takes a long time ...
Hello,
optimizing the index takes extremely long.
I have the feeling that this was much faster in earlier versions.
I am trying to index a segment of 7,000,000 pages.
It has been running for 10 days now.
Processing started at around 280 pages/second and then dropped to
4 pages/second over the next 3 days.
It has now been optimizing (or perhaps writing) the index for 6 days,
and I have no idea how long it will take.
The index directory now contains around 1,000,000 files, totaling around 15 GB.
Any ideas how to figure out how long this process will run?
How could I speed up this process?
Is the system normally that slow, or have I misconfigured something?
This is related to how Lucene manages its indexes during incremental
updates. You can change the default parameters in nutch-default.xml (or
better, override them in nutch-site.xml) - look for indexer.mergeFactor,
indexer.minMergeDocs and indexer.maxMergeDocs. If you know Lucene well,
you could hack a bit and test a scenario where updates are batched in a
RAMDirectory, and merged with the on-disk indexes only above a certain
threshold.
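As a rough sketch, overriding those properties in conf/nutch-site.xml could look like the fragment below (the property names are taken from nutch-default.xml; the values shown are illustrative guesses, not recommendations - tune them for your own hardware and segment size):

```xml
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml.
     Values below are illustrative only. -->
<nutch-conf>
  <property>
    <name>indexer.mergeFactor</name>
    <value>50</value>
    <!-- Higher values mean fewer, larger merges, but more
         simultaneously open files in the index directory. -->
  </property>
  <property>
    <name>indexer.minMergeDocs</name>
    <value>500</value>
    <!-- How many documents are buffered in memory before a
         new on-disk segment is written. -->
  </property>
  <property>
    <name>indexer.maxMergeDocs</name>
    <value>2147483647</value>
    <!-- Upper bound on the number of documents in a merged
         segment; effectively unlimited here. -->
  </property>
</nutch-conf>
```

The trade-off: a larger mergeFactor and minMergeDocs reduce how often small segments are merged (less disk churn during indexing), at the cost of more files on disk and a bigger final optimize step.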
A bit of background: the exact performance depends on many factors,
but the key ones relate to how well your disk subsystem copes with
managing many small files. This depends on the filesystem and the raw
disk I/O. Some filesystems don't like directories with millions of
files, and incur a significant performance penalty. Some disk subsystems
handle bursty traffic well (because of large caches) but are quite bad
at sustained traffic.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com