Hi Andrzej,

thanks for your response. I am not really familiar with the Lucene internals.

I am just running Nutch with the default parameters on a Debian Sarge system with an ext3 file system, a limit of 1024 open files, and 1 GB of RAM.

So is ext3 a bad file system for millions of files?

I cannot change the file system at the moment, so I think I should change the parameters instead.

Which values would you suggest for
* indexer.mergeFactor?
* indexer.minMergeDocs?
* indexer.maxMergeDocs?
* indexer.termIndexInterval?

Many thanks for your support

Matthias



This is related to how Lucene manages its indexes during incremental updates. You can change the default parameters in nutch-default.xml (or better, override them in nutch-site.xml) - look for indexer.mergeFactor, indexer.minMergeDocs and indexer.maxMergeDocs. If you know Lucene well, you could hack a bit and test a scenario where updates are batched in a RAMDirectory and merged with the on-disk indexes only once they exceed a certain threshold.
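
For reference, those indexer.* properties map straight onto Lucene's IndexWriter settings. Here is a minimal, untested sketch of what they control, assuming the Lucene 1.4-era API bundled with Nutch, where mergeFactor, minMergeDocs and maxMergeDocs are public fields on IndexWriter (the path and the values below are only illustrative, not recommendations):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class TunedIndexWriter {
      public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.getDirectory("/path/to/index", true);
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        // Roughly what indexer.mergeFactor / minMergeDocs / maxMergeDocs
        // control in nutch-site.xml; the numbers are illustrative only.
        writer.mergeFactor  = 30;        // segments merged per merge pass
        writer.minMergeDocs = 1000;      // docs buffered in RAM before a segment is written
        writer.maxMergeDocs = 1000000;   // upper bound on documents per segment

        // ... writer.addDocument(...) for each document ...

        writer.optimize();
        writer.close();
      }
    }

Raising minMergeDocs keeps more documents in RAM before the smallest segments are written, which directly reduces the number of tiny files created; a larger mergeFactor speeds up indexing by merging less often, but at the cost of more segments (and open files) on disk at any one time.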
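
And a rough, untested sketch of the RAMDirectory batching idea: buffer documents in memory and merge each batch into the on-disk index with addIndexes(). The class name and the batch threshold are made up for illustration:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    // Hypothetical sketch: buffer documents in a RAMDirectory and merge the
    // whole batch into the on-disk index only when a threshold is reached.
    public class BatchedIndexer {
      private static final int BATCH_SIZE = 10000;   // made-up threshold

      private final IndexWriter diskWriter;
      private RAMDirectory ramDir;
      private IndexWriter ramWriter;
      private int buffered = 0;

      public BatchedIndexer(String path) throws Exception {
        diskWriter = new IndexWriter(FSDirectory.getDirectory(path, true),
                                     new StandardAnalyzer(), true);
        newBatch();
      }

      private void newBatch() throws Exception {
        ramDir = new RAMDirectory();
        ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
        buffered = 0;
      }

      public void add(Document doc) throws Exception {
        ramWriter.addDocument(doc);
        if (++buffered >= BATCH_SIZE) {
          flush();
        }
      }

      public void flush() throws Exception {
        ramWriter.close();
        if (buffered > 0) {
          // one large sequential merge instead of many small on-disk segments
          diskWriter.addIndexes(new Directory[] { ramDir });
        }
        newBatch();
      }

      public void close() throws Exception {
        ramWriter.close();
        if (buffered > 0) {
          diskWriter.addIndexes(new Directory[] { ramDir });
        }
        diskWriter.close();
      }
    }

Each flush turns thousands of tiny writes into one large sequential merge, which should be much kinder to ext3 than creating a long tail of small segment files.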

A bit of background: the exact performance depends on a lot of factors, but the key ones are related to how well your disk subsystem copes with managing many small files. This depends on the filesystem and the raw disk I/O. Some filesystems don't like directories with millions of files and incur a significant performance penalty. Some disk subsystems handle bursty traffic well (because of a large cache) but are quite bad with sustained traffic.
