Hi Andrzej,
thanks for your response. I am not really familiar with the Lucene internals.
I am just running Nutch with the default parameters on a Debian sarge
system with an ext3 file system, a maximum of 1024 open files, and 1 GB RAM.
So is ext3 a bad file system for millions of files?
I cannot change the file system at the moment, so I think I should
change the parameters instead.
Which values would you suggest for
* indexer.mergeFactor?
* indexer.minMergeDocs?
* indexer.maxMergeDocs?
* indexer.termIndexInterval?
Many thanks for your support
Matthias

Andrzej wrote:
This is related to how Lucene manages its indexes during incremental
updates. You can change the default parameters in nutch-default.xml (or
better, override them in nutch-site.xml) - look for indexer.mergeFactor,
indexer.minMergeDocs and indexer.maxMergeDocs. If you know Lucene well,
you could hack a bit and test a scenario where updates are batched in a
RAMDirectory, and only above a certain threshold are they merged with
the on-disk indexes.
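For example, an override in nutch-site.xml could look like the fragment below. The values are only illustrative starting points for experimentation, not recommendations from this thread - tune them against your own hardware and open-file limit:

```xml
<!-- nutch-site.xml: example overrides (illustrative values only) -->
<property>
  <name>indexer.mergeFactor</name>
  <value>50</value>
  <description>How many segments accumulate before being merged; higher
  values speed up indexing but keep more files open at once.</description>
</property>
<property>
  <name>indexer.minMergeDocs</name>
  <value>500</value>
  <description>Documents buffered in RAM before a new on-disk segment is
  written; raising it means fewer, larger segment files.</description>
</property>
<property>
  <name>indexer.maxMergeDocs</name>
  <value>2147483647</value>
  <description>Segments with more documents than this are never merged
  further.</description>
</property>
<property>
  <name>indexer.termIndexInterval</name>
  <value>128</value>
  <description>Spacing of entries in the term index; larger values use
  less memory at the cost of slower term lookups.</description>
</property>
```

With only 1024 open files allowed, a smaller mergeFactor combined with a larger minMergeDocs is usually the safer direction, since it reduces the number of small segment files on disk.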
A bit of background: the exact performance depends on a lot of factors,
but the key ones are related to how well your disk subsystem copes with
managing many small files. This depends on the filesystem and the raw
disk I/O. Some filesystems don't like directories with millions of
files, and incur a significant performance penalty. Some disk subsystems
are good with bursty traffic (because of a large cache) but quite bad
with sustained traffic.
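
The RAMDirectory batching mentioned above could be sketched roughly like this, written against the Lucene 1.x API of that era. The index path, the threshold, and the document contents are made up for illustration; treat it as a sketch, not a drop-in patch for Nutch:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class BatchedIndexer {
    // Hypothetical batch size: how many docs to buffer in RAM
    // before touching the on-disk index.
    private static final int FLUSH_THRESHOLD = 10000;

    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Build a small index entirely in memory first.
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
        for (int i = 0; i < FLUSH_THRESHOLD; i++) {
            Document doc = new Document();
            doc.add(Field.Text("content", "document number " + i));
            ramWriter.addDocument(doc);
        }
        ramWriter.close();

        // Merge the whole in-memory batch into the on-disk index in
        // one pass, instead of writing many tiny segment files to ext3.
        IndexWriter fsWriter = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index", true),
            analyzer, true);
        fsWriter.addIndexes(new Directory[] { ramDir });
        fsWriter.close();
    }
}
```

The point of the batching is that the expensive filesystem work happens once per threshold instead of once per small segment, which is exactly where ext3 with millions of files tends to hurt.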