> From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
> 
> We're having a heck of a time with too many file handles 
> around here.  When
> we create large indexes, we often get thousands of temporary 
> files in a given index!

Thousands, eh?  That seems high.

The maximum number of segments should be f*log_f(N), where f is the
IndexWriter.mergeFactor and N is the number of documents.  The default merge
factor is ten.  There are seven files per segment, plus one per field.  If
we assume that you have three fields per document, then it's ten files per
segment.  So to get 1000 files in an index with three fields and a
mergeFactor of ten, you'd need 10 billion documents, which I doubt you have.
(Lucene can't handle more than 2 billion anyway...)
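That arithmetic is easy to check in a few lines of plain Java. This is just the formula above, not Lucene code; `maxSegments` and `filesPerSegment` are names I've made up for the sketch:

```java
public class SegmentFileMath {
    // Rough upper bound on segment count: f * log_f(N), per the formula above.
    static long maxSegments(int mergeFactor, long numDocs) {
        double logF = Math.log(numDocs) / Math.log(mergeFactor);
        return Math.round(mergeFactor * logF);
    }

    // Seven fixed files per segment, plus one .f file per field.
    static long filesPerSegment(int numFields) {
        return 7 + numFields;
    }

    public static void main(String[] args) {
        long docs = 10_000_000_000L;  // ten billion
        long files = maxSegments(10, docs) * filesPerSegment(3);
        System.out.println(files);    // 100 segments * 10 files = 1000
    }
}
```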

How many fields do you have?  (How many different .f files are there per
segment?)

Have you lowered IndexWriter.maxMergeDocs?  If, for example, you lowered this to
10,000, then with a million documents you'd have 100 segments, which would
give you 1000 files.  So, to minimize the number of files, keep maxMergeDocs
at Integer.MAX_VALUE, its default.
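The same back-of-the-envelope check for the maxMergeDocs case (again plain Java, not Lucene code; `minSegments` is a made-up name):

```java
public class MaxMergeDocsMath {
    // With maxMergeDocs set, segments can't merge past that size, so the
    // index bottoms out at roughly numDocs / maxMergeDocs segments.
    static long minSegments(long numDocs, int maxMergeDocs) {
        return (numDocs + maxMergeDocs - 1) / maxMergeDocs;  // ceiling division
    }

    public static void main(String[] args) {
        long segments = minSegments(1_000_000, 10_000);  // 100 segments
        long files = segments * 10;                      // ~10 files per segment
        System.out.println(files);                       // 1000
    }
}
```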

Another possibility is that you're running on Win32 and obsolete files are
being kept open by IndexReaders and cannot be deleted.  Could that be the
case?

> Even worse, we just plain run out of file 
> handles--even on
> boxes where we've upped the limits as much as we think we 
> can!

You should endeavour to keep just one open IndexReader per index at a time.
When it is out of date, don't close it, as that could break queries running
in other threads; just let it get garbage collected.  The finalizers will
close things and free the file handles.
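A sketch of that lifecycle, with a stub class standing in for IndexReader, since the point is the reference handling rather than the Lucene API:

```java
// Sketch of the "swap, don't close" pattern.  StubReader stands in for
// IndexReader; in real Lucene, the reader's finalizer releases its file
// handles once no thread references the stale reader.
public class ReaderSwap {
    static class StubReader {
        final int version;
        StubReader(int version) { this.version = version; }
    }

    private volatile StubReader current = new StubReader(1);

    // A search grabs the shared reference and keeps its own local copy,
    // so a swap mid-query doesn't affect it.
    StubReader acquire() { return current; }

    // When the index changes, install a fresh reader WITHOUT closing the
    // old one; the old reader becomes unreachable once in-flight queries
    // finish, and garbage collection frees its file handles.
    void refresh(int newVersion) {
        current = new StubReader(newVersion);
    }

    public static void main(String[] args) {
        ReaderSwap swap = new ReaderSwap();
        StubReader inFlight = swap.acquire();  // a query begins
        swap.refresh(2);                       // index updated meanwhile
        // the in-flight query still sees version 1; new queries see 2
        System.out.println(inFlight.version + " " + swap.acquire().version);
    }
}
```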

> I'm not very familiar with the Lucene file system yet, so can someone
> briefly explain how Lucene works on creating an index?  How does it
> determine when to create a new temporary file in the index 
> and when does it
> decide to compress the index?

Assume mergeFactor is ten, the default.  A new segment is created on disk
for every ten documents added, or sooner if IndexWriter.close() is called
before ten have been added.  When the tenth segment of size ten is added,
all ten are merged into a single segment of size 100.  When ten such
segments of size 100 have been added, these are merged into a single segment
containing 1000 documents, and so on.  So at any time there can be no more
than nine segments in each power-of-ten index size.  When optimize() is
called all segments are merged into a single segment.

The exception is that no segments will be created larger than
IndexWriter.maxMergeDocs.  So if this were set to 1000, then when you add
the 10,000th document, instead of merging things into a single segment of
10,000, it would add a tenth segment of size 1000, and keep adding segments
of size 1000 for every 1000 documents added.
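The cascade, including the maxMergeDocs cutoff, can be simulated without Lucene. This toy model (not real Lucene code) just tracks segment sizes on a stack: every mergeFactor documents flush a new segment, and whenever mergeFactor equal-sized segments sit on top they merge into one, unless the merged segment would exceed maxMergeDocs:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy simulation of the merge cascade described above.
public class MergeModel {
    static Deque<Long> addDocs(long numDocs, int mergeFactor, long maxMergeDocs) {
        Deque<Long> segments = new ArrayDeque<>();  // head = most recent segment
        for (long n = mergeFactor; n <= numDocs; n += mergeFactor) {
            segments.push((long) mergeFactor);      // flush a new small segment
            // cascade: merge runs of mergeFactor equal-sized segments,
            // but never past maxMergeDocs
            while (topRun(segments) == mergeFactor
                    && segments.peek() * mergeFactor <= maxMergeDocs) {
                long size = segments.peek();
                for (int i = 0; i < mergeFactor; i++) segments.pop();
                segments.push(size * mergeFactor);
            }
        }
        return segments;
    }

    // Length of the run of equal-sized segments on top of the stack.
    static int topRun(Deque<Long> segments) {
        if (segments.isEmpty()) return 0;
        long top = segments.peek();
        int run = 0;
        for (long s : segments) {
            if (s == top) run++; else break;
        }
        return run;
    }

    public static void main(String[] args) {
        // No cap: 1,000 docs collapse into a single segment.
        System.out.println(addDocs(1000, 10, Long.MAX_VALUE).size());   // 1
        // maxMergeDocs = 1000: 10,000 docs stay as ten segments of 1,000.
        System.out.println(addDocs(10_000, 10, 1000).size());           // 10
    }
}
```

With 230 documents and no cap, the model ends with two segments of 100 and three of 10, matching the "no more than nine segments per power-of-ten size" rule above.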

> Also, is there any way we 
> could limit the
> number of file handles used by Lucene?

An IndexReader keeps all files in all segments open while it is open.  So to
minimize the number of file handles you should minimize the number of
segments, minimize the number of fields, and minimize the number of
IndexReaders open at once.

An IndexWriter also has all files in all segments open at once.  So updating
in a separate process would also buy you more file handles.

Doug
