RE: Indexing problem

Doug Cutting Fri, 02 Nov 2001 08:30:05 -0800

> From: Daryl Thachuk [mailto:[EMAIL PROTECTED]]
> 
> A question I'd like answered is, why do I now have to be 
> concerned about 
> having too many files open when before I didn't? What has changed to 
> cause this? This sounds like a bug to me.


Sigh.

IndexReader now keeps all files that are not read entirely into memory open
as long as the IndexReader is open.  This was to fix the bug where another
thread or process, while updating the index, would delete files that an open
index reader might need.  So there are now a few more files kept open per
segment, making it easier to run out of file handles.  IndexWriter uses
IndexReader internally, so the number of open files while indexing has also
increased.

In particular, there are five files, plus one per field, kept open per
segment.  While indexing, a maximum of IndexWriter.MergeFactor+1 segments
are ever open at once.  So a million document, three field index with
IndexWriter.MergeFactor=10, would have a maximum of 88 files open at a time
while indexing.

Note however, that an IndexReader must keep all segments open.  The maximum
number of segments in an index is (k - 1) * ( log_k(N) - 1), where k is the
IndexWriter.mergeFactor and N is the number of documents.  So an index with
a million documents could have up to 45 segments (on average it will have
22.5).  With three fields, an unoptimized IndexReader would require a
maximum of 360 open files.  Once optimized to a single segment, it would
require only 8 open files.

In practice, this should not be a problem.  Have you raised
IndexWriter.mergeFactor?  If so, try lowering it to the default, 10.  Are
you also opening IndexReaders in the same process?  If so, keep just one per
index, shared by all search threads, and, if possible, only open a new one
when the index has just been optimized.  Ideally, document additions should
be batched, and finished by a call to optimize().  Not only do optimized
indexes have fewer files open, but they're must faster to search.

Strictly speaking, since there is only supposed to be a single writer for an
index at a time, IndexWriter does not need to keep files open except when it
is using them.  So the number of file handles used while indexing could be
reduced if IndexWriter were permitted to open IndexReaders in a special
private mode, where files are opened on demand and closed prompty.  That
said, this might permit you to more easily create an index that you cannot
read!

On the upside, at search time, each query used to open a file per term (two
files per phrase term) per segment.  So big queries, or lots of concurrent
small ones, used to run out of file handles.  This is no longer the case.
IndexReader now opens every file once and only once.  Now it just keeps most
of them open...

Doug


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

RE: Indexing problem

Reply via email to