Re: DbDirectory and compound files

Andi Vajda Thu, 30 Sep 2004 09:34:08 -0700

The purpose of the compound file implementation is to minimize the number of files that an IndexReader must keep open. Instead of 7 files plus one per indexed field for each segment, only a single file must be kept open per segment. This helps applications which keep lots of unoptimized indexes open. (It also, and this is more common, helps folks who open a new IndexReader for each query and don't close it. In this case, opening fewer files gives the garbage collector time to close files before the process runs into its file descriptor limit, inducing a flurry of bug reports about "too many open files".)
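As a rough illustration of the file counts described above, here is a minimal sketch. The class and method names are made up for this example, and the segment and field counts are arbitrary; only the "7 + indexed fields" figure comes from the message.

```java
// Illustrative arithmetic only; not part of Lucene.
final class OpenFileEstimate {
    private OpenFileEstimate() {}

    /** Files held open without the compound format: (7 + indexed fields) per segment. */
    static int multiFile(int segments, int indexedFields) {
        return segments * (7 + indexedFields);
    }

    /** Files held open with the compound format: one per segment. */
    static int compound(int segments) {
        return segments;
    }

    public static void main(String[] args) {
        int segments = 10;      // arbitrary example value
        int indexedFields = 5;  // arbitrary example value
        System.out.println("multi-file: " + multiFile(segments, indexedFields)); // prints "multi-file: 120"
        System.out.println("compound:   " + compound(segments));                 // prints "compound:   10"
    }
}
```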

Does that make any more sense?

Yes, thanks for the explanation. This confirms that the compound file implementation is not that useful when used in conjunction with the DbDirectory implementation, since the only open OS files are the ones opened by Berkeley DB, i.e., the two db files plus some log files if transactions are used. The number of open OS files is more or less constant, is controlled by the Berkeley DB environment, and is independent of the number of IndexWriter instances open. This thinking would also apply to RAMDirectory. No files are open at all in that case, right?

These changes are backward-compatible: the old classes and methods are still there and interoperate with the new ones, but are deprecated. You might wait until there is a Lucene release with the new API in it before you update DbDirectory. To move to the new API, all that should be required is changing your subclass of InputStream to instead subclass BufferedIndexInput, and also changing your subclass of OutputStream to instead subclass BufferedIndexOutput. You'll also need to add a length() method to your BufferedIndexInput subclass, instead of setting a protected length field in the constructor. That's it.
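The shape of that change can be sketched as follows. This is only an illustration under stated assumptions: SketchBufferedInput is a hypothetical stand-in written for this example, not Lucene's BufferedIndexInput, and SketchDbInput uses a byte array in place of data stored in Berkeley DB. The point it shows is the subclass reporting its size through a length() method rather than a protected field set in the constructor.

```java
import java.io.IOException;

// Hypothetical stand-in for a buffered input base class; NOT Lucene's real code.
abstract class SketchBufferedInput {
    private final byte[] buffer = new byte[16];
    private long bufferStart = 0;    // position in the underlying data of buffer[0]
    private int bufferLength = 0;    // number of valid bytes in the buffer
    private int bufferPosition = 0;  // index of the next byte to hand out

    /** Reads one byte, refilling the buffer from the subclass when it runs dry. */
    public final byte readByte() throws IOException {
        if (bufferPosition >= bufferLength) {
            refill();
        }
        return buffer[bufferPosition++];
    }

    private void refill() throws IOException {
        long start = bufferStart + bufferPosition;
        long remaining = length() - start;  // the base class asks the subclass for the size
        if (remaining <= 0) {
            throw new IOException("read past EOF");
        }
        int toRead = (int) Math.min(buffer.length, remaining);
        readInternal(buffer, 0, toRead);
        bufferStart = start;
        bufferLength = toRead;
        bufferPosition = 0;
    }

    /** Subclasses copy the next len bytes of their data into dest. */
    protected abstract void readInternal(byte[] dest, int offset, int len) throws IOException;

    /** Subclasses report their size here instead of setting a protected field. */
    public abstract long length();
}

// Hypothetical subclass: a byte array plays the role of the stored data.
final class SketchDbInput extends SketchBufferedInput {
    private final byte[] data;
    private int position = 0;

    SketchDbInput(byte[] data) {
        this.data = data;
    }

    @Override
    protected void readInternal(byte[] dest, int offset, int len) {
        System.arraycopy(data, position, dest, offset, len);
        position += len;
    }

    @Override
    public long length() {
        return data.length;
    }
}
```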


Cool, that should be easy enough.

The revision of the API was primarily to make buffering optional. We could have left the buffered implementation names the same, but then the classes would be named poorly and it also seemed like an opportunity to remove the name clash with java.io.

This point about buffering brings up another point. Currently, there is no public way to tell the open IndexWriter to flush its Directory. This makes it difficult to use several transactions during the lifetime of the IndexWriter. For example, it would be good if the Berkeley DB transaction could be committed after each indexing operation. For that to work, though, the DbDirectory buffers have to be flushed first, and there is no public API available at the moment to tell the IndexWriter to make this happen. It seems that you're saying that this situation is improved with the new index IO classes since buffering was made optional?
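The ordering problem described above can be sketched as follows. Everything here is hypothetical and written for this example: SketchTxnStore is not the Berkeley DB API, and the flush() shown is the sketch's own method, not an IndexWriter or Directory call. It only illustrates that bytes still sitting in an output buffer are not part of a commit unless the buffer is flushed first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a transactional store; not the Berkeley DB API.
final class SketchTxnStore {
    private final List<Byte> pending = new ArrayList<>();
    private final List<Byte> committed = new ArrayList<>();

    /** Stages len bytes in the current transaction. */
    void write(byte[] bytes, int len) {
        for (int i = 0; i < len; i++) {
            pending.add(bytes[i]);
        }
    }

    /** Makes everything staged so far durable. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    int committedSize() {
        return committed.size();
    }
}

// Hypothetical buffered output sitting between the indexer and the store.
final class SketchBufferedOutput {
    private final byte[] buffer = new byte[8];
    private int count = 0;
    private final SketchTxnStore store;

    SketchBufferedOutput(SketchTxnStore store) {
        this.store = store;
    }

    void writeByte(byte b) {
        if (count == buffer.length) {
            flush();
        }
        buffer[count++] = b;
    }

    /** Pushes buffered bytes down to the store; must happen before commit. */
    void flush() {
        store.write(buffer, count);
        count = 0;
    }
}
```

Writing three bytes and committing without a flush leaves nothing committed in this sketch; calling flush() and then commit() makes all three durable.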

Andi..
