Dmitry,

It would be cleaner if this could be done entirely as a Directory implementation. I know some folks who've implemented a filesystem-within-a-file solution for this problem that they're very happy with. It is a Directory, and requires no changes to Lucene. I'll ask them if they're willing to contribute it, so that others can use it.

Doug

Dmitry Serebrennikov wrote:
Greetings, Luceners!

Looks like lots of good stuff is happening with the code as of late. It's great to see this momentum!
Here's some more action coming your way...


---------------------------------------------------------
We all love Lucene, but most would agree that it tends to use a very large number of file handles (a back-of-the-envelope example follows the list below).
This is especially true for applications that exhibit one or more of the factors below:
a) use a merged index over a large number of indexes
b) experience index updates concurrently with searching
c) search through unoptimized indexes
d) use high merge factor settings to speed up indexing
e) have a large number of indexed fields
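To give a feel for the arithmetic (illustrative numbers only, and assuming the current file-per-component format, where a segment is roughly seven fixed files plus one norms file per indexed field): a merged index over 5 indexes, each holding 10 unoptimized segments with 10 indexed fields, needs on the order of 5 * 10 * (7 + 10) = 850 open files just to keep a reader open, before a concurrent update or a second reader temporarily doubles that.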
For a long time I've been contemplating an idea that could drastically reduce the number of file handles Lucene needs. Now I am finally going to get a few days to make this happen (pending final approval by the powers that be), so I wanted to put out the general plan of action and seek community comment on it early. Over the next day or so, I intend to implement the changes outlined below (unless, of course, I get responses that steer me in a different direction). As I get more solid results (down to the diffs), I'll post them for further review, but the sooner I get feedback, the better the chance that I will actually be able to incorporate it. Hopefully this will result in a set of patches that solve the problems I am after and prove useful enough to the general Lucene population to be included in the tree.


So, here goes.

Lucene's indexes are built out of segments. Each segment consists of a number of files, which are written when the segment is created during indexing. Once the IndexWriter is closed, the segment files are not modified, ever (except the file that lists deleted documents, if any). The proposed change is as follows:
- add code to the IndexWriter.close() method to combine all of a segment's files into a single file, with a header that records a start offset and a length for each of the new file's components, corresponding to the individual files in current segments. This will be done in such a way that the file can contain any number of components, so that we can support evolution of the segment structure in the future. The deleted-documents file will remain separate. (A sketch of what such a writer could look like follows this list.)
- add a new segment reader, or extend the existing one, to work with this type of segment
- when this new segment reader opens its files, it can open one file object from the Directory for the combined file and then clone it for each of the files formerly in the segment. Each cloned file object would maintain its own position in the combined file and would have its own buffer, as the separate files did before. Each clone would also need to know the starting offset and length of its fragment of the combined file. (A reader sketch follows the writer sketch below.)
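To make the header idea concrete, here is a minimal sketch of the writer side. It is illustrative only: the class and method names (CompoundFileWriter, combine) are hypothetical, not existing Lucene APIs, and the layout shown (an entry count, then a name/offset/length table, then the raw bytes of each file) is just one self-describing format that can hold any number of components.

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    /**
     * Illustrative sketch only -- not an actual Lucene class.
     * Combines a segment's files into one file, laid out as:
     *   [int count][count x (UTF name, long offset, long length)][file bytes...]
     * Because the header is a self-describing table, the file can hold
     * any number of components, leaving room for the format to evolve.
     */
    public class CompoundFileWriter {

        public static void combine(File dir, List<String> names, File out)
                throws IOException {
            // Compute the header size up front so component offsets can be
            // written in a single pass (assumes ASCII component names, for
            // which writeUTF's 2-byte length prefix + bytes matches UTF-8).
            long headerSize = 4;                                  // entry count
            for (String name : names) {
                headerSize += 2 + name.getBytes(StandardCharsets.UTF_8).length;
                headerSize += 8 + 8;                              // offset + length
            }

            try (DataOutputStream os = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(out)))) {
                os.writeInt(names.size());
                long offset = headerSize;
                for (String name : names) {                       // the table
                    long len = new File(dir, name).length();
                    os.writeUTF(name);
                    os.writeLong(offset);
                    os.writeLong(len);
                    offset += len;
                }
                byte[] buf = new byte[8192];
                for (String name : names) {                       // the contents
                    try (InputStream in = new FileInputStream(new File(dir, name))) {
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            os.write(buf, 0, n);
                        }
                    }
                }
            }
        }
    }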
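And a minimal sketch of the reader-side clone idea, again with hypothetical names (ComponentInput) and plain java.io in place of Lucene's store abstractions. The point is the bookkeeping: one shared OS handle, with a per-clone position so each former file can be read independently. A real version would also give each clone its own buffer, as described above.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    /**
     * Illustrative sketch only: a view onto one component of a combined
     * file. Each instance keeps its own logical position, so many logical
     * "open files" share a single underlying OS file handle.
     */
    public class ComponentInput {
        private final RandomAccessFile shared; // one handle per segment
        private final long offset;             // where this component starts
        private final long length;             // how long it is
        private long pos = 0;                  // this view's own position

        public ComponentInput(RandomAccessFile shared, long offset, long length) {
            this.shared = shared;
            this.offset = offset;
            this.length = length;
        }

        /** Read up to len bytes from the component at this view's position. */
        public int read(byte[] b, int off, int len) throws IOException {
            long remaining = length - pos;
            if (remaining <= 0) {
                return -1;                     // end of this component
            }
            if (len > remaining) {
                len = (int) remaining;
            }
            int n;
            synchronized (shared) {            // all clones share one handle
                shared.seek(offset + pos);     // reposition before each read
                n = shared.read(b, off, len);
            }
            if (n > 0) {
                pos += n;
            }
            return n;
        }

        public void seek(long p) { pos = p; }  // position within the component
        public long length()     { return length; }

        /** A new view of the same component with an independent position. */
        @Override
        public ComponentInput clone() {
            return new ComponentInput(shared, offset, length);
        }
    }

Opening a segment would then mean reading the header once into a name-to-(offset, length) map and handing out clones, so the per-segment handle count drops from seven-plus-fields to one, plus the separate deletions file.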


Some questions to solicit feedback:
*) I don't yet know all of the classes that will need changes, but I think this can be accomplished with moderate effort in the index and maybe the store packages. Does this seem reasonable?
*) I can't see any adverse effects of this change except possibly one. Since fewer OS file handles will be used, the way OS caching is applied to Lucene indexes will change. I know that Lucene relies on OS-level file caching for a good part of its performance magic, but I lack the right experience to know what effect the proposed change will have on performance. There should be the same number of disk accesses overall, but obviously they will be concentrated in a single file and will be more spread out within it. The disk should not really thrash any more than before, since the same data will be read in the same order; it will just be in a single file rather than in several. However, if OS file caching is optimal only when a given file handle sees sequential reads, this could be a problem. Can anyone shed some light on what we can expect with this change? I am primarily interested in Solaris and Windows (NT/2000) at this time, but I'd like to know of possible impact on other OSes as well.
*) Given the above, is this a worthwhile idea? If not, can we modify it so as to limit the performance impact?


Thanks for your consideration and feedback.
Dmitry.


