http://issues.apache.org/bugzilla/show_bug.cgi?id=27743

Summary: Added support for segmented field data files
Product: Lucene
Version: CVS Nightly - Specify date in submission
Platform: All
OS/Version: All
Status: NEW
Severity: Enhancement
Priority: Other
Component: Index
AssignedTo: [EMAIL PROTECTED]
ReportedBy: [EMAIL PROTECTED]

Hello,

I would like to contribute the following enhancement, hoping that it will be as useful for you as it is for me.

For one of my applications, I need to reprocess the Documents returned by a search of a Lucene index according to some Field values (in my case, to apply an "edit distance" function to unindexed fields). Because Lucene has to load every possibly relevant document from disk into memory for this operation (*all* fields, including the ones that are irrelevant to the algorithm), the operation is extremely time-consuming. As far as I can see, there is currently no satisfying way to improve this situation other than buffering all data in RAM with a RAMDirectory. But what if the field data is simply too big to fit in RAM?

My patch handles this by splitting the monolithic ".fdt" field data file into several "data store" files: .fdt, .fd1, .fd2, and so on. These data store files are connected as a linked list, which lets you load only the part of the field data that is relevant to the current operation. You can load all field data (as in the current implementation), or only the fields from a specific interval [0;n] of data stores. Store 0 is the data in the ".fdt" file; all data stores with IDs > 0 are kept in the files ".fd1", ".fd2", and so on.

In my case, I simply cache the ".fdt" file (data store 0) in RAM (using a symbolic link to shm-/tmp) but leave all other .fd* files on disk. The .fdt file contains only the field that is relevant to my algorithm and therefore stays quite small; all the other fields go into the rather large ".fd1" file. Accessing Fields in .fdt then requires no disk I/O, which speeds things up remarkably. You can compare this feature to having multiple tables with 1..1 cardinality in a relational database instead of one big table.

My proposed enhancement requires some API additions, which I will try to explain now.

To assign a Field to a particular data store, call the new method "Field setDataStore(int)" (data store 0 is the default):

    doc.add(Field.Keyword("fieldA", "this is in data store 0"));
    doc.add(Field.Keyword("fieldB", "this is in data store 1").setDataStore(1));

In this example, fieldA is stored in ".fdt" and fieldB in ".fd1".

When you retrieve the Document object (say, docId = 123) through an IndexReader, you have the following options:

    indexReader.document(123)     loads the fields from all data stores
    indexReader.document(123, 0)  loads only the fields from data store 0
    indexReader.document(123, 1)  loads only the fields from data stores 0 and 1

The method "IndexReader.document(int n, int k)" is defined to fetch the fields from all data stores *at least* up to ID k. That way, existing IndexReader subclasses do not have to be modified: I provide an overridable method in IndexReader that simply calls document(int n).

A more concrete example is attached to this feature request as a JUnit test case, together with the patch itself. Have fun with it!
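For readers who want to see the proposed API end to end, here is a rough, self-contained sketch of how indexing and retrieval might look with the patch applied. The methods setDataStore(int) and document(int, int) are the additions proposed above; everything else is the standard Lucene API of the time, and the index path is only a placeholder:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class SegmentedFieldDataExample {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.getDirectory("/tmp/segmented-index", true);

            // Index one document whose fields are spread over two data stores.
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            Document doc = new Document();
            // Data store 0 (the ".fdt" file) is the default.
            doc.add(Field.Keyword("fieldA", "this is in data store 0"));
            // setDataStore(1) routes this field's data into the ".fd1" file (proposed API).
            doc.add(Field.Keyword("fieldB", "this is in data store 1").setDataStore(1));
            writer.addDocument(doc);
            writer.close();

            // Retrieve the document, loading only what is needed.
            IndexReader reader = IndexReader.open(dir);
            Document small = reader.document(0, 0); // fields from data store 0 only (proposed API)
            Document full  = reader.document(0);    // fields from all data stores
            System.out.println("store 0 only: " + small.get("fieldB")); // null, not loaded
            System.out.println("all stores:   " + full.get("fieldB"));
            reader.close();
        }
    }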
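The backward-compatibility point mentioned above (existing IndexReader subclasses keep working unchanged) could be realized by a default implementation along these lines; this is a sketch of the idea, not the code from the attached patch:

    // Possible default in IndexReader: subclasses that know nothing about
    // data stores inherit this method and keep working, because the data
    // store limit is ignored and all field data is loaded, exactly as before.
    public Document document(int n, int maxDataStore) throws IOException {
        return document(n);
    }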
Best regards,
Christian Kohlschuetter