http://issues.apache.org/bugzilla/show_bug.cgi?id=27743

Added support for segmented field data files

           Summary: Added support for segmented field data files
           Product: Lucene
           Version: CVS Nightly - Specify date in submission
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Enhancement
          Priority: Other
         Component: Index
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


Hello, 
 
I would like to contribute the following enhancement, hoping that it would be 
as useful for you as it is for me. 
 
For one of my applications, it was necessary to reprocess the Documents 
returned by a search in a Lucene index according to some Field values (for 
applying an "edit distance" function on unindexed fields, in my case). 
 
For this operation, Lucene has to load every possibly relevant document (*all* 
fields, including the ones that are irrelevant for the algorithm) from disk 
into memory, which is extremely time-consuming. 
 
As far as I can see, there is currently no satisfactory way to improve this 
situation other than buffering all data in RAM using a RAMDirectory. 
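
For reference, that workaround essentially means keeping the whole index, 
stored field data included, in a RAMDirectory, for example by building the 
index in memory in the first place. A rough sketch (field names and values 
are made up for illustration): 

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class RamBufferedIndexExample {
  public static void main(String[] args) throws IOException {
    // The whole index, including all stored field data, lives in memory.
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true);

    Document doc = new Document();
    doc.add(Field.Keyword("id", "42"));
    // A stored-only field: exactly the kind of data that blows up RAM usage.
    doc.add(Field.UnIndexed("payload", "some large, unindexed field value"));
    writer.addDocument(doc);
    writer.close();
  }
}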
 
But what if the field data is just too big to fit in RAM? 
 
My patch addresses this by splitting the monolithic ".fdt" field data file 
into several "data store" files: .fdt, .fd1, .fd2, and so on. 
 
These "data store" files are connected as a linked-list which permits you to 
load only the part of the field data that is relevant for the current 
operation. 
 
So you can load all field data (as in the current implementation), or only the 
fields from a specific interval [0;n] of data stores. Data store 0 represents 
the data in the ".fdt" file; all data stores with ids > 0 are represented by 
the files ".fd1", ".fd2", and so on. 
 
In my case, I would then simply cache the ".fdt" file (data store 0) in RAM 
(using a symbolic link to shm-/tmp), but leave all other .fd* files on hard 
disk. The .fdt file then contains only the field that is relevant for my 
algorithm (and therefore remains quite small); all the other fields are stored 
in the rather big ".fd1" file. Accessing fields in .fdt thus requires no disk 
I/O, which speeds things up considerably. 
 
You can compare this feature with having multiple tables in a relational 
database that are linked with 1..1 cardinality instead of having one big 
table. 
 
My proposed enhancement requires some API additions, which I will explain now. 
 
To specify the desired data store for a Field, simply call the new method 
"Field setDataStore(int)" (data store 0 is the default): 
doc.add(Field.Keyword("fieldA", "this is in docstore 0")); 
doc.add(Field.Keyword("fieldB", "this is in docstore 1").setDataStore(1)); 
 
In this example, fieldA would be stored in ".fdt"; fieldB in ".fd1". 
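
Putting it together, indexing with the proposed API could look roughly like 
this (a sketch against the current Lucene API; the index path and class name 
are made up, and setDataStore(int) is the method added by the patch): 

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class DataStoreIndexingExample {
  public static void main(String[] args) throws IOException {
    // Create a new index in a (made-up) directory.
    IndexWriter writer = new IndexWriter("/tmp/testindex", new StandardAnalyzer(), true);

    Document doc = new Document();
    // fieldA stays in the default data store 0, i.e. the ".fdt" file.
    doc.add(Field.Keyword("fieldA", "this is in data store 0"));
    // fieldB is redirected to data store 1, i.e. the ".fd1" file.
    doc.add(Field.Keyword("fieldB", "this is in data store 1").setDataStore(1));

    writer.addDocument(doc);
    writer.close();
  }
}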
 
When you retrieve the Document object (say, for docId = 123) using an 
IndexReader, you have the following options: 
"indexReader.document(123)" loads all fields from all data stores. 
"indexReader.document(123, 0)" loads only the fields from data store 0. 
"indexReader.document(123, 1)" loads only the fields from data stores 0 
and 1. 
 
The method "IndexReader.document(int n, int k)" is defined to fetch all fields 
from all data stores *at least* up to ID k. That way, existing IndexReader 
subclasses do not have to be modified, as I provide an overridable method in 
IndexReader which simply calls document(int n). 
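
Retrieval could then look roughly like this (again only a sketch; the index 
path, variable names and docId 123 are just examples, and document(int, int) 
is the overload added by the patch): 

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DataStoreRetrievalExample {
  public static void main(String[] args) throws IOException {
    IndexReader reader = IndexReader.open("/tmp/testindex");

    // All fields from all data stores (same behaviour as today):
    Document allFields = reader.document(123);

    // Only the fields kept in data store 0 (the ".fdt" file):
    Document store0 = reader.document(123, 0);

    // Fields from data stores 0 and 1 (".fdt" and ".fd1"):
    Document storesUpTo1 = reader.document(123, 1);

    System.out.println(store0.get("fieldA"));
    reader.close();
  }
}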
 
A more concrete example is attached to this feature request as a JUnit test 
case, along with the patch itself. 
 
Have fun with it! 
 
 
Best regards, 
 
Christian Kohlschuetter
