     [ http://issues.apache.org/jira/browse/LUCENE-196?page=all ]

Otis Gospodnetic resolved LUCENE-196:
-------------------------------------

    Resolution: Duplicate
     Assign To:     (was: Lucene Developers)

Thanks Christian.  I think LUCENE-545 now provides the solution for selective field loading.

> [PATCH] Added support for segmented field data files and cached directories
> ---------------------------------------------------------------------------
>
>          Key: LUCENE-196
>          URL: http://issues.apache.org/jira/browse/LUCENE-196
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Index
>     Versions: CVS Nightly - Specify date in submission
>  Environment: Operating System: All
> Platform: All
>     Reporter: Christian Kohlschütter
>     Priority: Minor
>  Attachments: docStore-patch.txt, docStore-test-patch.txt, docStore-test-patch.txt, docStore-test-patch.txt, newDocStore-patch.txt, newDocStore-test-patch.txt
>
> Hello,
>
> I would like to contribute the following enhancement, hoping that it will
> be as useful for you as it is for me.
>
> For one of my applications, it was necessary to reprocess the Documents
> returned by a search in a Lucene index according to some Field values (for
> applying an "edit distance" function on unindexed fields, in my case).
>
> Because Lucene has to load every possibly relevant document (*all* fields,
> including the ones that are irrelevant to the algorithm) from disk into
> memory for this operation, doing so is extremely time-consuming.
>
> As far as I can see, there is currently no satisfactory way to improve
> this situation except buffering all data in RAM using a RAMDirectory.
>
> But what if the field data is just too big to fit in RAM?
>
> My patch handles this by splitting the monolithic "*.fdt" field data file
> into several "data store" files: .fdt, .fd1, .fd2 and so on.
>
> These "data store" files are connected as a linked list, which permits you
> to load only the part of the field data that is relevant for the current
> operation.
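The file naming described above can be pictured with a small sketch. It is an illustration only, not code from the attached patch: the class `DataStoreFiles`, its two methods, and the segment name "_a" are hypothetical, and how the patch names stores with IDs of 10 or more is an assumption here.

```java
/** Illustrative naming helper; hypothetical, not part of Lucene or of the attached patch. */
final class DataStoreFiles {
    private DataStoreFiles() {}

    /** Store 0 maps to "fdt"; store k > 0 maps to "fd" + k (assumed to hold for k >= 10 too). */
    static String extensionFor(int storeId) {
        if (storeId < 0) {
            throw new IllegalArgumentException("storeId must be >= 0: " + storeId);
        }
        return storeId == 0 ? "fdt" : "fd" + storeId;
    }

    /** Joins a segment name and the store's extension, e.g. segment "_a" and store 1. */
    static String fileName(String segment, int storeId) {
        return segment + "." + extensionFor(storeId);
    }
}
```

With these assumptions, store 0 of a segment named "_a" would live in "_a.fdt" and store 1 in "_a.fd1".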
>
> So, you can load all field data (as in the current implementation), or the
> fields from a specific interval [0;n] of data stores. Store 0 represents
> the data in the ".fdt" file; all data stores with ids > 0 are represented
> by the files ".fd1", ".fd2", and so on.
>
> In my case, I would then simply cache the ".fdt" (data store 0) file in RAM
> (using a symbolic link to shm-/tmp), but leave all other .fd* files on
> hard disk. The .fdt file contains only the field relevant to my algorithm
> (and therefore remains quite small); all the other fields are stored in the
> rather big ".fd1" file. So, accessing Fields in .fdt requires no disk I/O,
> which speeds things up remarkably.
>
> You can compare this feature with having multiple tables in a relational
> database that are linked with 1..1 cardinality instead of having one big
> table.
>
> My proposed enhancement requires some API additions, which I will try to
> explain now.
>
> To specify the desired data store for a Field, simply call the new method
> "Field setDataStore(int)" (docstore 0 is the default):
>   doc.add(Field.Keyword("fieldA", "this is in docstore 0"));
>   doc.add(Field.Keyword("fieldB", "this is in docstore 1").setDataStore(1));
>
> In this example, fieldA would be stored in ".fdt"; fieldB in ".fd1".
>
> When you retrieve the Document object (example docId = 123) using an
> IndexReader, you have the following options:
>   "indexReader.document(123)" would load all fields from all data stores.
>   "indexReader.document(123, 0)" would load only the fields from data
>   store 0.
>   "indexReader.document(123, 1)" would explicitly load only the fields from
>   data stores 0 and 1.
>
> The method "IndexReader.document(int n, int k)" is defined to fetch all
> fields from all data stores *at least* up to ID k. That way, existing
> IndexReader subclasses do not have to be modified, as I provide an
> overridable method in IndexReader which simply calls document(int n).
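The "at least up to ID k" contract can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the patch's implementation and not Lucene's API: `ToyField`, `ToyReader`, and `SelectiveToyReader` are hypothetical stand-ins, and documents are plain lists rather than on-disk stores.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a stored field; hypothetical, not Lucene's Field class. */
final class ToyField {
    final String name;
    final String value;
    final int dataStore; // 0 corresponds to ".fdt", 1 to ".fd1", and so on

    ToyField(String name, String value, int dataStore) {
        this.name = name;
        this.value = value;
        this.dataStore = dataStore;
    }
}

/** Hypothetical base reader mirroring the described contract. */
abstract class ToyReader {
    /** Loads every field of document n from every data store. */
    abstract List<ToyField> document(int n);

    /**
     * Must return the fields of all stores at least up to maxStore.
     * The default loads everything, so existing subclasses stay valid.
     */
    List<ToyField> document(int n, int maxStore) {
        return document(n);
    }
}

/** A store-aware reader that overrides the default to skip higher stores. */
final class SelectiveToyReader extends ToyReader {
    private final List<List<ToyField>> docs;

    SelectiveToyReader(List<List<ToyField>> docs) {
        this.docs = docs;
    }

    @Override
    List<ToyField> document(int n) {
        return new ArrayList<>(docs.get(n));
    }

    @Override
    List<ToyField> document(int n, int maxStore) {
        List<ToyField> loaded = new ArrayList<>();
        for (ToyField f : docs.get(n)) {
            if (f.dataStore <= maxStore) {
                loaded.add(f); // only stores 0..maxStore are touched
            }
        }
        return loaded;
    }
}
```

In this model, a subclass that never overrides the two-argument method still satisfies the contract, because returning every field is a superset of what was requested; only a store-aware subclass gains the I/O savings.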
>
> A more concrete example is attached to this feature request as a
> JUnit test case, as well as the patch itself.
>
> Have fun with it!
>
>
> Best regards,
>
> Christian Kohlschuetter

-- 
This message is automatically generated by JIRA.