http://issues.apache.org/bugzilla/show_bug.cgi?id=27743
[PATCH] Added support for segmented field data files and cached directories

------- Additional Comments From [EMAIL PROTECTED] 2004-03-18 21:07 -------

Doug, thanks for your reply. I think I should explain some of the background of this patch.

The main reason for writing this patch was to support applying functions to field values that are independent of an upstream index but dependent on the entered query. In my application, I do use an index (accessed through TermEnum/TermDocs) to reduce the number of returned documents K to a fraction of all documents N. The returned set (probably multiple terms per document) then needs to be reprocessed against the entered query (which may itself consist of multiple terms). After reprocessing, the resulting set R is much smaller than the set of initially returned documents K (|R| << |K|), where R is a subset of K. This procedure can be compared to something like the following in the SQL world:

    SELECT TextValue FROM table1
    WHERE IndexValue = 'FOOBAR'
      AND DISTANCE_FUNCTION(TextValue, 'Query String') < 0.4

An index-based solution would exist if DISTANCE_FUNCTION depended only on stored columns ("functional indexes", as in PostgreSQL), but in this case I see no other way than applying some function to every returned document (something like O(|K|)).

Unfortunately, my initial dataset (the monolithic .FDT file) was far too big (gigabytes) to fit into a RAMDirectory, so seeking and reading from the hard disk must be included in the calculations. So I came up with the idea of "partitioning" the field data file: partition ("dataStore") 0 would be small enough to fit into RAM (no seek time when skipping from one document to another). Partition 1 will only fit on my slow hard disk, but from this partition I only need the data belonging to the documents in R, not all of K.
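The filter-then-rerank flow described above could be sketched as follows. This is only an illustrative, self-contained sketch (it does not use Lucene's API): the candidate set K is assumed to have come from an index lookup, and a normalized edit distance stands in for the hypothetical DISTANCE_FUNCTION; the class and method names are my own invention.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DistanceRerank {
    // Normalized Levenshtein distance in [0, 1]: 0 = identical strings.
    // Stands in for DISTANCE_FUNCTION; any query-dependent metric would do.
    static double distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 0.0 : (double) prev[b.length()] / max;
    }

    // Reprocess the candidate set K against the query: keep only the
    // documents whose stored field value is close enough, yielding R.
    static List<String> rerank(List<String> k, String query, double threshold) {
        List<String> r = new ArrayList<>();
        for (String text : k)
            if (distance(text, query) < threshold) r.add(text);
        return r;
    }

    public static void main(String[] args) {
        // K: candidates already narrowed down by the index lookup.
        List<String> k = Arrays.asList("foobar", "foobaz", "entirely different");
        System.out.println(rerank(k, "foobar", 0.4));
    }
}
```

The O(|K|) cost of this loop is exactly why the stored field data for the K candidates needs to be cheap to read.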
That way, I am still at linear cost, but without seeking (and this yields a remarkable speedup in search time when, for example, you have to look at 200,000 entries K out of 5,000,000 entries N just to get some 100 entries R as the final result, and this per user, with lots of simultaneous requests). Perhaps something like this could also be implemented using the new TermVector support somehow, but I have not thought about it in detail yet.

Regarding compression issues, I would say there would be no benefit in compressing the field data values, as they are not traversed sequentially.

However, another application for "partitioned" field data files would simply be the shared storage of document information. You could have partition 0 on a RAMDirectory, partition 1 on a local FSDirectory and partition 2 on an NFS-mounted FSDirectory, for example. A partition would then only be accessed if necessary (if the user wants more detailed information about a document, for example the original HTML document along with all the other information, as in Google Cache). Currently, I would store that data outside Lucene and have a filename or URI as a field value pointing to the data ("CLOB"). With the new feature, a simple indexReader.document(docNo, 2).get("FieldName") would be enough.
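The access pattern behind the partitioning idea can be sketched like this. Again a self-contained toy, not the patch's actual implementation: partition 0 is modeled as an in-RAM map, partition 1 as a store where each read would pay a seek, and the class and counter names are my own. The point is that a scan of all K candidates only touches partition 0, and partition 1 is hit just |R| times.

```java
import java.util.HashMap;
import java.util.Map;

public class TieredFieldStore {
    // Partition 0: small per-document fields, held entirely in RAM.
    private final Map<Integer, String> partition0 = new HashMap<>();
    // Partition 1: large per-document fields; in the real setup this would
    // live on a slow disk (or NFS) and every get() would pay a seek.
    private final Map<Integer, String> partition1 = new HashMap<>();
    int slowReads = 0; // counts accesses to the slow partition

    void add(int docNo, String summary, String fullText) {
        partition0.put(docNo, summary);
        partition1.put(docNo, fullText);
    }

    String summary(int docNo) {            // cheap: RAM only
        return partition0.get(docNo);
    }

    String fullText(int docNo) {           // expensive: touches the slow tier
        slowReads++;
        return partition1.get(docNo);
    }

    public static void main(String[] args) {
        TieredFieldStore store = new TieredFieldStore();
        for (int i = 0; i < 1000; i++)
            store.add(i, "sum" + i, "full" + i);

        int hits = 0;
        for (int i = 0; i < 1000; i++)            // scan all K candidates in RAM
            if (store.summary(i).endsWith("7")) { // stand-in for the distance filter
                store.fullText(i);                // only the R survivors hit partition 1
                hits++;
            }
        // 1000 candidates scanned, but only 100 slow reads performed.
        System.out.println(hits + " hits, " + store.slowReads + " slow reads");
    }
}
```

Replacing the maps with a RAMDirectory and FSDirectory per partition gives the layout described above; the per-document lookup by partition number corresponds to the proposed indexReader.document(docNo, partition) call.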