[ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1426:
---------------------------------------

    Attachment: LUCENE-1426.patch

Attached patch. I think it's ready to commit... I'll wait a few days.

This factors the writing of postings into separate Format* classes. The approach I took is similar to what I did for DocumentsWriter, where there is a hierarchical consumer interface (abstract class) for each of fields, terms, docs, and positions writing. Then there's a corresponding set of concrete classes (the "codec chain") that write today's index format. There is no change to the index format.

Here are the details:

  * This only applies to postings (not stored fields, term vectors, norms, field infos).

  * Both SegmentMerger & FreqProxTermsWriter now use the same codec API to write postings. I think this is a big step forward: we now have a single set of classes that writes the postings.

  * You can't yet customize this codec chain; we can add that at some point. It's all package private.

  * I don't yet allow the codec to override SegmentInfo.files(); at some point (when I first try to make a codec that uses different files) I will add this.

I ran a quick performance test, indexing Wikipedia, and found the performance cost of this to be negligible.

The next step, which is trickier, is to modularize/genericize the classes that read from the index, and then refactor SegmentTerm{Enum,Docs,Positions} to use that codec API.

Then, finally, I want to make a codec that uses PFOR to encode postings.

> Next steps towards flexible indexing
> ------------------------------------
>
>                 Key: LUCENE-1426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1426
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1426.patch
>
>
> In working on LUCENE-1410 (PFOR compression) I tried to prototype
> switching the postings files to use PFOR instead of vInts for
> encoding.
>
> But it quickly became difficult. EG we currently mux the skip data
> into the .frq file, which messes up the int blocks. We inline
> payloads with positions which would also mess up the int blocks.
> Skipping offsets and TermInfo offsets hardwire the file pointers of
> frq & prox files yet I need to change these to block + offset, etc.
>
> Separately this thread also started up, on how to customize how Lucene
> stores positional information in the index:
>
>   http://www.gossamer-threads.com/lists/lucene/java-user/66264
>
> So I decided to make a bit more progress towards "flexible indexing"
> by first modularizing/isolating the classes that actually write the
> index format. The idea is to capture the logic of each (terms, freq,
> positions/payloads) into separate interfaces and switch the flushing
> of a new segment as well as writing the segment during merging to use
> the same APIs.
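
For readers who have not opened the patch, the hierarchical consumer idea described in the update (fields, then terms, then docs, then positions) might look roughly like the sketch below. Every class and method name here is hypothetical and chosen for illustration; they are not the Format* classes in LUCENE-1426.patch, and the "codec" only records the calls it receives instead of writing .frq/.prx files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not the API in LUCENE-1426.patch.

/** Top of the chain: one call per field, returning the consumer for its terms. */
abstract class SketchFieldsConsumer {
    abstract SketchTermsConsumer addField(String fieldName);
    abstract void finish();
}

/** One call per term in the current field. */
abstract class SketchTermsConsumer {
    abstract SketchDocsConsumer addTerm(String termText);
    abstract void finish();
}

/** One call per document containing the current term. */
abstract class SketchDocsConsumer {
    abstract SketchPositionsConsumer addDoc(int docID, int termDocFreq);
    abstract void finish();
}

/** One call per occurrence of the term in the current document. */
abstract class SketchPositionsConsumer {
    abstract void addPosition(int position);
    abstract void finish();
}

/** A stand-in "codec chain": it logs each event rather than encoding it to files. */
final class RecordingCodec extends SketchFieldsConsumer {
    final List<String> log = new ArrayList<>();

    @Override
    SketchTermsConsumer addField(String fieldName) {
        log.add("field " + fieldName);
        return new SketchTermsConsumer() {
            @Override
            SketchDocsConsumer addTerm(String termText) {
                log.add("term " + termText);
                return new SketchDocsConsumer() {
                    @Override
                    SketchPositionsConsumer addDoc(int docID, int termDocFreq) {
                        log.add("doc " + docID + " freq " + termDocFreq);
                        return new SketchPositionsConsumer() {
                            @Override void addPosition(int position) { log.add("pos " + position); }
                            @Override void finish() { }
                        };
                    }
                    @Override void finish() { }
                };
            }
            @Override void finish() { }
        };
    }

    @Override
    void finish() {
        log.add("end");
    }
}

final class CodecChainSketch {
    /** Drives the chain the way a flush or a merge would: top-down, one level at a time. */
    static List<String> writeExample() {
        RecordingCodec codec = new RecordingCodec();
        SketchTermsConsumer terms = codec.addField("body");
        SketchDocsConsumer docs = terms.addTerm("lucene");
        SketchPositionsConsumer positions = docs.addDoc(7, 2);
        positions.addPosition(3);
        positions.addPosition(11);
        positions.finish();
        docs.finish();
        terms.finish();
        codec.finish();
        return codec.log;
    }

    public static void main(String[] args) {
        for (String event : writeExample()) {
            System.out.println(event);
        }
    }
}
```

The point of the shape is that the caller (a segment flush or a merge) only ever talks to the abstract classes, so a different encoding such as PFOR could in principle be dropped in by supplying a different concrete chain.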