1. Make the index format extensible by adding user-implementable reader and writer interfaces for postings. ... Here's a very rough, sketchy, first draft of a type (1) proposal.
Nice! In approach 1, what is the best abstraction of a flexible index format for Lucene? The draft proposal seems to suggest the following (roughly):

  A dictionary entry is <Term, FilePointer>.
  A posting entry for a term in a document is <Doc, PostingContent>.
  Classes which implement PostingFormat decide the format of PostingContent.

Storing all the posting content, e.g. frequencies and positions, in a single file greatly simplifies things. However, this could carry a performance penalty. For example, the boolean query 'Apache AND Lucene' would have to paw through positions it does not need. But position indexing for Apache and Lucene is necessary to support the phrase query '"Apache Lucene"'.

Is it a good idea to allow PostingFormat to decide whether and how to store posting content in multiple files?

  A dictionary entry is <Term, <FilePointer>+>.
  A posting entry for a term in a document is <Doc, <PostingContent>+>.
  Each PostingContent is stored in a separate file.

Or is a two-file abstraction good enough? It supports all formats in approaches 2 and 3.

  A dictionary entry is <Term, FreqPointer, ProxPointer>.
  A posting entry for a term in a document is <Doc, PerDocPostingContent, <Position, PerPositionPostingContent>+>.
  Doc and PerDocPostingContent are stored in a .frq file.
  Position and PerPositionPostingContent are stored in a .prx file.

What Michael called Payload can be viewed as PerPositionPostingContent here.
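To make the two-file abstraction concrete, here is a minimal sketch. It is not taken from the draft proposal: DictEntry, TwoFilePostingWriter, and the fixed-width, length-prefixed encoding are all hypothetical names and choices made for illustration, and in-memory buffers stand in for the .frq and .prx files.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical dictionary entry: <Term, FreqPointer, ProxPointer>. */
final class DictEntry {
    final String term;
    final long freqPointer; // where this term's <Doc, PerDocPostingContent> entries start
    final long proxPointer; // where this term's <Position, PerPositionPostingContent> entries start

    DictEntry(String term, long freqPointer, long proxPointer) {
        this.term = term;
        this.freqPointer = freqPointer;
        this.proxPointer = proxPointer;
    }
}

/** Hypothetical writer; in-memory buffers stand in for the .frq and .prx files. */
final class TwoFilePostingWriter {
    final ByteArrayOutputStream frq = new ByteArrayOutputStream();
    final ByteArrayOutputStream prx = new ByteArrayOutputStream();
    final List<DictEntry> dictionary = new ArrayList<>();

    /** Records where the next term's data begins in each stream. */
    void startTerm(String term) {
        dictionary.add(new DictEntry(term, frq.size(), prx.size()));
    }

    /** Per-document data goes to .frq; per-position data goes to .prx. */
    void addPosting(int doc, byte[] perDoc, int[] positions, byte[][] perPosition) {
        writeInt(frq, doc);
        writeInt(frq, positions.length); // term frequency in this document
        writeBytes(frq, perDoc);
        for (int i = 0; i < positions.length; i++) {
            writeInt(prx, positions[i]);
            writeBytes(prx, perPosition[i]);
        }
    }

    // Fixed-width big-endian int, chosen for simplicity rather than compactness.
    private static void writeInt(ByteArrayOutputStream out, int v) {
        out.write(v >>> 24);
        out.write(v >>> 16);
        out.write(v >>> 8);
        out.write(v);
    }

    // Length-prefixed opaque content whose format a PostingFormat would own.
    private static void writeBytes(ByteArrayOutputStream out, byte[] b) {
        writeInt(out, b.length);
        out.write(b, 0, b.length);
    }
}
```

With this split, a reader for 'Apache AND Lucene' would only need to follow freqPointer into the .frq data, while a phrase query would follow both pointers.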
I'm not sure this is the best approach: it's just the first one that comes to my mind. Perhaps instead Tokens should have a list of aspects, each of which implements a TokenAspect interface, or some such.
Making Token have a list of aspects would work.

A particular analyzer would add certain types of aspects to the tokens it emits. For example, one analyzer adds a TextEmphasis aspect to a token. Another analyzer adds a PartOfSpeech aspect to the same token.

A particular posting implementation would expect certain types of aspects. For example, one may require a TextEmphasis aspect and a PartOfSpeech aspect. The posting implementation generates posting content (payload) by encoding the values of both aspects.

Ning
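The aspect idea could look roughly like the sketch below. Only the names TokenAspect, TextEmphasis, and PartOfSpeech come from the discussion; AspectToken, EmphasisPosPayloadEncoder, the fields, and the two-byte encoding are hypothetical and exist only to illustrate how a posting implementation might require certain aspects and encode them into a payload.

```java
import java.util.ArrayList;
import java.util.List;

/** Marker interface for anything an analyzer may attach to a token. */
interface TokenAspect { }

/** Hypothetical aspect: whether the token was emphasized in the source text. */
final class TextEmphasis implements TokenAspect {
    final boolean emphasized;
    TextEmphasis(boolean emphasized) { this.emphasized = emphasized; }
}

/** Hypothetical aspect: a part-of-speech tag packed into one byte. */
final class PartOfSpeech implements TokenAspect {
    final byte tag;
    PartOfSpeech(byte tag) { this.tag = tag; }
}

/** Hypothetical token that carries a list of aspects. */
final class AspectToken {
    final String text;
    private final List<TokenAspect> aspects = new ArrayList<>();

    AspectToken(String text) { this.text = text; }

    /** Called by analyzers; several analyzers may add aspects to the same token. */
    void addAspect(TokenAspect aspect) { aspects.add(aspect); }

    /** Returns the first aspect of the given type, or null if none was added. */
    <T extends TokenAspect> T getAspect(Class<T> type) {
        for (TokenAspect a : aspects) {
            if (type.isInstance(a)) {
                return type.cast(a);
            }
        }
        return null;
    }
}

/** Hypothetical posting implementation that requires both aspects. */
final class EmphasisPosPayloadEncoder {
    /** Encodes both aspect values into a two-byte payload. */
    byte[] encode(AspectToken token) {
        TextEmphasis emphasis = token.getAspect(TextEmphasis.class);
        PartOfSpeech pos = token.getAspect(PartOfSpeech.class);
        if (emphasis == null || pos == null) {
            throw new IllegalArgumentException(
                "token '" + token.text + "' lacks a required aspect");
        }
        return new byte[] { (byte) (emphasis.emphasized ? 1 : 0), pos.tag };
    }
}
```

Looking aspects up by type keeps analyzers independent of each other: each one adds its own aspect, and only the posting implementation needs to know which combination it expects.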