> 1. Make the index format extensible by adding user-implementable reader
> and writer interfaces for postings.
> ...
> Here's a very rough, sketchy, first draft of a type (1) proposal.

Nice!

In approach 1, what is the best abstraction of a flexible index format
for Lucene?

The draft proposal seems to suggest the following (roughly):
 A dictionary entry is <Term, FilePointer>.
 A posting entry for a term in a document is <Doc, PostingContent>.
Classes which implement PostingFormat decide the format of PostingContent.
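
To make the single-file reading of the draft concrete, here is a minimal Java sketch. It is not the draft's actual API: only the name PostingFormat comes from the proposal, while the method signatures, the Posting and SingleFileIndexSketch classes, and the in-memory byte array standing in for the postings file are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: the implementation alone decides how PostingContent is encoded.
interface PostingFormat<C> {
    void writeContent(DataOutputStream out, C content) throws IOException;
    C readContent(DataInputStream in) throws IOException;
}

// Example format whose PostingContent is just a term frequency.
class FreqOnlyFormat implements PostingFormat<Integer> {
    public void writeContent(DataOutputStream out, Integer freq) throws IOException {
        out.writeInt(freq);
    }
    public Integer readContent(DataInputStream in) throws IOException {
        return in.readInt();
    }
}

// One posting entry: <Doc, PostingContent>.
final class Posting<C> {
    final int doc;
    final C content;
    Posting(int doc, C content) { this.doc = doc; this.content = content; }
}

class SingleFileIndexSketch<C> {
    private final PostingFormat<C> format;
    // Stands in for the single postings file (assumption: kept in memory).
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(file);
    // Dictionary entries: <Term, FilePointer>.
    private final Map<String, Long> dictionary = new LinkedHashMap<>();

    SingleFileIndexSketch(PostingFormat<C> format) { this.format = format; }

    void addTerm(String term, List<Posting<C>> postings) throws IOException {
        dictionary.put(term, (long) out.size()); // pointer = current end of file
        out.writeInt(postings.size());
        for (Posting<C> p : postings) {
            out.writeInt(p.doc);
            format.writeContent(out, p.content); // format-specific part
        }
    }

    List<Posting<C>> read(String term) throws IOException {
        List<Posting<C>> result = new ArrayList<>();
        Long pointer = dictionary.get(term);
        if (pointer == null) return result;
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file.toByteArray()));
        in.skipBytes(pointer.intValue()); // seek to the term's postings
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            int doc = in.readInt();
            result.add(new Posting<>(doc, format.readContent(in)));
        }
        return result;
    }
}
```

Swapping FreqOnlyFormat for another implementation changes what is stored per document without touching the dictionary or the read loop, which is the extensibility the proposal is after.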

Storing all the posting content, e.g. frequencies and positions, in a
single file greatly simplifies things. However, it could incur a
performance penalty. For example, the boolean query 'Apache AND Lucene'
needs only document numbers, yet it would have to read past the
positions interleaved with them. Position indexing for Apache and
Lucene is still necessary to support the phrase query '"Apache Lucene"'.

Is it a good idea to allow PostingFormat to decide whether and how to
store posting content in multiple files?
 A dictionary entry is <Term, <FilePointer>+>.
 A posting entry for a term in a document is <Doc, <PostingContent>+>.
Each PostingContent is stored in a separate file.

Or is a two-file abstraction good enough? It supports all formats in
approaches 2 and 3.
 A dictionary entry is <Term, FreqPointer, ProxPointer>.
 A posting entry for a term in a document is <Doc,
PerDocPostingContent, <Position, PerPositionPostingContent>+>.
Doc and PerDocPostingContent are stored in a .frq file.
Position and PerPositionPostingContent are stored in a .prx file.
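
A rough Java sketch of the two-file abstraction follows, to show why it helps the boolean case. All class and method names here are hypothetical, the two "files" are in-memory byte arrays, PerDocPostingContent is reduced to a frequency, and PerPositionPostingContent is omitted. The proxReads counter exists only to show which queries touch the .prx data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

class TwoFileIndexSketch {
    // Dictionary entry: <Term, FreqPointer, ProxPointer> (plus a doc count for reading).
    private static final class TermEntry {
        final int freqPointer, proxPointer, docCount;
        TermEntry(int freqPointer, int proxPointer, int docCount) {
            this.freqPointer = freqPointer;
            this.proxPointer = proxPointer;
            this.docCount = docCount;
        }
    }

    private final ByteArrayOutputStream frq = new ByteArrayOutputStream(); // stands in for .frq
    private final ByteArrayOutputStream prx = new ByteArrayOutputStream(); // stands in for .prx
    private final DataOutputStream frqOut = new DataOutputStream(frq);
    private final DataOutputStream prxOut = new DataOutputStream(prx);
    private final Map<String, TermEntry> dictionary = new HashMap<>();
    int proxReads = 0; // how many times the .prx data was consulted

    // positionsByDoc maps doc number -> positions, in ascending doc order.
    void addTerm(String term, SortedMap<Integer, int[]> positionsByDoc) throws IOException {
        dictionary.put(term, new TermEntry(frqOut.size(), prxOut.size(), positionsByDoc.size()));
        for (Map.Entry<Integer, int[]> e : positionsByDoc.entrySet()) {
            frqOut.writeInt(e.getKey());                       // Doc
            frqOut.writeInt(e.getValue().length);              // PerDocPostingContent (frequency)
            for (int pos : e.getValue()) prxOut.writeInt(pos); // Position
        }
    }

    private static DataInputStream open(ByteArrayOutputStream file, int pointer) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file.toByteArray()));
        in.skipBytes(pointer);
        return in;
    }

    // Reads the .frq data only.
    List<Integer> docs(String term) throws IOException {
        List<Integer> result = new ArrayList<>();
        TermEntry e = dictionary.get(term);
        if (e == null) return result;
        DataInputStream in = open(frq, e.freqPointer);
        for (int i = 0; i < e.docCount; i++) {
            result.add(in.readInt());
            in.readInt(); // frequency is not needed here
        }
        return result;
    }

    // Reads both files: frequencies say how many positions belong to each doc.
    int[] positions(String term, int doc) throws IOException {
        TermEntry e = dictionary.get(term);
        if (e == null) return new int[0];
        proxReads++;
        DataInputStream frqIn = open(frq, e.freqPointer);
        DataInputStream prxIn = open(prx, e.proxPointer);
        for (int i = 0; i < e.docCount; i++) {
            int d = frqIn.readInt();
            int[] pos = new int[frqIn.readInt()];
            for (int j = 0; j < pos.length; j++) pos[j] = prxIn.readInt();
            if (d == doc) return pos;
        }
        return new int[0];
    }

    // 'a AND b': intersects doc lists without touching positions.
    List<Integer> and(String a, String b) throws IOException {
        List<Integer> result = new ArrayList<>(docs(a));
        result.retainAll(docs(b));
        return result;
    }

    // '"first second"': needs positions to check adjacency.
    List<Integer> phrase(String first, String second) throws IOException {
        List<Integer> result = new ArrayList<>();
        for (int doc : and(first, second)) {
            if (hasAdjacent(positions(first, doc), positions(second, doc))) result.add(doc);
        }
        return result;
    }

    private static boolean hasAdjacent(int[] p1, int[] p2) {
        for (int x : p1) for (int y : p2) if (y == x + 1) return true;
        return false;
    }
}
```

In this sketch the conjunction never opens the position data, while the phrase query does, which is the trade-off the two-file layout is meant to capture.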

What Michael called Payload can be viewed as PerPositionPostingContent here.


> I'm not sure this is the best approach: it's just the first one that
> comes to my mind.  Perhaps instead Tokens should have a list of aspects,
> each of which implement a TokenAspect interface, or somesuch.

Making Token have a list of aspects would work. A particular analyzer
would add certain types of aspects to the tokens it emits. For
example, one analyzer adds a TextEmphasis aspect to a token. Another
analyzer adds a PartOfSpeech aspect to the same token. A particular
posting implementation would expect certain types of aspects. For
example, one may require a TextEmphasis aspect and a PartOfSpeech
aspect. The posting implementation generates posting content (payload)
by encoding the values of both aspects.
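
The aspect idea could look roughly like the Java sketch below. Only the names TokenAspect, TextEmphasis and PartOfSpeech come from this thread. AspectToken is a stand-in for Token rather than Lucene's class, and the enum values, the encoder class and the one-byte payload layout are assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;

// Marker interface for anything an analyzer may attach to a token.
interface TokenAspect {}

// Two illustrative aspects; the concrete values are assumptions.
enum TextEmphasis implements TokenAspect { NONE, BOLD, ITALIC }
enum PartOfSpeech implements TokenAspect { NOUN, VERB, OTHER }

// Stand-in for Token: text plus a list of aspects.
class AspectToken {
    final String text;
    private final List<TokenAspect> aspects = new ArrayList<>();

    AspectToken(String text) { this.text = text; }

    // Each analyzer in the chain adds the aspects it knows about.
    AspectToken add(TokenAspect aspect) {
        aspects.add(aspect);
        return this;
    }

    // Returns the first aspect of the requested type, or null if absent.
    <T extends TokenAspect> T get(Class<T> type) {
        for (TokenAspect a : aspects) {
            if (type.isInstance(a)) return type.cast(a);
        }
        return null;
    }
}

// A posting implementation that requires both aspects and packs them
// into a one-byte payload: high nibble = emphasis, low nibble = part of speech.
class EmphasisPosPayloadEncoder {
    static byte encode(AspectToken token) {
        TextEmphasis emphasis = token.get(TextEmphasis.class);
        PartOfSpeech pos = token.get(PartOfSpeech.class);
        if (emphasis == null || pos == null) {
            throw new IllegalArgumentException("token lacks a required aspect: " + token.text);
        }
        return (byte) ((emphasis.ordinal() << 4) | pos.ordinal());
    }

    static TextEmphasis emphasis(byte payload) {
        return TextEmphasis.values()[(payload >> 4) & 0x0F];
    }

    static PartOfSpeech partOfSpeech(byte payload) {
        return PartOfSpeech.values()[payload & 0x0F];
    }
}
```

A token that has passed through both analyzers, e.g. new AspectToken("Lucene").add(TextEmphasis.BOLD).add(PartOfSpeech.NOUN), can be encoded, whereas a token missing either aspect is rejected by this particular posting implementation.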


Ning
