On 10/02/10 09:47, Uwe Schindler wrote:
> Positions as attributes would be good. For positions we need a new Attribute
> (not PositionIncrement), but e.g. for offsets and payloads we can use the
> standard attributes from the analysis, which is really cool. This would also
> make it possible to add all custom attributes from the analysis phase to the
> posting list and make them visible in the TermDocs enum. In my opinion, there
> should be no DocsEnum, DocsAndPositionsEnum and so on, just one class
> that differs only in the attributes it provides. So if you want the payloads,
> ask for a standard DocsEnum and pass the requested attribute classes as
> parameters:
>
>   IndexReader.termDocsEnum(Bits skipDocs, String field, BytesRef term,
>                            Class<? extends Attribute>... atts)
>
> If somebody wants offsets and payloads:
>
>   reader.termDocsEnum(skipDocs, "field", term, OffsetAttribute.class,
>                       PayloadAttribute.class);
I kind of like this idea. This interface for iterating over the postings
looks more flexible, and in my opinion it will be easy to use with
any "home-brewed" codec.
Read optimisations based on what the user needs, such as the current
termDocsEnum and termPositionsEnum (where the first reads only the freq
file and the second also reads the prox file), will be done under
the hood by the respective PostingReader. Given the set of Attribute
classes it receives, the PostingReader knows what it needs to read and what
it can skip. This also simplifies the interface for the user, who no
longer has to take care of choosing the
right enum.
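To make that concrete, here is a minimal sketch of how a PostingReader could
pick the files to open from the requested attribute classes. It is not Lucene
code: PositionAttribute, PostingsFile and filesToRead are hypothetical names,
the attribute interfaces are stand-ins for the real ones, and I am assuming
that anything beyond docs and freqs is stored with the positions in the prox
file.

```java
import java.util.EnumSet;

public class AttributePostingsSketch {

    // Stand-ins for the analysis attributes; these are NOT the Lucene interfaces.
    interface Attribute {}
    interface PositionAttribute extends Attribute {}  // hypothetical new attribute for positions
    interface OffsetAttribute extends Attribute {}
    interface PayloadAttribute extends Attribute {}

    // Hypothetical names for the two postings files mentioned above.
    enum PostingsFile { FREQ, PROX }

    // Hypothetical reader: decides what to read from the requested attributes.
    static final class PostingReader {
        @SafeVarargs
        static EnumSet<PostingsFile> filesToRead(Class<? extends Attribute>... atts) {
            // Docs and freqs are always needed to iterate the postings.
            EnumSet<PostingsFile> files = EnumSet.of(PostingsFile.FREQ);
            for (Class<? extends Attribute> att : atts) {
                // Assumption: positions, offsets and payloads all live in the prox file.
                if (att == PositionAttribute.class
                        || att == OffsetAttribute.class
                        || att == PayloadAttribute.class) {
                    files.add(PostingsFile.PROX);
                }
            }
            return files;
        }
    }

    public static void main(String[] args) {
        // No attributes requested: behaves like today's termDocsEnum.
        System.out.println(PostingReader.filesToRead());  // prints [FREQ]
        // Offsets and payloads requested: the prox file is read as well.
        System.out.println(PostingReader.filesToRead(
                OffsetAttribute.class, PayloadAttribute.class));  // prints [FREQ, PROX]
    }
}
```

The caller only states which attributes it wants; the choice between the
"docs only" and "docs and positions" read paths stays inside the reader.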
> I am not sure if this is very good in Lucene as it would break lots of apps.
> E.g. simple autocompletes use a PrefixTerm(s)Enums, but must use the top-level
> reader or they have to emulate merging of all TermsEnums themselves. A second
> problem (currently) is rewrites (e.g. Fuzzy) to BooleanQuery for MTQs. They
> operate on the top level reader.
>
> So I propose "simple" and not so performant Enums for MultiReaders. In my
> opinion, it would also be possible without ProxyAttributes, if we simply copy
> them around. It’s a performance problem, but if somebody needs speed,
> segment-level enums should be used (and search does this by the way).
Could you provide pointers to search code that uses the segment-level
enums?
As I explained in my last answer to Michael, the TermScorer uses the
DocsEnum interface, and therefore does not know whether it is working with a
segment-level enum or a Multi*Enum. Which search code (or query operators) in
Lucene uses segment-level enums?
Cheers
--
Renaud Delbru