[jira] [Updated] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

Michael McCandless (JIRA) Wed, 18 Mar 2015 08:56:53 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated LUCENE-5879:
---------------------------------------
    Attachment: LUCENE-5879.patch

I think for now we should simply commit auto-prefix as a dark
feature (my last patch, which I hate!) ... progress not perfection.

I.e. it's available only as an optional, disabled by default, postings
format, and if you want to use it in your app you must use PerFieldPF
to enable it for certain fields.  You'll have to figure out how to
index byte[] tokens, etc.

I modernized the last patch (tons of stuff had changed) and carried
over a fix for an issue I hit in LUCENE-6005.  Patch applies to trunk,
and bumps block tree's index format.

With auto-prefix, numeric fields can simply be indexed as their
obvious byte[] encoding and you just use TermRangeQuery at search
time; TestAutoPrefix shows this.
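
For illustration only, here is a minimal sketch of one plausible "obvious" byte[] encoding for a long: 8 big-endian bytes with the sign bit flipped, so that unsigned byte-wise term order (which is how Lucene compares terms) matches numeric order. The class and method names are hypothetical, and the encoding TestAutoPrefix actually uses may differ.

```java
// Hypothetical helper, not part of the patch: a sortable byte[] encoding for long values.
final class SortableLongBytes {

    /** Encodes value as 8 big-endian bytes with the sign bit flipped. */
    static byte[] encode(long value) {
        long flipped = value ^ Long.MIN_VALUE; // negatives now sort before positives
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) flipped;
            flipped >>>= 8;
        }
        return out;
    }

    /** Inverse of encode. */
    static long decode(byte[] bytes) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (bytes[i] & 0xFF);
        }
        return v ^ Long.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison, the order a terms dict would see. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    /** What an inclusive term range check over the encoded bytes amounts to. */
    static boolean inRange(long value, long lower, long upper) {
        byte[] v = encode(value);
        return compareUnsigned(encode(lower), v) <= 0
            && compareUnsigned(v, encode(upper)) <= 0;
    }
}
```

With an encoding like this, a numeric range becomes a plain byte-order term range, which is the property that lets TermRangeQuery stand in for a dedicated numeric range query.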

There are maybe some problems with the patch: is it horrible to store
CompiledAutomaton on PrefixQuery/TermRangeQuery because of query
caching...?  Should I instead recompute it for every segment in
.getTermsEnum?  Or store it on the weight?  Hmm or in a shared
attribute, like FuzzyQuery (what a hack)?

It allocates one FixedBitSet(maxDoc) at write time, per segment, to
hold all docs matching each auto-prefix term ... maybe that's too
costly?  I could switch to more sparse impls (roaring, sparse,
BitDocIdSet.Builder?) but I suspect typically we will require fairly
dense bitsets anyway for the short prefixes.  We end up OR'ing many
terms together at write time...
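
To make the write-time cost concrete, here is a rough sketch of OR'ing the postings of every term under one prefix into a single maxDoc-sized bit set. It uses java.util.BitSet as a stand-in for FixedBitSet and an in-memory TreeMap as a stand-in for the terms dict; all names here are hypothetical and this is not the patch's code.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: collect all docs matching any term with a given prefix.
final class PrefixBitSetSketch {

    /**
     * postings maps each term to the doc ids containing it.
     * Returns a bit set with one bit set per doc that matches any term
     * starting with prefix.
     */
    static BitSet docsForPrefix(TreeMap<String, int[]> postings, String prefix, int maxDoc) {
        BitSet bits = new BitSet(maxDoc); // one maxDoc-sized allocation
        // In sorted order, terms sharing a prefix are contiguous, starting at the prefix itself.
        for (Map.Entry<String, int[]> e : postings.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the prefix's region of term space
            }
            for (int doc : e.getValue()) {
                bits.set(doc); // OR this term's docs into the prefix's doc set
            }
        }
        return bits;
    }
}
```

The shorter the prefix, the more terms fall under it and the denser the resulting set tends to be, which is why a dense bit set may be a reasonable default here.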

I created a FixedBitPostingsEnum, FixedBitTermsEnum, both package
private under oal.index, so I can send the bit set to PostingsConsumer
at write time.  Maybe there's a cleaner way?

Maybe the changes should be moved to lucene/misc or lucene/codecs, not
core?  But this would mean yet another fork of block tree...

It only works for IndexOptions.DOCS fields; I think that's fine?

The added auto-prefix terms are not seen by normal postings APIs, they
do not affect index stats, etc.  They only kick in, as an
implementation detail, when you call Terms.intersect(Automaton).  The
returned TermsEnum.term() can return an auto-prefix term, but
LUCENE-5938 improves this since we now use
MultiTermQueryConstantScoreWrapper by default.


> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860
> become feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversarial/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").
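
As a deliberately simplified illustration of the density idea in the quoted description, the sketch below picks prefixes only where enough terms share them, instead of using a fixed precisionStep. It is not the algorithm in the patch: the class name, the minTermsInPrefix parameter, and the use of strings instead of byte[] terms are all assumptions made for the example, and it ignores floor blocks entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: choose prefix terms based on how dense the terms are.
final class DensityPrefixSketch {

    /**
     * Returns, in sorted order, every proper prefix shared by at least
     * minTermsInPrefix of the given (distinct) terms.
     */
    static List<String> choosePrefixes(List<String> terms, int minTermsInPrefix) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            // Count this term under each of its proper prefixes.
            for (int len = 1; len < term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        List<String> chosen = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTermsInPrefix) {
                chosen.add(e.getKey()); // dense region: worth a prefix term
            }
        }
        return chosen;
    }
}
```

With terms clustered in one region of term space, only prefixes covering that cluster are selected; sparse regions get no prefix terms at all, which is where the index-size saving would come from.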



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

