[ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-5879:
---------------------------------------
    Attachment: LUCENE-5879.patch

Initial work-in-progress patch: tests do NOT consistently pass, there are still bugs/corner cases (e.g. when an auto-prefix term is the last term in a block...).

This change very much requires LUCENE-5268 (pull API for postings format) which I'd like to backport to 4.x along with this: it makes an initial pass through all terms to identify "good" prefixes, using the same algorithm block tree uses to assign terms to blocks, just with different block sizes. I haven't picked defaults yet, but e.g. you could state that an auto prefix term should expand to 100-200 terms and then the first pass picks prefix terms to handle that.

The problem is inherently over-constrained: a given set of prefixes like fooa*, foob*, fooc*, etc. may have too few terms each, but then their common prefix foo* would have way too many. For this case it creates "floored" prefix terms, e.g. foo[a-e]*, foo[f-p]*, foo[q-z]*.

On the 2nd pass, when it writes the actual terms, it inserts these auto-prefix terms at the right places. Currently it only works for DOCS_ONLY fields, and it uses a FixedBitSet(maxDoc) when writing each prefix term.

These auto-prefix terms are fully hidden from all the normal Terms/Enum APIs, statistics, etc. They are only used in Terms.intersect, if you pass a new flag allowing them to be used.

I haven't done anything about the document / searching side of things: this is just a low level change at this point, for the terms dict. Maybe we need a new FieldType boolean "computeAutoPrefixTerms" or some such; currently it's just exposed as additional params to the block tree terms dict writer.

I think this would mean NumericRangeQuery/Filter can just rewrite to ordinary TermRangeQuery/Filter, and the numeric fields just become sugar for encoding their numeric values as sortable binary terms.
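To make the first pass more concrete, here is a minimal Python sketch of the prefix selection described above. It is not the patch's code and it does not reproduce block tree's actual block-assignment algorithm: it is a simplified recursive version, it counts raw terms under each prefix, and the names `pick_auto_prefixes`, `format_prefix`, `min_terms` and `max_terms` are hypothetical, chosen only for illustration.

```python
from itertools import groupby


def pick_auto_prefixes(terms, min_terms, max_terms, prefix=""):
    """Pick prefix terms for a sorted list of unique terms sharing `prefix`.

    Returns (prefix, first_char, last_char) tuples. first_char and last_char
    are None for a plain prefix such as foo*; otherwise they give the
    inclusive range of the next character, as in the floored foo[a-e]*.
    Assumes min_terms <= max_terms. Illustration only, not Lucene code.
    """
    if len(terms) < min_terms:
        return []  # too few terms here to justify a prefix term

    depth = len(prefix)
    # Group the terms that extend the prefix by their next character.
    longer = [t for t in terms if len(t) > depth]
    children = [(ch, list(g)) for ch, g in groupby(longer, key=lambda t: t[depth])]

    out = []
    for ch, group in children:
        out.extend(pick_auto_prefixes(group, min_terms, max_terms, prefix + ch))

    if len(children) == 1 and len(longer) == len(terms):
        # Exactly the same term set as the single child prefix: nothing to add.
        return out

    if len(terms) <= max_terms:
        out.append((prefix, None, None))  # plain prefix, e.g. foo*
        return out

    # Over-constrained case: too many terms for one prefix term. Greedily
    # group consecutive next-characters until each group has at least
    # min_terms terms. A group may still exceed max_terms, and a term equal
    # to the prefix itself is not covered by any floored range.
    ranges = []  # each entry: [first_char, last_char, term_count]
    current = None
    for ch, group in children:
        if current is None:
            current = [ch, ch, 0]
        current[1] = ch
        current[2] += len(group)
        if current[2] >= min_terms:
            ranges.append(current)
            current = None
    if current is not None and ranges:
        # Fold a too-small tail into the previous range.
        ranges[-1][1] = current[1]
        ranges[-1][2] += current[2]
    for first, last, _count in ranges:
        # foo[a-a]* is the same as fooa*, which the recursion already handled.
        if first != last:
            out.append((prefix, first, last))
    return out


def format_prefix(entry):
    """Render a picked entry in the notation used above."""
    prefix, first, last = entry
    if first is None:
        return prefix + "*"
    return "%s[%s-%s]*" % (prefix, first, last)


if __name__ == "__main__":
    # 9 sub-prefixes fooa..fooi with 3 terms each: every fooX* is too small,
    # foo* is too large, so floored prefix terms are produced.
    demo_terms = sorted("foo" + c + d for c in "abcdefghi" for d in "012")
    for entry in pick_auto_prefixes(demo_terms, min_terms=5, max_terms=10):
        print(format_prefix(entry))
```

With 5 to 10 terms per prefix and the demo input, every fooX* has only 3 terms while foo* has 27, so the sketch falls back to floored ranges under foo. The real writer would then insert the chosen prefix terms at the right places during the second pass, which this sketch does not attempt.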
> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed. Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long. So optos like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversary/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").