[ https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404574#comment-16404574 ]
Alan Woodward commented on LUCENE-8196:
---------------------------------------
I moved the approximation and reset() from IntervalIterator, and it turned out
that ConjunctionIntervalIterator was a convenient place to put them, so I kept
that in. I also cleaned up the utility classes (no need to check for
TwoPhaseIterator or BitSetIterator when we know that the leaves are always
PostingsEnum) and moved to org.apache.lucene.search.intervals.
Payload matches can be added trivially at a later point. Payload scoring will
be more interesting, but I'm not entirely happy with the way scoring works at
the moment anyway; I need to read around a bit more on good ways of scoring
proximity queries in general. Offsets may not be trivial to add, as we have
the same problem here that we originally had with Spans, in that some of the
algorithms advance leaf intervals before returning. Definitely something to
consider later on.
The PQ specialization is just lifted directly from DisjunctionDISIApproximation
in the existing core search package, so it wasn't my conclusion :)
Re extractTerms() - I don't really like this method being here in the first
place (see my comment about scoring above). Let's keep it simple for now; you
can always filter afterwards.
[~jim.ferenczi] - where can I remove null checks? I think I'm constrained by
having to return [-1..-1] when the iterator has moved to a new doc but hasn't
been advanced over intervals yet.
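
To make the [-1..-1] constraint concrete, here is a minimal toy sketch of an
iterator over single-term intervals. It is not code from the patch: the class
name, the reset(int[]) signature and the array-backed positions are all
assumptions made for illustration. The only point is that start() and end()
must report -1 between moving to a new document and the first call to
nextInterval().

```java
/**
 * Toy model only (hypothetical names, not the patch's classes): an iterator
 * over the positions of one term in the current document.
 */
public class SentinelIntervalSketch {
    /** Assumed sentinel for "iterator exhausted on this document". */
    static final int NO_MORE_INTERVALS = Integer.MAX_VALUE;

    private int[] positions = new int[0];
    private int idx = -1; // -1 means "positioned on a doc, not yet on an interval"

    /** Called when the iterator moves to a new document. */
    void reset(int[] positionsInDoc) {
        positions = positionsInDoc;
        idx = -1;
    }

    /** Returns -1 before the first nextInterval() call on the current doc. */
    int start() {
        if (idx < 0) {
            return -1;
        }
        return idx < positions.length ? positions[idx] : NO_MORE_INTERVALS;
    }

    /** Single-term intervals have width one, so end equals start. */
    int end() {
        return start();
    }

    /** Advances to the next interval and returns its start. */
    int nextInterval() {
        if (idx < positions.length) {
            idx++;
        }
        return start();
    }

    public static void main(String[] args) {
        SentinelIntervalSketch it = new SentinelIntervalSketch();
        it.reset(new int[] {3, 7});
        System.out.println("[" + it.start() + ".." + it.end() + "]"); // [-1..-1]
        System.out.println(it.nextInterval()); // 3
        System.out.println(it.nextInterval()); // 7
    }
}
```

In this sketch the unpositioned state is carried by a plain int index rather
than a nullable reference, which is one way the sentinel and a null check can
overlap.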
The disjunction order is to address LUCENE-7398. If a disjunction appears
within a block, then sorting for minimum intervals can miss valid matches. The
Vigna paper doesn't seem to discuss this case anywhere. One possible solution
that would work in all cases would be to add a boolean to
IntervalsSource.getIntervals() that indicates whether or not the source is in a
final position - if it is, then it should return minimal intervals, otherwise
it should return wider ones. This would only be applicable to disjunctions.
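
As a toy illustration of the missed-match case (plain Java, no Lucene classes;
the Interval type and both helper methods are hypothetical), take the document
"a b c" and the query BLOCK(OR("a", "a b"), "c"). The disjunction produces
[0..0] for "a" and [0..1] for "a b". Minimization drops [0..1] because it
contains [0..0], and the block then fails because "c" at position 2 is not
adjacent to [0..0].

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: intervals are closed ranges of term positions. */
public class DisjunctionInBlockSketch {

    static final class Interval {
        final int start;
        final int end;

        Interval(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Keeps only intervals that do not strictly contain another interval. */
    static List<Interval> minimize(List<Interval> in) {
        List<Interval> out = new ArrayList<>();
        for (Interval a : in) {
            boolean containsOther = false;
            for (Interval b : in) {
                boolean sameRange = a.start == b.start && a.end == b.end;
                if (!sameRange && a.start <= b.start && b.end <= a.end) {
                    containsOther = true;
                }
            }
            if (!containsOther) {
                out.add(a);
            }
        }
        return out;
    }

    /** True if some interval ends immediately before the next term's position. */
    static boolean blockMatches(List<Interval> first, int nextPos) {
        for (Interval i : first) {
            if (i.end + 1 == nextPos) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Document "a b c": a=0, b=1, c=2. Query: BLOCK(OR("a", "a b"), "c").
        List<Interval> disjunction = new ArrayList<>();
        disjunction.add(new Interval(0, 0)); // "a"
        disjunction.add(new Interval(0, 1)); // "a b"
        int cPos = 2;

        // Minimal intervals only: [0..1] is dropped, so the block is missed.
        System.out.println(blockMatches(minimize(disjunction), cPos)); // false
        // Wider intervals kept: [0..1] followed by c at 2 matches.
        System.out.println(blockMatches(disjunction, cPos)); // true
    }
}
```

The proposed boolean would, in effect, choose between the two calls in main():
minimized output when the source is in a final position, unminimized output
otherwise.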
> Add IntervalQuery and IntervalsSource to expose minimum interval semantics
> across term fields
> ---------------------------------------------------------------------------------------------
>
> Key: LUCENE-8196
> URL: https://issues.apache.org/jira/browse/LUCENE-8196
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch,
> LUCENE-8196.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This ticket proposes an alternative implementation of the SpanQuery family
> that uses minimum-interval semantics from
> [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf]
> to implement positional queries across term-based fields. Rather than using
> TermQueries to construct the interval operators, as in LUCENE-2878 or the
> current Spans implementation, we instead use a new IntervalsSource object,
> which will produce IntervalIterators over a particular segment and field.
> These are constructed using various static helper methods, and can then be
> passed to a new IntervalQuery which will return documents that contain one or
> more intervals so defined.