[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892990#comment-16892990 ] Michael Gibney commented on LUCENE-4312: For the sake of facilitating discussion around something more concrete, I uploaded a patch ([^positionLength-postings.patch]) for a straw-man proposal for {{PostingsEnum}} modifications to support position length (also visible as a pseudo-PR here: [https://github.com/magibney/lucene-solr/pull/1]). The patch won't compile, of course (no corresponding modifications to subclasses of {{PostingsEnum}}). The proposal goes a bit beyond simply adding a {{positionLength()}} method, with a few additional fundamental methods to support optimizations that proved helpful in implementing performant positional queries (for LUCENE-7398). Any feedback would be much appreciated, especially given the acknowledged provisional (and potentially controversial) nature of this proposal. [~jpountz], I've given some more thought to the challenge of not being able to "advance positions on terms in the order we want anymore". I think there should be a general-purpose way to preserve this ability (in a way that doesn't depend on the kind of corpus-specific shingle-based filtering that I previously suggested). I'm considering an approach leveraging something analogous to a reverse token filter, except rather than reversing the token text, it (sort of) reverses start/end positions: start position of the new token is end position of the original token, and end position of the new token is {{originalEndPosition + positionLength}}. Then you could use the least-cost term as an entrypoint, and build forward with original tokens, backward with the modified-positions tokens. Query implementation would be responsible for properly interpreting flipped positions. > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Attachments: positionLength-postings.patch > > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883103#comment-16883103 ] Michael Gibney commented on LUCENE-4312: That PR looks like a good start, [~romseygeek]. I'd still like to have a better sense of what bar we're trying to clear as far as demonstrating the usefulness of recording position length in the index. It seems we have consensus (basically unchanged since August of 2012), that: # Recording position length in the index is relatively straightforward, whether done via codecs and modified postings API or via Payloads and custom Intervals/Spans implementations. # Performant positional query implementation leveraging position length is a significant challenge. Would people be comfortable evaluating usefulness/performance based on the Spans implementation proposed for LUCENE-7398, as opposed to holding out for an Intervals query implementation? Particularly if we're talking proof-of-concept, all of the performance implications are directly comparable (between Spans and Intervals). Is there a _conditional_ consensus that indexed position length would be a worthwhile feature, assuming the implementation of performant positional queries to leverage it? > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882091#comment-16882091 ] Alan Woodward commented on LUCENE-4312: --- I've opened a PR with a simple implementation of this as an IntervalsSource: https://github.com/apache/lucene-solr/pull/772 Note that this currently works only with index-time graphs; currently the logic for rewriting query-time graphs relies on the fact that the only sources with a minimum extent of 1 are simple terms, and this new source breaks that assumption. I think we can fix that with the addition of another method to IntervalsSource, but I'm trying to keep things as simple as possible to begin with. > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881477#comment-16881477 ] Michael Gibney commented on LUCENE-4312: This sounds potentially like a good way to proceed. I appreciate the need for a high bar for getting things into the index – I suppose I was invoking "chicken/egg" not directly as an argument for inclusion in the index, but rather to highlight the interdependence of these features. Essentially all of the proof-of-concept work that we're discussing here is already implemented as part of the LUCENE-7398 work, and has been running in production (and iteratively improved) for over a year. Before proceeding, I'd like to get some consensus on what the best way is to move forward, and also perhaps have some discussion of what bar we have in mind for "once these prove useful". Regarding usefulness, and the question of to what extent this represents a corner-case: anybody interested in index-time synonyms and precise positional queries needs this feature. So in some sense this boils down to a question of the usefulness of index-time synonyms (or other index-time TokenStream graphs) ... and since the standing recommendation has for some time been to _avoid_ using index-time synonyms, we have another chicken/egg :). I can say that this has been considerably helpful in my use case, and the problem it addresses is at the root of a number of consistently reported issues, among users of both [Elasticsearch|https://discuss.elastic.co/t/not-getting-results-from-a-phrase-query-using-query-string-of-the-form-x-a1-abc-in-6-6-0/179191] and [Solr|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201905.mbox/%3CCAF%3DheHETPCqxUcqyu13tFfKFcALzD__-QrToRBP-VVWh1S3-Wg%40mail.gmail.com%3E]. Practically speaking, I'm wondering what's the best way to get the most eyes on this feature set, with the goal of evaluating its usefulness and performance. The fix as currently implemented is basically a wholesale rewrite of some of the Spans classes, but it seeks to correctly support existing Spans contracts; implemented as a branch, I was able to rely on existing tests against various Spans. For performance reasons, changes were also introduced in indexing code (e.g., DefaultIndexingChain). For these reasons, my sense is that it would be quite challenging to extract these features into the sandbox module. Even if such "sandbox" extraction were possible, it would render the task of evaluation more difficult for all but the most dedicated users (currently it suffices to run a forked build to swap out the backend Spans implementation in all of the parsers and components that rely on Spans). Could these potentially be reasons to opt for a "branch" approach (as opposed to "sandbox")? > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881195#comment-16881195 ] Michael McCandless commented on LUCENE-4312: +1 to build on payloads, today, to break the chicken/egg situation. This should just be a {{TokenFilter}} that converts {{PositionLengthAttribute}} into payloads? Then [~mgibney] could contribute query-time code that can decode these payloads and implement correct positional queries. Once these prove useful we can circle back later and optimize how we store position lengths in the index. > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880828#comment-16880828 ] Robert Muir commented on LUCENE-4312: - I don't think chicken and the egg description works well as an argument for something to add to the index. We should have a high bar in order to do that, because once something gets added its basically impossible to remove. My earlier suggestion (payloads) was based on the fact that we are talking about corner-cases as far as search improvements, at a heavy complexity cost. Maybe we could first address the search side with payload-based queries (maybe in sandbox, similar to what you already developed?) to try to address [~jpountz] concerns about scalability before actually optimizing it further by encoding in the index? This way it wouldn't have to be all solved at once. > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880635#comment-16880635 ] Michael Gibney commented on LUCENE-4312: True, both good points. But it's kind of a chicken-or-egg situation ... there would have been no point to address these implied challenges, so long as position length has not been recorded in the index (and is thus not available at query time). That doesn't mean there _aren't_ ways to address the challenges. Regarding the "A B C" example, I addressed this in the LUCENE-7398 work by indexing next start position as a lookahead. As a proof of concept this was done with Payloads, but in principle I could see slight modifications (somewhere at the intersection of codecs and postings API) that would natively read next start position "early" and expose it as a lookahead. This would avoid the type of problematic call to {{PostingsEnum.nextPosition()}} that would (as you correctly point out) result in the need to buffer all information associated with _every_ position. I've described this approach in more detail [here|https://michaelgibney.net/2018/09/lucene-graph-queries-2/#index-lookahead-don-t-buffer-positions-if-you-don-t-have-to]. {quote}we can't advance positions on terms in the order we want anymore. {quote} Yes, I'd argue that's the toughest challenge. I addressed it indirectly by constructing CommonGrams-style shingles used specifically for pre-filtering conjunctions in the "approximation" phase of two-phase iteration (ensuring that common terms at subclause index 0 don't kill performance). This is described in more detail [here|https://michaelgibney.net/2018/09/lucene-graph-queries-2/#shingle-based-pre-filtering-of-conjunctionspans]. I'm not intending this to be about these particular solutions, and you might take issue with the solutions themselves. The more general point I guess is that indexed position length is fundamental, and is a prerequisite for the development of ways to address these challenges. > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880606#comment-16880606 ] Adrien Grand commented on LUCENE-4312: -- bq. the complexity of query execution would be driven by what's actually in the index I don't think this is true. For instance an exact phrase query trying to match "A B C" that is currently positioned on A (position=3, length=1), B (position=4, length=1), C (position=6, length=1) would need to advance B to the next position in case there is another match on position 4 that has a length of 2. And then we should advance C first because maybe because it also has another match on position 4 of a different length. Also we can't advance positions on terms in the order we want anymore. Today we use the rarer term to lead the iteration of positions. If we had position lengths in the index we would need to advance positions in the order in which terms occur in the phrase query since the start position that B must have depends on the length of A on the current position: position starts are guaranteed to come in order in the index but position ends are not (at least we don't enforce it in token streams today). > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880490#comment-16880490 ] Michael Gibney commented on LUCENE-4312: Thank you for the feedback, [~sokolov] and [~jpountz]! {quote}Recording position lengths in the index is the easy part of the problem in my opinion. {quote} Yes, this is my view as well; and looking to the future, _respecting_ position length would certainly add complexity to phrase queries. But in terms of performance impact, the complexity of query execution would be driven by what's actually in the index (so for many use cases performance should be roughly equivalent to that of an implementation that ignores position length). Regarding the challenges of query implementation... I'm taking a fresh look at this issue in the context of work done on LUCENE-7398, which seeks to implement backtracking phrase queries in an efficient way (including sloppy, nested, etc.). Despite that issue being nominally about "nested Span queries", it's really more generally about "proximity search over variable-length subclauses", and the techniques used in the implementation for LUCENE-7398 would be transferable to interval queries as well. It's a fair point about the arbitrariness of sloppy phrase queries with intervening multi-term synonyms, but I wouldn't call such queries "meaningless"; in any case, I think that problem already exists for multi-term indexed synonyms, and is not exacerbated by the introduction of indexed position length. Sloppy phrase queries (and, for that matter, tokenization itself) are somewhat arbitrary by nature. Following that tangent, I can imagine some potential ways to mitigate such arbitrariness ... all of which themselves rely on the ability to index token graph structure (i.e., position length). > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880422#comment-16880422 ] Adrien Grand commented on LUCENE-4312: -- Recording position lengths in the index is the easy part of the problem in my opinion. I'm concerned that this will introduce significant complexity to phrase queries (they will require backtracking in order to deal with the case that a term exists twice at the same position with different position lengths), and even make sloppy phrase queries and their spans/intervals counterparts meaningless (as terms could be very distant according to the index only because there is one term in-between that has a multi-term synonym indexed). > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879757#comment-16879757 ] Mike Sokolov commented on LUCENE-4312: -- Yes, we're compromising precision today when we apply index-time synonyms (and other analysis that produces token graphs). I think this would be an awesome addition If the cost of indexing position length in postings is not too great, and I think you are right -- it will usually be 1, sometimes 0 and rarely > 1, so we should be able to encode compactly, and decode efficiently. > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879425#comment-16879425 ] Michael Gibney commented on LUCENE-4312: Following up on discussion at Berlin Buzzwords with [~mikemccand], [~sokolov], [~simonw], and [~romseygeek]: A lot of useful context (for, e.g., synonym generation, etc.) is available at index time that is not available at query time. Leveraging this context can result in index-time TokenStream manipulations that produce token graphs. Since position length is not indexed, it is impossible at query time to reconstruct index-time TokenStream "graph" structure. Indexed position length is a prerequisite for any use case that calls for: 1. index-time graph TokenStreams 2. precise/accurate proximity query (via spans, intervals, etc.) Could we discuss adding first-class support for this structural "position length" information? Updating PostingsEnum to include endPosition() -- returning {{position+1}} by default -- would be a meaningful first step. This would facilitate the development of query implementations without requiring an API fork, and would signal an intention to move in the direction of supporting index-time token graphs. Beyond that, I'm optimistic that codecs could be enhanced to index position length without introducing much additional overhead (I'd guess that position length for the common case of linear/non-graph index-time token streams could compress quite well). > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike Mccandless said:TokenStreams are actually graphs. > Indexer ignores PositionLengthAttribute.Need change the index format (and > Codec APIs) to store an additional int position length per position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829599#comment-16829599 ] Michael Gibney commented on LUCENE-4312: (following on discussion at