[jira] [Commented] (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180717#comment-14180717 ] Paul Elschot commented on LUCENE-1252: -- Thinking out loud: How about adding a method toDocWithAllRequiredTerms(), and a method toPositionMatch() that may move to a later doc? These two together should do what nextDoc() does now. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Core Issue Type: Wish Components: core/search Reporter: Paul Elschot Priority: Minor Labels: gsoc2014 In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178136#comment-14178136 ] Michael McCandless commented on LUCENE-1252: bq. I was fairly specific about how to tackle the problem... I dont mean to be rude, but did you actually read my proposal? Yes, you were specific, and yes, I did in fact read the proposal, several times, and as I said, I like it, but it's a complex problem that I've also spent time thinking about and so I gave some simple feedback about an alternative to one part of the proposal: It just seems strange to me, when Lucene already has this notion of boiling down complex queries into their low-level concrete equivalent queries (i.e. rewrite) to then add a separate API (two-phase execution) for doing what looks to me to be essentially the same thing. Two-phase execution is just rewriting to a two-clause conjunction where the cost of each clause are wildly different, opening up the freedom for BQ to execute more efficiently. And, yes, like rewrite, I think it should recurse. I also think BQ should handle the costly nextDoc filters (i.e. do them last) as first class citizens, and it seemed like the proposal suggested otherwise (maybe I misunderstood this part of the proposal). Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Core Issue Type: Wish Components: core/search Reporter: Paul Elschot Priority: Minor Labels: gsoc2014 In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178374#comment-14178374 ] Robert Muir commented on LUCENE-1252: - Both of those are covered. Yes its proposed as a first-class citizen (on DISI), more first-class than your second-class rewrite() solution. It might be good to already look at what BQ and co do with query execution today. the fact is, rewrite() is inappropriate. the main reason rewrite() is second-class here is because scoring is *per-segment* in lucene. In trunk today this means: * filters might get executed with a completely different plan for different segments (conjunction versus Bits-acceptDocs). * scorers might get returned as null for some segments (e.g. term doesnt exist). BQ doesn't have special null handling in every subscorer that it uses, that would be horrible. Instead it eliminates such scorers immediately in BooleanWeight's constructor. * this like the above can impact the structure of what BQ does (e.g. A or B, if B is null, it just returns A instead of a DisjunctionScorer). * scorers have a cost() api, which tells us additional critical information for execution. I've proposed expanding this to much more (additional flags/methods) so that BQ can be a quarterback and not the entire football team. another reason rewrite() is wrong here, is that such two-phase execution is only useful for conjunction (or: conjunction-like queries such as MinShouldMatch). Otherwise, its useless. Rewriting to two queries when the query is not a conjunction will not help performance. at the low level, I can't see a rewrite() solution being efficient. Today ExactPhraseScorer is already coded as: {code} while (iterate approximation...) { confirm(); // phraseFreq 0 } {code} so in the case, its already ready to be exposed for two-phase execution. On the other hand with rewrite(), it would either be 2 separate DPenums, or a mess? Finally, by having this generic on DISI, this approximation becomes much more flexible. For example a postings format could implement it for its docsenums, if it is able to implement a cheap approximation... perhaps via skiplist-like data, or perhaps with an explicit approximate cache mechanism specifically for this purpose. Another example is a filter like a geographic distance filter, could return a bounding box here automatically rather than forcing the user to deal with this themselves with complex query/filter hacks. At the same time, such a filter could signal that it has a linear time nextDoc, and booleanquery can really do the best execution. A rewrite() solution really makes this hard, because the two things would then be completely separate queries. But if you look at all the cases of executing this logic, its very useful for BQ to know that A is a superset of B. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Core Issue Type: Wish Components: core/search Reporter: Paul Elschot Priority: Minor Labels: gsoc2014 In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176689#comment-14176689 ] Michael McCandless commented on LUCENE-1252: +1 to this plan. Except if possible, I think high cost filters really should be first class in Lucene? I think the two-phase execution is in fact the same problem? I.e., a PhraseQuery should just rewrite itself into a conjunction of two clauses: the pure conjunction of the terms, which is fast to execute, AND'd with a high-cost filter (that loads positions and checks that the terms are actually a phrase). If BQ knew how to juggle such costly clauses then this issue (AND'ing a PhraseQuery with other simple clauses) and other example like geo filtering (which could rewrite into a cheap bbox AND'd w/ slow true distance check) should work well. But if somehow making high cost filters first class is too much, then I agree: let's punt, here. It just seems to me that it's the same issue as two-phase execution... Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Core Issue Type: Wish Components: core/search Reporter: Paul Elschot Priority: Minor Labels: gsoc2014 In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176790#comment-14176790 ] Robert Muir commented on LUCENE-1252: - I was fairly specific about how to tackle the problem... I dont mean to be rude, but did you actually read my proposal? Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Core Issue Type: Wish Components: core/search Reporter: Paul Elschot Priority: Minor Labels: gsoc2014 In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176451#comment-14176451 ] Robert Muir commented on LUCENE-1252: - I think we should keep the issue open, I know I've been thinking about this one a lot lately. The thing I see is, for it to work really nicely, BooleanQuery really needs to own execution of both queries and filters. some kind of blurry proposal/plan like this: Execute filters by BooleanQuery instead of its mini-me (FilteredQuery), e.g. as an additional type of BooleanClause. Merge Filter and Weight, in some way that makes sense, e.g. maybe just make Weight.scorer(LeafReaderContext context, Bits acceptDocs) a covariant-return override of Filter.getDocIdSet(LeafReaderContext context, Bits acceptDocs). Make sure any wrappers like ConstantScore delegate any new APIs correctly. Add bulk methods like and/or/not to Filter such that optimized impls like FixedBitSet.and() can be used. Since java 7u40 these ones get autovectorized by hotspot and are a valid strategy. I think maybe some of these could be optimized by sparse bitset impls as well. Create an enhanced cost metric/execution API for filters. BooleanQuery needs this additional context to give the most efficient execution. At the least, it should have the information to know to do the bulk optos above, and even apply deletes this way if its appropriate (in lucene 5 deleted docs are a FixedBitSet). I would also want a way to indicate that a Filter has a linear-time nextDoc(). these cases (e.g. filtering by exact geographic distance) are horrible to support, but handling them correctly (e.g. in a final phase) is a lesser evil than having the API be crazy so that systems like solr/es can do them with hacks. Remove stuff like FilteredQuery, BooleanFilter, etc. Fix LUCENE-3331 (or impl in some other way), such that scores are not needed is passed down the query execution stack. The tricky part is BQ's execution plan is currently in two places really, rewrite() and Weight.scorer(). And I really think it needs the freedom to be able to completely restructure queries for performance (across nested BQ as well). Another option is to setup internal infra so BooleanWeight.scorer() can do this, as it have cost() knowledge too, but it feels so wrong. Finally, we should add some support for two-phase execution via DISI.getSuperSet() or some other approximation. ConjunctionScorer could both use (when at least one sub supports) and implement this method (when e.g. coord scoring prevents optimal restructing and its nested) for faster AND/filtering of phrase/sloppy/spans/whatever, or for any other custom query/filter that supports a fast approximation. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Core Issue Type: Wish Components: core/search Reporter: Paul Elschot Priority: Minor Labels: gsoc2014 In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867094#action_12867094 ] Paul Elschot commented on LUCENE-1252: -- LUCENE-2410 has solved this partially for PhraseQuery/PhraseScorer by computing only the first matching phrase to determine a possible match, and by delaying the computation of the remaining matches until score() is called. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Paul Elschot Priority: Minor In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12713918#action_12713918 ] Shai Erera commented on LUCENE-1252: I must admit that I read the issue briefly, and so my proposal below may be irrelevant. But it sounds to me like for the above example we'd want to use ConjunctionScorer to first iterate on both Scorers (PhraseScorer?) until they agree, and only then call Collector.collect(), which will call Scorer.score() in return (if a scoring collector is used). Alternatively, we can implement a FilterCollector (if there isn't one already) which impls its collect(int doc) by first determining if a document is a match, and then call a given Collector, which will proceed with collecting and scoring. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Paul Elschot Priority: Minor In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12713928#action_12713928 ] Michael McCandless commented on LUCENE-1252: I agree, we would want to do a ConjunctionScorer first, w/ all 4 terms, and then 2nd a PhraseScorer for each of the two phrases, but somehow they should be bound together such that a single TermPositions enumerator is shared in the two places for each term. I think doing the additional filtering in collect is a little late -- there could be a number of such more expensive constraints to apply, depending on the query. But, eg since we're talking about how to fix the up-front sort logic in ConjunctionScorer... you could imagine asking the PhraseQuery for its scorer, and getting back 2 AND'd scorers (cheap expensive) that under-the-hood are sharing a single TermPositions enum, and then ConjunctionScorer would order all such scorers it got so that all cheap ones are checked first and only once they agree on a doc are the expensive scorers check. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Paul Elschot Priority: Minor In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704711#action_12704711 ] Michael McCandless commented on LUCENE-1252: Here's a simple example that might drive this issue forward: +h1n1 flu +united states Ideally, to score this query, you'd want to first AND all 4 terms together, and only for docs matching that, consult the positions of each pair of terms. But we fail to do this today. It's like somehow Weight.scorer() needs to be able to return a cheap and an expensive scorer (which must be AND'd). I think PhraseQuery would somehow return cheap/expensive scorers that under the hood share the same SegmentTermDocs/Positions iterators, such that after cheap.next() has run, cheap.expensive only needs to check the current doc. So in fact maybe the expensive scorer should not be a Scorer but some other simple passes or doesn't API. Or maybe it returns say a TwoStageScorer, which adds a reallyPasses() (needs better name) method to otherwise normal the Scorer (DISI) API. Or something else Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Paul Elschot Priority: Minor In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702520#action_12702520 ] Michael McCandless commented on LUCENE-1252: Right, I think this is more about determining whether a doc is a hit or not, than about how to compute its score. I think somehow the scorer needs to return 2 scorers that share the underlying iterators. The first scorer simply checks AND-ness with all other required terms, and only if the doc passes those are the positional/payloads/anything-else-expensive consulted. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Paul Elschot Priority: Minor In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12701957#action_12701957 ] Paul Elschot commented on LUCENE-1252: -- There is no patch for now. HitCollectors should not be affected by this, as they would only be involved when a real match is found, and that, when position info is needed, necessarily involves the positions. Extending this with a cheap score brings another issue: should a cheap score be given for a document that might match, but in the end does not really match when positions are used? At the moment, I don't think so: score values are normally cheap to compute, but accessing positions is not cheap. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Paul Elschot Priority: Minor In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12701726#action_12701726 ] Jason Rutherglen commented on LUCENE-1252: -- When flexible indexing goes in, users will be able to put data into the index that allow scorers to calculate a cheap score, collect, then go through and calculate a presumably more expensive score. Would it be good to implement this patch with this sort of more general framework in mind? It seems like this could affect the HitCollector API as we'd want a more generic way of representing scores than the primitive float we assume now. Aren't we rewriting the HitCollector APIs right now? Can we implement this change now? Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Paul Elschot Priority: Minor In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org