Instead of a Collector, why isn't this a TwoPhaseIterator with a high matchCost?
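
To make the question concrete, here is a minimal sketch of the idea in plain Java. It does not use Lucene's classes: `TwoPhaseSketch`, `Confirm` and `search` are hypothetical names, and the int array stands in for the postings of a cheap lead clause. The assumption it illustrates is that confirmation checks run only on documents the cheap clause has already produced, cheapest check first, so a filter with a high matchCost is consulted as rarely as possible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

class TwoPhaseSketch {
    /** Hypothetical stand-in for a two-phase clause: a per-document check plus its estimated cost. */
    static final class Confirm {
        final IntPredicate matches;
        final float matchCost;
        int calls = 0; // how many times the check was actually evaluated

        Confirm(IntPredicate matches, float matchCost) {
            this.matches = matches;
            this.matchCost = matchCost;
        }

        boolean test(int doc) {
            calls++;
            return matches.test(doc);
        }
    }

    /** Walks the cheap lead postings, then confirms each candidate in ascending matchCost order. */
    static List<Integer> search(int[] leadPostings, Confirm... confirms) {
        Confirm[] ordered = confirms.clone();
        Arrays.sort(ordered, Comparator.comparingDouble(c -> c.matchCost));
        List<Integer> hits = new ArrayList<>();
        for (int doc : leadPostings) {          // phase 1: cheap candidates
            boolean ok = true;
            for (Confirm c : ordered) {         // phase 2: costly confirmation
                if (!c.test(doc)) {
                    ok = false;                 // short-circuit: pricier checks are skipped
                    break;
                }
            }
            if (ok) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Confirm cheap = new Confirm(d -> d % 2 == 0, 1f);
        Confirm costly = new Confirm(d -> d % 3 == 0, 100f);
        List<Integer> hits = search(new int[] {1, 2, 3, 4, 6, 12}, costly, cheap);
        System.out.println(hits + " cheap=" + cheap.calls + " costly=" + costly.calls);
    }
}
```

In this sketch the costly check is skipped for every candidate the cheap check rejects, which is the behaviour a post-filtering Collector cannot get, since it sees each hit only after all clauses have matched.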

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <[email protected]> wrote:

> Thanks Adrien, that is something like what I had in mind. If you are
> able to share, that could be very helpful. And -- deleted docs is not
> something I had considered, it's possibly a problem here. I'd have to
> go check - I think these "filter" Queries were implemented in the
> second part of the two-phase iteration.
>
> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[email protected]> wrote:
> >
> > We have something like that in Elasticsearch that wraps queries in
> > order to be able to report cost, matchCost and the number of calls to
> > nextDoc/advance/matches/score/advanceShallow/getMaxScore for every
> > node in the query tree.
> >
> > It's not perfect as it needs to disable some optimizations in order
> > to work properly. For instance bulk scorers are disabled and
> > conjunctions are not inlined, which means that clauses may run in a
> > different order. So results need to be interpreted carefully as the
> > way the query gets executed when observed may differ a bit from how
> > it gets executed normally. That said it has still been useful in a
> > number of cases. I don't think our implementation works when
> > IndexSearcher is configured with an executor but we could maybe put
> > it in sandbox and iterate from there?
> >
> > For your case, do you think it could be attributed to deleted docs?
> > Deleted docs are checked before two-phase confirmation and collectors
> > but after disjunctions/conjunctions of postings.
> >
> > On Thu, May 6, 2021 at 8:20 PM, Michael Sokolov <[email protected]> wrote:
> >>
> >> Do we have a way to understand how BooleanQuery (and other composite
> >> queries) are advancing their child queries? For example, a simple
> >> conjunction of two queries advances the more restrictive (lower
> >> cost()) query first, enabling the more costly query to skip over more
> >> documents. But we may not be making the best choice in every case, and
> >> I would like to know, for some query, how we are doing. For example,
> >> we could execute in a debugging mode, interposing something that wraps
> >> or observes the Scorers in some way, gathering statistics about how
> >> many documents are visited by each Scorer, which can be aggregated for
> >> later analysis.
> >>
> >> This is motivated by a use case we have in which we currently
> >> post-filter our query results in a custom collector using some filters
> >> that we know to be expensive (they must be evaluated on every
> >> document), but we would rather express these post-filters as Queries
> >> and have them advanced during the main Query execution. However when
> >> we tried to do that, we saw some slowdowns (in spite of marking these
> >> Queries as high-cost) and I suspect it is due to the iteration order,
> >> but I'm not sure how to debug.
> >>
> >> Suggestions welcome!
> >>
> >> -Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
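
For illustration, the "wrap and count" idea discussed in the thread could look roughly like the sketch below. It is self-contained plain Java, not Lucene's or Elasticsearch's actual code: `DocIterator`, `ArrayIterator`, `CountingIterator` and `countMatches` are hypothetical names, and the conjunction is a simplified leapfrog loop. The point is only that a thin wrapper around each child iterator can record how often it is stepped or advanced, so two lead orderings can be compared.

```java
class IterationStatsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical minimal iterator over increasing doc IDs. */
    interface DocIterator {
        int nextDoc();
        int advance(int target); // first doc >= target, or NO_MORE_DOCS
    }

    /** Iterates a sorted array of doc IDs. */
    static final class ArrayIterator implements DocIterator {
        private final int[] docs;
        private int idx = -1;

        ArrayIterator(int... docs) { this.docs = docs; }

        public int nextDoc() {
            idx++;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }

        public int advance(int target) {
            int doc;
            do { doc = nextDoc(); } while (doc < target);
            return doc;
        }
    }

    /** Wrapper that counts calls made by the caller, then delegates. */
    static final class CountingIterator implements DocIterator {
        private final DocIterator in;
        int nextDocCalls;
        int advanceCalls;

        CountingIterator(DocIterator in) { this.in = in; }

        public int nextDoc() { nextDocCalls++; return in.nextDoc(); }
        public int advance(int target) { advanceCalls++; return in.advance(target); }
    }

    /** Simplified leapfrog conjunction: lead steps forward, other catches up with advance(). */
    static int countMatches(DocIterator lead, DocIterator other) {
        int matches = 0;
        int otherDoc = -1;
        int doc = lead.nextDoc();
        while (doc != NO_MORE_DOCS) {
            if (otherDoc < doc) {
                otherDoc = other.advance(doc);
            }
            if (otherDoc == doc) {
                matches++;
                doc = lead.nextDoc();
            } else if (otherDoc == NO_MORE_DOCS) {
                break;                          // other is exhausted, no more matches possible
            } else {
                doc = lead.advance(otherDoc);   // skip lead to other's position
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        CountingIterator sparse = new CountingIterator(new ArrayIterator(5, 50));
        CountingIterator dense = new CountingIterator(new ArrayIterator(1, 2, 3, 5, 8, 13, 50, 60));
        int matches = countMatches(sparse, dense);
        System.out.println("matches=" + matches
            + " sparse.nextDoc=" + sparse.nextDocCalls + " sparse.advance=" + sparse.advanceCalls
            + " dense.nextDoc=" + dense.nextDocCalls + " dense.advance=" + dense.advanceCalls);
    }
}
```

Running the same two iterators with the lead and follower swapped, then comparing the counters, gives a rough picture of how much the iteration order matters for a given pair of clauses. As noted in the thread, observing execution this way can itself change how a query runs, so such counts are a hint rather than a measurement.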
