Thanks for the clarification, Greg. I've been looking into this recently and
filed https://issues.apache.org/jira/browse/LUCENE-9938 based on a hunch
that these DocIdSetIterator.all(maxDoc) iterators have a non-negligible
cost inside ConjunctionDISI. Ultimately I closed the issue because the
TwoPhaseIterator (TPI) design seems to prohibit removing them :-(. Feel
free to comment there nonetheless if you have any thoughts on the matter.
For my part, I have some benchmarking to do in Solr for a related matter,
SOLR-14164, which would move certain queries that work at the collector
stage to be TPIs.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, May 7, 2021 at 6:06 PM Greg Miller <[email protected]> wrote:

> Just chiming in here to answer David's question since I have some
> familiarity:
>
> In this specific case, the logic was implemented inside a Collector
> and we tried to move it into a Query abstraction using a
> TwoPhaseIterator with a high matchCost. The first phase would match on
> all docs (essentially: DocIdSetIterator.all(reader.maxDoc())) and the
> second phase would do the costly check. The matchCost was advertised
> as reader.maxDoc(). ("reader" in this example is from the
> LeafReaderContext).
>
> Moving the logic behind a Query abstraction caused performance
> regressions. So one theory is that it was somehow leading iteration
> with an expensive "match all docs" DISI, but we don't actually know if
> that's true right now.
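
To make the setup described above concrete, here is a minimal, self-contained
sketch in plain Java. It is not Lucene code: the Approximation class, the
search method, and the sample doc ids are hypothetical stand-ins for
DocIdSetIterator, ConjunctionDISI, and a real index. It assumes a two-clause
conjunction that leads with the lower-cost clause and runs the expensive
second-phase check only on docs where both approximations agree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Toy model of a conjunction with a match-all two-phase clause; not Lucene code. */
final class TwoPhaseSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Simplified doc-id iterator that counts how often it is advanced. */
  static final class Approximation {
    final int[] docs; // sorted doc ids, or null for "every doc below maxDoc"
    final int maxDoc;
    int advanceCalls = 0;

    Approximation(int[] docs, int maxDoc) {
      this.docs = docs;
      this.maxDoc = maxDoc;
    }

    /** Estimated number of docs this iterator can return. */
    long cost() {
      return docs == null ? maxDoc : docs.length;
    }

    /** Returns the first doc id >= target, or NO_MORE_DOCS. */
    int advance(int target) {
      advanceCalls++;
      if (docs == null) {
        return target < maxDoc ? target : NO_MORE_DOCS;
      }
      for (int d : docs) { // linear scan is enough for a toy example
        if (d >= target) return d;
      }
      return NO_MORE_DOCS;
    }
  }

  /** Hits plus the call counts gathered during one search. */
  static final class Result {
    final List<Integer> hits = new ArrayList<>();
    int checkCalls;
    int sparseAdvances;
    int matchAllAdvances;
  }

  /** Intersects a sparse clause with a match-all clause, then applies the costly check. */
  static Result search(int[] sparseDocs, int maxDoc, IntPredicate expensiveCheck) {
    Approximation sparse = new Approximation(sparseDocs, maxDoc);
    Approximation matchAll = new Approximation(null, maxDoc);
    // Assumption: the clause with the lower cost() leads the iteration.
    Approximation lead = sparse.cost() <= matchAll.cost() ? sparse : matchAll;
    Approximation other = lead == sparse ? matchAll : sparse;

    Result r = new Result();
    int doc = lead.advance(0);
    while (doc != NO_MORE_DOCS) {
      int otherDoc = other.advance(doc);
      if (otherDoc == doc) {
        r.checkCalls++; // second phase: runs only on agreed candidates
        if (expensiveCheck.test(doc)) r.hits.add(doc);
        doc = lead.advance(doc + 1);
      } else {
        doc = lead.advance(otherDoc);
      }
    }
    r.sparseAdvances = sparse.advanceCalls;
    r.matchAllAdvances = matchAll.advanceCalls;
    return r;
  }

  public static void main(String[] args) {
    Result r = search(new int[] {3, 10, 500}, 1000, d -> d % 2 == 0);
    System.out.println("hits=" + r.hits + " checkCalls=" + r.checkCalls
        + " sparseAdvances=" + r.sparseAdvances
        + " matchAllAdvances=" + r.matchAllAdvances);
  }
}
```

In this toy run the sparse clause leads, so the expensive check runs 3 times
rather than 1000, yet the match-all approximation is still advanced once per
candidate. That per-candidate call is the kind of overhead LUCENE-9938 was
asking about. Whether it explains the regression would still need measuring
against the real query.
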
>
> Cheers,
> -Greg
>
> On Fri, May 7, 2021 at 8:41 AM David Smiley <[email protected]> wrote:
> >
> > Instead of a Collector, why isn't this a TwoPhaseIterator with a high
> > matchCost?
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <[email protected]> wrote:
> >>
> >> Thanks Adrien, that is something like what I had in mind. If you are
> >> able to share, that could be very helpful. And -- deleted docs is not
> >> something I had considered, it's possibly a problem here. I'd have to
> >> go check - I think these "filter" Queries were implemented in the
> >> second part of the two-phase iteration.
> >>
> >> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[email protected]> wrote:
> >> >
> >> > We have something like that in Elasticsearch that wraps queries in
> >> > order to be able to report cost, matchCost and the number of calls to
> >> > nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node
> >> > in the query tree.
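
A rough illustration of the wrapping idea, in plain Java rather than against
the Lucene or Elasticsearch APIs. DocIterator, ArrayIterator, and
CountingIterator are hypothetical names, and only nextDoc/advance are counted
here. A real profiler would presumably also need to wrap matches, score, and
the other methods listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy call-counting wrapper around a doc-id iterator; not Lucene code. */
final class CountingSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Minimal stand-in for a doc-id iterator. */
  interface DocIterator {
    int nextDoc();
    int advance(int target);
  }

  /** Iterates over a sorted array of doc ids. */
  static final class ArrayIterator implements DocIterator {
    private final int[] docs;
    private int idx = -1;

    ArrayIterator(int... docs) {
      this.docs = docs;
    }

    @Override
    public int nextDoc() {
      idx++;
      return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    @Override
    public int advance(int target) {
      int doc;
      do {
        doc = nextDoc(); // step forward until we reach or pass target
      } while (doc < target);
      return doc;
    }
  }

  /** Delegates every call to the wrapped iterator and records how often each was used. */
  static final class CountingIterator implements DocIterator {
    private final DocIterator in;
    final Map<String, Integer> calls = new LinkedHashMap<>();

    CountingIterator(DocIterator in) {
      this.in = in;
    }

    @Override
    public int nextDoc() {
      calls.merge("nextDoc", 1, Integer::sum);
      return in.nextDoc();
    }

    @Override
    public int advance(int target) {
      calls.merge("advance", 1, Integer::sum);
      return in.advance(target);
    }
  }

  /** Drives a wrapped iterator through a few calls and returns the counts. */
  static Map<String, Integer> demo() {
    CountingIterator it = new CountingIterator(new ArrayIterator(2, 5, 9, 40));
    it.nextDoc();  // lands on doc 2
    it.advance(6); // skips 5, lands on doc 9
    it.nextDoc();  // lands on doc 40
    it.nextDoc();  // exhausted
    return it.calls;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Because the wrapper sits between the caller and the real iterator, it only
sees calls that go through it, which is one reason optimizations that bypass
the per-clause iterators have to be disabled for the counts to mean anything.
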
> >> >
> >> > It's not perfect as it needs to disable some optimizations in order
> >> > to work properly. For instance bulk scorers are disabled and
> >> > conjunctions are not inlined, which means that clauses may run in a
> >> > different order. So results need to be interpreted carefully as the
> >> > way the query gets executed when observed may differ a bit from how it
> >> > gets executed normally. That said it has still been useful in a number
> >> > of cases. I don't think our implementation works when IndexSearcher is
> >> > configured with an executor but we could maybe put it in sandbox and
> >> > iterate from there?
> >> >
> >> > For your case, do you think it could be attributed to deleted docs?
> >> > Deleted docs are checked before two-phase confirmation and collectors
> >> > but after disjunctions/conjunctions of postings.
> >> >
> >> > On Thu, May 6, 2021 at 8:20 PM Michael Sokolov <[email protected]> wrote:
> >> >>
> >> >> Do we have a way to understand how BooleanQuery (and other composite
> >> >> queries) are advancing their child queries? For example, a simple
> >> >> conjunction of two queries advances the more restrictive (lower
> >> >> cost()) query first, enabling the more costly query to skip over more
> >> >> documents. But we may not be making the best choice in every case, and
> >> >> I would like to know, for some query, how we are doing. For example,
> >> >> we could execute in a debugging mode, interposing something that wraps
> >> >> or observes the Scorers in some way, gathering statistics about how
> >> >> many documents are visited by each Scorer, which can be aggregated for
> >> >> later analysis.
> >> >>
> >> >> This is motivated by a use case we have in which we currently
> >> >> post-filter our query results in a custom collector using some filters
> >> >> that we know to be expensive (they must be evaluated on every
> >> >> document), but we would rather express these post-filters as Queries
> >> >> and have them advanced during the main Query execution. However when
> >> >> we tried to do that, we saw some slowdowns (in spite of marking these
> >> >> Queries as high-cost) and I suspect it is due to the iteration order,
> >> >> but I'm not sure how to debug.
> >> >>
> >> >> Suggestions welcome!
> >> >>
> >> >> -Mike
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: [email protected]
> >> >> For additional commands, e-mail: [email protected]
> >> >>
