Re: Computing multiple different aggregations over a match-set in one pass

Stefan Vodita Thu, 16 Feb 2023 13:32:12 -0800

Hi Greg,

To better understand how much work gets duplicated, I went ahead
and modified FloatTaxonomyFacets as an example [1]. It doesn't look
too pretty, but it illustrates how I think multiple aggregations in one
iteration could work.
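
To make the idea concrete without depending on Lucene, here is a minimal
sketch in plain Java. It is not the code from [1]; the class name, the
`Result` holder, and the in-memory arrays standing in for facet ordinals and
a `popularity` DocValue are all hypothetical, chosen only for illustration.
It computes a per-ordinal sum and max together, so the ordinals for each hit
are read only once.

```java
import java.util.Arrays;

public class OnePassAggregationSketch {

    /** Holds two per-ordinal aggregations that are filled in the same pass. */
    public static final class Result {
        public final float[] sum;
        public final float[] max;

        Result(int ordinalCount) {
            sum = new float[ordinalCount];
            max = new float[ordinalCount];
            // Ordinals that no hit touches keep this sentinel value.
            Arrays.fill(max, Float.NEGATIVE_INFINITY);
        }
    }

    /**
     * Iterates the hits once. For each hit, the ordinals and the value are
     * read a single time and fed into both aggregations.
     *
     * ordinalsPerDoc and valuePerDoc are hypothetical stand-ins for the
     * facet ordinals and the numeric DocValue of each document.
     */
    public static Result aggregate(
            int[][] ordinalsPerDoc, float[] valuePerDoc, int[] hits, int ordinalCount) {
        Result result = new Result(ordinalCount);
        for (int doc : hits) {
            int[] ordinals = ordinalsPerDoc[doc]; // the work a second pass would repeat
            float value = valuePerDoc[doc];
            for (int ord : ordinals) {
                result.sum[ord] += value;                           // aggregation 1
                result.max[ord] = Math.max(result.max[ord], value); // aggregation 2
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] ordinalsPerDoc = { {0, 1}, {0, 2}, {1}, {2} };
        float[] popularity = { 4f, 9f, 1f, 16f };
        int[] hits = { 0, 1, 3 }; // document 2 did not match the query

        Result result = aggregate(ordinalsPerDoc, popularity, hits, 3);
        System.out.println(Arrays.toString(result.sum)); // [13.0, 4.0, 25.0]
        System.out.println(Arrays.toString(result.max)); // [9.0, 4.0, 16.0]
    }
}
```

With two separate passes, the outer loop and the ordinal lookup would run
twice, while the two update statements would each still run once per
ordinal. That is the split between duplicated and non-duplicated work
discussed below.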


Overall, you're right, there's not as much wasted work as I had
expected. I'll try to do a performance comparison to quantify precisely
how much time we could save, just in case.

Thank you for the help!
Stefan

[1] 
https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684

On Wed, 15 Feb 2023 at 15:48, Greg Miller <[email protected]> wrote:
>
> Hi Stefan-
>
> > In that case, iterating twice duplicates most of the work, correct?
>
> I'm not sure I'd agree that it duplicates "most" of the work. This is an
> association faceting example, which is a little bit of a special case in
> some ways. But, to your question, there is duplicated work here of
> re-loading the ordinals across the two aggregations, but I would suspect
> the more expensive work is actually computing the different aggregations,
> which is not duplicated. You're right that it would likely be more
> efficient to iterate the hits once, loading the ordinals once and computing
> multiple aggregations in one pass. There's no facility for doing that
> currently in Lucene's faceting module, but you could always propose it! :)
> That said, I'm not sure how common of a case this really is for the
> majority of users? But that's just a guess/assumption.
>
> Cheers,
> -Greg
>
> On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <[email protected]>
> wrote:
>
> > Hi Greg,
> >
> > I see now where my example didn’t give enough info. In my mind,
> > `Genre / Author nationality / Author name` is stored in one hierarchical
> > facet field. The data we’re aggregating over, like publish date or price,
> > are stored in DocValues.
> >
> > The demo package shows something similar [1], where the aggregation
> > is computed across a facet field using data from a `popularity` DocValue.
> >
> > In the demo, we compute `sum(_score * sqrt(popularity))`, but what if we
> > want several other different aggregations with respect to the same facet
> > field? Maybe we want `max(popularity)`. In that case, iterating twice
> > duplicates most of the work, correct?
> >
> >
> > Stefan
> >
> > [1]
> > https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> >
> > On Mon, 13 Feb 2023 at 22:46, Greg Miller <[email protected]> wrote:
> > >
> > > Hi Stefan-
> > >
> > > That helps, thanks. I'm a bit confused about where you're concerned with
> > > iterating over the match set multiple times. Is this a situation where the
> > > ordinals you want to facet over are stored in different index fields, so
> > > you have to create multiple Facets instances (one per field) to compute the
> > > aggregations? If that's the case, then yes, you have to iterate over the
> > > match set multiple times (once per field). I'm not sure that's such a big
> > > issue given that you're doing novel work during each iteration, so the only
> > > repetitive cost is actually iterating the hits. If the ordinals are
> > > "packed" into the same field though (which is the default in Lucene if
> > > you're using taxonomy faceting), then you should only need to do a single
> > > iteration over that field.
> > >
> > > Cheers,
> > > -Greg
> > >
> > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <[email protected]>
> > > wrote:
> > >
> > > > Hi Greg,
> > > >
> > > > I’m assuming we have one match-set which was not constrained by any
> > > > of the categories we want to aggregate over, so it may have books by
> > > > Mark Twain, books by American authors, and sci-fi books.
> > > >
> > > > Maybe we can imagine we obtained it by searching for a keyword, say
> > > > “Washington”, which is present in Mark Twain’s writing, and those of other
> > > > American authors, and in sci-fi novels too.
> > > >
> > > > Does that make the example clearer?
> > > >
> > > >
> > > > Stefan
> > > >
> > > >
> > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <[email protected]> wrote:
> > > > >
> > > > > Hi Stefan-
> > > > >
> > > > > Can you clarify your example a little bit? It sounds like you want to facet
> > > > > over three different match sets (one constrained by "Mark Twain" as the
> > > > > author, one constrained by "American authors" and one constrained by the
> > > > > "sci-fi" genre). Is that correct?
> > > > >
> > > > > Cheers,
> > > > > -Greg
> > > > >
> > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Let’s say I have an index of books, similar to the example in the facet
> > > > > > demo [1], with a hierarchical facet field encapsulating
> > > > > > `Genre / Author’s nationality / Author’s name`.
> > > > > >
> > > > > > I might like to find the latest publish date of a book written by Mark
> > > > > > Twain, the sum of the prices of books written by American authors, and the
> > > > > > number of sci-fi novels.
> > > > > >
> > > > > > As far as I understand, this would require faceting 3 times over the
> > > > > > match-set, one iteration for each aggregation of a different type
> > > > > > (max(date), sum(price), count). That seems inefficient if we could instead
> > > > > > compute all aggregations in one pass.
> > > > > >
> > > > > > Is there a way to do that?
> > > > > >
> > > > > >
> > > > > > Stefan
> > > > > >
> > > > > > [1]
> > > > > > https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > For additional commands, e-mail: [email protected]
> > > > > >
> > > > > >
