After benchmarking my implementation against the existing one, I think the repeated iteration does add meaningful overhead. I built a small driver [1] that runs the two solutions over an index of geo data [2] (thank you, Greg, for providing the indexing code!).
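
To give a sense of what the driver measures without having to read [1]: it collects the match set once, then times the faceting passes. Roughly, the current implementation boils down to the sketch below (the method name, the query, and the `geoname` dimension are only illustrative, and plain counts stand in for the no-op aggregation used in the measurements):

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Rough shape of the timing loop only; the real driver is in [1].
static long timeCurrentImpl(IndexSearcher searcher, TaxonomyReader taxoReader,
                            FacetsConfig config, Query query, int numAggregations)
    throws IOException {
  // Collect the match set once (about 100k docs in the runs below).
  FacetsCollector fc = new FacetsCollector();
  FacetsCollector.search(searcher, query, 10, fc);

  long start = System.nanoTime();
  for (int i = 0; i < numAggregations; i++) {
    // Current approach: each aggregation builds its own Facets instance,
    // so every one of them re-walks the match set and re-decodes the ordinals.
    Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
    facets.getTopChildren(10, "geoname"); // hypothetical dimension name
  }
  return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
}
```

The proposed implementation does the same aggregation work but walks the match set and decodes the ordinals only once for all of the aggregations.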

The table below lists faceting times in milliseconds. I’ve named the current
implementation serial and my proposal parallel, for lack of better names. The
aggregation function is a no-op, so we’re only measuring the time spent outside
aggregation. The measurements are over a match-set of 100k docs, but the number
of docs does not have a large impact on the results because the aggregation
function isn’t doing any work.

| Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
|------------------------|---------------------------|-----------------------------|
| 2                      | 510                       | 328                         |
| 5                      | 1211                      | 775                         |
| 10                     | 2366                      | 1301                        |

If we use a MAX aggregation over a DocValue instead, the results tell a similar story. In this case the number of docs does matter, so I've included results for 10 docs (first table below) and 100k docs (second table); a sketch of what one of these MAX aggregations looks like follows the tables.

| Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
|------------------------|---------------------------|-----------------------------|
| 2                      | 706                       | 505                         |
| 5                      | 1618                      | 1119                        |
| 10                     | 3152                      | 2018                        |

| Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
|------------------------|---------------------------|-----------------------------|
| 2                      | 904                       | 655                         |
| 5                      | 2122                      | 1491                        |
| 10                     | 5062                      | 3317                        |
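
For reference, one of those MAX aggregations on the serial path looks roughly like this, if I'm reading the 9.x association-faceting API right (the `popularity` field and `geoname` dimension are made up; `fc`, `taxoReader`, and `config` are the collector, taxonomy reader, and facets config from the driver):

```java
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.AssociationAggregationFunction;
import org.apache.lucene.facet.taxonomy.TaxonomyFacetFloatAssociations;
import org.apache.lucene.search.DoubleValuesSource;

// One MAX aggregation on the serial path: max("popularity") per facet ordinal.
// Every additional aggregation repeats this whole pass over the collected hits.
Facets maxPopularity =
    new TaxonomyFacetFloatAssociations(
        FacetsConfig.DEFAULT_INDEX_FIELD_NAME,
        taxoReader,
        config,
        fc,
        AssociationAggregationFunction.MAX,
        DoubleValuesSource.fromLongField("popularity"));
maxPopularity.getTopChildren(10, "geoname");
```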

With 10 aggregations, we're saving a second or more. That is significant for my
use-case.

I'd like to know if the test and results seem reasonable. If so, maybe we can think about providing this functionality.
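
To make "this functionality" a bit more concrete, the entry point I'm imagining would accept several aggregations up front and resolve all of them in a single pass over the ordinals. This is purely a strawman; none of these names exist in Lucene today:

```java
import java.io.IOException;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.taxonomy.AssociationAggregationFunction;
import org.apache.lucene.search.DoubleValuesSource;

// Strawman only: decode each hit's ordinals once, then feed every registered
// aggregation from that single pass instead of one pass per Facets instance.
public interface MultiAggregationFacets {

  /** Register an aggregation, e.g. MAX or SUM over a doc values source. */
  void addAggregation(String name,
                      AssociationAggregationFunction fn,
                      DoubleValuesSource values);

  /** Walk the match set once, computing every registered aggregation. */
  void aggregate(FacetsCollector matchSet) throws IOException;

  /** Results for one of the registered aggregations. */
  FacetResult getTopChildren(String aggregationName, int topN, String dim)
      throws IOException;
}
```

Keying results by aggregation name is just one option; it could equally hand back a separate Facets instance per registered aggregation.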

Thanks,
Stefan

[1] https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
[2] https://download.geonames.org/export/dump/allCountries.zip


On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmil...@gmail.com> wrote:
>
> Thanks for the follow up Stefan. If you find significant overhead
> associated with the multiple iterations, please keep challenging the
> current approach and suggest improvements. It's always good to revisit
> these things!
>
> Cheers,
> -Greg
>
> On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <stefan.vod...@gmail.com>
> wrote:
>
> > Hi Greg,
> >
> > To better understand how much work gets duplicated, I went ahead
> > and modified FloatTaxonomyFacets as an example [1]. It doesn't look
> > too pretty, but it illustrates how I think multiple aggregations in one
> > iteration could work.
> >
> > Overall, you're right, there's not as much wasted work as I had
> > expected. I'll try to do a performance comparison to quantify precisely
> > how much time we could save, just in case.
> >
> > Thank you for the help!
> > Stefan
> >
> > [1]
> > https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> >
> > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmil...@gmail.com> wrote:
> > >
> > > Hi Stefan-
> > >
> > > > In that case, iterating twice duplicates most of the work, correct?
> > >
> > > I'm not sure I'd agree that it duplicates "most" of the work. This is an
> > > association faceting example, which is a little bit of a special case in
> > > some ways. But, to your question, there is duplicated work here of
> > > re-loading the ordinals across the two aggregations, but I would suspect
> > > the more expensive work is actually computing the different aggregations,
> > > which is not duplicated. You're right that it would likely be more
> > > efficient to iterate the hits once, loading the ordinals once and
> > computing
> > > multiple aggregations in one pass. There's no facility for doing that
> > > currently in Lucene's faceting module, but you could always propose it!
> > :)
> > > That said, I'm not sure how common of a case this really is for the
> > > majority of users? But that's just a guess/assumption.
> > >
> > > Cheers,
> > > -Greg
> > >
> > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <stefan.vod...@gmail.com>
> > > wrote:
> > >
> > > > Hi Greg,
> > > >
> > > > I see now where my example didn’t give enough info. In my mind, `Genre
> > /
> > > > Author nationality / Author name` is stored in one hierarchical facet
> > > > field.
> > > > The data we’re aggregating over, like publish date or price, are
> > stored in
> > > > DocValues.
> > > >
> > > > The demo package shows something similar [1], where the aggregation
> > > > is computed across a facet field using data from a `popularity`
> > DocValue.
> > > >
> > > > In the demo, we compute `sum(_score * sqrt(popularity))`, but what if
> > we
> > > > want several other different aggregations with respect to the same
> > facet
> > > > field? Maybe we want `max(popularity)`. In that case, iterating twice
> > > > duplicates most of the work, correct?
> > > >
> > > >
> > > > Stefan
> > > >
> > > > [1]
> > > >
> > https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > >
> > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <gsmil...@gmail.com> wrote:
> > > > >
> > > > > Hi Stefan-
> > > > >
> > > > > That helps, thanks. I'm a bit confused about where you're concerned
> > with
> > > > > iterating over the match set multiple times. Is this a situation
> > where
> > > > the
> > > > > ordinals you want to facet over are stored in different index
> > fields, so
> > > > > you have to create multiple Facets instances (one per field) to
> > compute
> > > > the
> > > > > aggregations? If that's the case, then yes—you have to iterate over
> > the
> > > > > match set multiple times (once per field). I'm not sure that's such
> > a big
> > > > > issue given that you're doing novel work during each iteration, so
> > the
> > > > only
> > > > > repetitive cost is actually iterating the hits. If the ordinals are
> > > > > "packed" into the same field though (which is the default in Lucene
> > if
> > > > > you're using taxonomy faceting), then you should only need to do a
> > single
> > > > > iteration over that field.
> > > > >
> > > > > Cheers,
> > > > > -Greg
> > > > >
> > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <
> > stefan.vod...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Greg,
> > > > > >
> > > > > > I’m assuming we have one match-set which was not constrained by any
> > > > > > of the categories we want to aggregate over, so it may have books
> > by
> > > > > > Mark Twain, books by American authors, and sci-fi books.
> > > > > >
> > > > > > Maybe we can imagine we obtained it by searching for a keyword, say
> > > > > > “Washington”, which is present in Mark Twain’s writing, and those
> > of
> > > > other
> > > > > > American authors, and in sci-fi novels too.
> > > > > >
> > > > > > Does that make the example clearer?
> > > > > >
> > > > > >
> > > > > > Stefan
> > > > > >
> > > > > >
> > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <gsmil...@gmail.com>
> > wrote:
> > > > > > >
> > > > > > > Hi Stefan-
> > > > > > >
> > > > > > > Can you clarify your example a little bit? It sounds like you
> > want to
> > > > > > facet
> > > > > > > over three different match sets (one constrained by "Mark Twain"
> > as
> > > > the
> > > > > > > author, one constrained by "American authors" and one
> > constrained by
> > > > the
> > > > > > > "sci-fi" genre). Is that correct?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -Greg
> > > > > > >
> > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <
> > > > stefan.vod...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Let’s say I have an index of books, similar to the example in
> > the
> > > > facet
> > > > > > > > demo [1]
> > > > > > > > with a hierarchical facet field encapsulating `Genre / Author’s
> > > > > > > > nationality /
> > > > > > > > Author’s name`.
> > > > > > > >
> > > > > > > > I might like to find the latest publish date of a book written
> > by
> > > > Mark
> > > > > > > > Twain, the
> > > > > > > > sum of the prices of books written by American authors, and the
> > > > number
> > > > > > of
> > > > > > > > sci-fi novels.
> > > > > > > >
> > > > > > > > As far as I understand, this would require faceting 3 times
> > over
> > > > the
> > > > > > > > match-set,
> > > > > > > > one iteration for each aggregation of a different type
> > (max(date),
> > > > > > > > sum(price),
> > > > > > > > count). That seems inefficient if we could instead compute all
> > > > > > > > aggregations in
> > > > > > > > one pass.
> > > > > > > >
> > > > > > > > Is there a way to do that?
> > > > > > > >
> > > > > > > >
> > > > > > > > Stefan
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > >
> > > >
> > https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
