Re: Computing multiple different aggregations over a match-set in one pass

Stefan Vodita Sat, 09 Sep 2023 14:21:08 -0700

Hi everyone,

I ended up using the idea of doing multiple aggregations in one go and it was
a nice improvement. Maybe we can reconsider introducing this? I've opened an
issue [1] and published a PR [2] based on the code I had previously shared,
with some extra tests and a few improvements.


Stefan

[1] https://github.com/apache/lucene/issues/12546
[2] https://github.com/apache/lucene/pull/12547
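
For readers skimming the thread, here is a rough sketch of the one-pass idea
discussed in the quoted messages: visit each hit once, update several
aggregations for its facet path, then combine them into a single value such as
`max(population) + sum(elevation)`. This is plain Java for illustration only.
It is not the code from the PR and does not use the Lucene facets API; the
`Hit` and `Aggregates` classes and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: plain Java, not the Lucene facets API. */
final class OnePassAggregationSketch {

  /** A matched document: one facet path plus two numeric values (hypothetical shape). */
  static final class Hit {
    final String path;
    final long population;
    final long elevation;

    Hit(String path, long population, long elevation) {
      this.path = path;
      this.population = population;
      this.elevation = elevation;
    }
  }

  /** Running aggregates kept for each facet path. */
  static final class Aggregates {
    long maxPopulation = Long.MIN_VALUE;
    long sumElevation = 0;

    /** The combined per-path value: max(population) + sum(elevation). */
    long combined() {
      return maxPopulation + sumElevation;
    }
  }

  /** Visits every hit exactly once and updates both aggregations for its path. */
  static Map<String, Aggregates> aggregate(List<Hit> hits) {
    Map<String, Aggregates> byPath = new LinkedHashMap<>();
    for (Hit hit : hits) {
      Aggregates agg = byPath.computeIfAbsent(hit.path, p -> new Aggregates());
      agg.maxPopulation = Math.max(agg.maxPopulation, hit.population);
      agg.sumElevation += hit.elevation;
    }
    return byPath;
  }

  public static void main(String[] args) {
    // Made-up sample data, two paths.
    List<Hit> hits = new ArrayList<>();
    hits.add(new Hit("Europe/RO", 1_800_000, 70));
    hits.add(new Hit("Europe/RO", 300_000, 400));
    hits.add(new Hit("Europe/DE", 3_600_000, 34));

    aggregate(hits).forEach((path, agg) ->
        System.out.println(path + " -> " + agg.combined()));
  }
}
```

The serial alternative would loop over `hits` once per aggregation; the sketch
only shows where the repeated iteration goes away, not how ordinals or
DocValues are actually loaded in Lucene.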

On Mon, 6 Mar 2023 at 19:46, Greg Miller <gsmil...@gmail.com> wrote:

> Hi Stefan-
>
> I opened https://github.com/apache/lucene/issues/12190 where we can discuss
> this further. Thanks for raising the idea!
>
> Cheers,
> -Greg
>
> On Mon, Mar 6, 2023 at 7:21 AM Stefan Vodita <stefan.vod...@gmail.com>
> wrote:
>
> > Hi Greg,
> >
> > The PR looks great. I think it's a useful feature to have and it helps
> > with the use-case we were discussing. I left a comment with some other
> > ideas that I'd like to explore.
> >
> > Thank you for coding this up,
> > Stefan
> >
> > On Sun, 5 Mar 2023 at 19:33, Greg Miller <gsmil...@gmail.com> wrote:
> > >
> > > Hi Stefan-
> > >
> > > I cobbled together a draft PR that I _think_ is what you're looking for
> > > so we can have something to talk about. Please let me know if this
> > > misses the mark, or is what you had in mind. If so, we could open an
> > > issue to propose the idea of adding something like this. I'm not totally
> > > convinced I like it (I think the expression syntax/API is a little
> > > wonky), but that's something we could discuss in an issue.
> > >
> > > https://github.com/apache/lucene/pull/12184
> > >
> > > Cheers,
> > > -Greg
> > >
> > > On Fri, Feb 24, 2023 at 1:57 PM Stefan Vodita <stefan.vod...@gmail.com>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Greg and I discussed a bit offline. His assessment was right - I’m not
> > > > looking to compute multiple values per ordinal as an end in itself.
> > > > That is only a means to compute a single value which depends on other
> > > > facet results. This section from the previous email explains it really
> > > > well:
> > > >
> > > > > For example, if we're using the geonames data you have in your
> > > > > example, maybe the value you want to associate with a given path is
> > > > > something like `max(population) + sum(elevation)`, where
> > > > > `max(population)` and `sum(elevation)` are the result of two
> > > > > independent facet associations. Then, you could combine those results
> > > > > through some expression to derive a single value for a given path.
> > > >
> > > > Ideally, I could facet using an expression which binds other
> > > > aggregations. The user experience might be as simple as defining the
> > > > expression and making a single faceting call. Has anyone worked on
> > > > something similar?
> > > >
> > > > Best,
> > > > Stefan
> > > >
> > > > On Thu, 23 Feb 2023 at 16:53, Greg Miller <gsmil...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for the detailed benchmarking Stefan! I think you've
> > > > > > demonstrated here that looping over the collected hits multiple
> > > > > > times does in fact add meaningful overhead. That's interesting to
> > > > > > see!
> > > > >
> > > > > > As for whether or not to add functionality to the facets module
> > > > > > that supports this, I'm not convinced at this point. I think what
> > > > > > you're suggesting here (but please correct me if I'm wrong) is
> > > > > > supporting association faceting where the user wants to compute
> > > > > > multiple association aggregations for the same dimensions in a
> > > > > > single pass. Where I'm struggling to connect a real-world use-case
> > > > > > though is what the user is going to actually do with those multiple
> > > > > > association values. The Facets API today
> > > > > > (https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java)
> > > > > > has a pretty firm assumption built in that dimensions/paths have a
> > > > > > single value associated with them. So building some sort of
> > > > > > association faceting implementation that exposes more than one
> > > > > > value associated with a given dimension/path is a significant
> > > > > > change to the current model, and I'm not sure it supports enough
> > > > > > real-world use to warrant the complexity.
> > > > > >
> > > > > > OK, now disclaimer: Stefan and I work together so I think I have an
> > > > > > idea of what he's doing here...
> > > > >
> > > > > > What I think you're actually after here, and the one use-case I
> > > > > > could imagine some other users being interested in, is computing a
> > > > > > single value for each dimension/path that is actually an expression
> > > > > > over _other_ aggregated values. For example, if we're using the
> > > > > > geonames data you have in your example, maybe the value you want to
> > > > > > associate with a given path is something like
> > > > > > `max(population) + sum(elevation)`, where `max(population)` and
> > > > > > `sum(elevation)` are the result of two independent facet
> > > > > > associations. Then, you could combine those results through some
> > > > > > expression to derive a single value for a given path. That end
> > > > > > result still fits the Facets API well, but supporting something
> > > > > > like this in Lucene requires a few other primitives beyond just the
> > > > > > ability to compute multiple associations at the same time.
> > > > > > Primarily, it needs some version of Expression + Bindings that
> > > > > > works for dimensions/paths. So I don't think the ability to compute
> > > > > > multiple associations at once is really the key missing feature
> > > > > > here, and I don't think it adds significant value on its own to
> > > > > > warrant the complexity of trying to expose it through the existing
> > > > > > Facets API. Of course, there's nothing preventing users from
> > > > > > building this "multiple association" functionality themselves.
> > > > >
> > > > > > That's my take on this, but maybe I'm missing some other use-cases
> > > > > > that could justify adding this capability in a general way? What do
> > > > > > you think?
> > > > >
> > > > > Cheers,
> > > > > -Greg
> > > > >
> > > > > > On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <stefan.vod...@gmail.com>
> > > > > > wrote:
> > > > >
> > > > > > > After benchmarking my implementation against the existing one, I
> > > > > > > think there is some meaningful overhead. I built a small driver
> > > > > > > [1] that runs the two solutions over a geo data [2] index (thank
> > > > > > > you Greg for providing the indexing code!).
> > > > > > >
> > > > > > > The table below lists faceting times in milliseconds. I’ve named
> > > > > > > the current implementation serial and my proposal parallel, for
> > > > > > > lack of better names. The aggregation function is a no-op, so
> > > > > > > we’re only measuring the time spent outside aggregation. The
> > > > > > > measurements are over a match-set of 100k docs, but the number of
> > > > > > > docs does not have a large impact on the results because the
> > > > > > > aggregation function isn’t doing any work.
> > > > > >
> > > > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > > > > |------------------------|---------------------------|-----------------------------|
> > > > > > > | 2                      | 510                       | 328                         |
> > > > > > > | 5                      | 1211                      | 775                         |
> > > > > > > | 10                     | 2366                      | 1301                        |
> > > > > >
> > > > > > > If we use a MAX aggregation over a DocValue instead, the results
> > > > > > > tell a similar story. In this case, the number of docs matters.
> > > > > > > I've attached results for 10 docs and 100k docs.
> > > > > >
> > > > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > > > > |------------------------|---------------------------|-----------------------------|
> > > > > > > | 2                      | 706                       | 505                         |
> > > > > > > | 5                      | 1618                      | 1119                        |
> > > > > > > | 10                     | 3152                      | 2018                        |
> > > > > >
> > > > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > > > > |------------------------|---------------------------|-----------------------------|
> > > > > > > | 2                      | 904                       | 655                         |
> > > > > > > | 5                      | 2122                      | 1491                        |
> > > > > > > | 10                     | 5062                      | 3317                        |
> > > > > >
> > > > > > > With 10 aggregations, we're saving a second or more. That is
> > > > > > > significant for my use-case.
> > > > > > >
> > > > > > > I'd like to know if the test and results seem reasonable. If so,
> > > > > > > maybe we can think about providing this functionality.
> > > > > >
> > > > > > Thanks,
> > > > > > Stefan
> > > > > >
> > > > > > [1]
> > > > > >
> > > >
> >
> https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > > > > > [2] https://download.geonames.org/export/dump/allCountries.zip
> > > > > >
> > > > > >
> > > > > > > On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmil...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks for the follow up Stefan. If you find significant
> > > > > > > > overhead associated with the multiple iterations, please keep
> > > > > > > > challenging the current approach and suggest improvements. It's
> > > > > > > > always good to revisit these things!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -Greg
> > > > > > >
> > > > > > > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <stefan.vod...@gmail.com>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Greg,
> > > > > > > >
> > > > > > > > > To better understand how much work gets duplicated, I went
> > > > > > > > > ahead and modified FloatTaxonomyFacets as an example [1]. It
> > > > > > > > > doesn't look too pretty, but it illustrates how I think
> > > > > > > > > multiple aggregations in one iteration could work.
> > > > > > > > >
> > > > > > > > > Overall, you're right, there's not as much wasted work as I
> > > > > > > > > had expected. I'll try to do a performance comparison to
> > > > > > > > > quantify precisely how much time we could save, just in case.
> > > > > > > > >
> > > > > > > > > Thank you for the help!
> > > > > > > > Stefan
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > >
> > > >
> >
> https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > > > > > >
> > > > > > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <
> gsmil...@gmail.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Stefan-
> > > > > > > > >
> > > > > > > > > > > In that case, iterating twice duplicates most of the work, correct?
> > > > > > > > > >
> > > > > > > > > > I'm not sure I'd agree that it duplicates "most" of the
> > > > > > > > > > work. This is an association faceting example, which is a
> > > > > > > > > > little bit of a special case in some ways. But, to your
> > > > > > > > > > question, there is duplicated work here of re-loading the
> > > > > > > > > > ordinals across the two aggregations, but I would suspect
> > > > > > > > > > the more expensive work is actually computing the different
> > > > > > > > > > aggregations, which is not duplicated. You're right that it
> > > > > > > > > > would likely be more efficient to iterate the hits once,
> > > > > > > > > > loading the ordinals once and computing multiple
> > > > > > > > > > aggregations in one pass. There's no facility for doing
> > > > > > > > > > that currently in Lucene's faceting module, but you could
> > > > > > > > > > always propose it! :) That said, I'm not sure how common of
> > > > > > > > > > a case this really is for the majority of users? But that's
> > > > > > > > > > just a guess/assumption.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > -Greg
> > > > > > > > >
> > > > > > > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <stefan.vod...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Greg,
> > > > > > > > > >
> > > > > > > > > > > I see now where my example didn’t give enough info. In my
> > > > > > > > > > > mind, `Genre / Author nationality / Author name` is
> > > > > > > > > > > stored in one hierarchical facet field. The data we’re
> > > > > > > > > > > aggregating over, like publish date or price, are stored
> > > > > > > > > > > in DocValues.
> > > > > > > > > > >
> > > > > > > > > > > The demo package shows something similar [1], where the
> > > > > > > > > > > aggregation is computed across a facet field using data
> > > > > > > > > > > from a `popularity` DocValue.
> > > > > > > > > > >
> > > > > > > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`,
> > > > > > > > > > > but what if we want several other different aggregations
> > > > > > > > > > > with respect to the same facet field? Maybe we want
> > > > > > > > > > > `max(popularity)`. In that case, iterating twice
> > > > > > > > > > > duplicates most of the work, correct?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Stefan
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >
> https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > > > > > >
> > > > > > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <
> > gsmil...@gmail.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Stefan-
> > > > > > > > > > >
> > > > > > > > > > > > That helps, thanks. I'm a bit confused about where
> > > > > > > > > > > > you're concerned with iterating over the match set
> > > > > > > > > > > > multiple times. Is this a situation where the ordinals
> > > > > > > > > > > > you want to facet over are stored in different index
> > > > > > > > > > > > fields, so you have to create multiple Facets instances
> > > > > > > > > > > > (one per field) to compute the aggregations? If that's
> > > > > > > > > > > > the case, then yes, you have to iterate over the match
> > > > > > > > > > > > set multiple times (once per field). I'm not sure
> > > > > > > > > > > > that's such a big issue given that you're doing novel
> > > > > > > > > > > > work during each iteration, so the only repetitive cost
> > > > > > > > > > > > is actually iterating the hits. If the ordinals are
> > > > > > > > > > > > "packed" into the same field though (which is the
> > > > > > > > > > > > default in Lucene if you're using taxonomy faceting),
> > > > > > > > > > > > then you should only need to do a single iteration over
> > > > > > > > > > > > that field.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > -Greg
> > > > > > > > > > >
> > > > > > > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <stefan.vod...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Greg,
> > > > > > > > > > > >
> > > > > > > > > > > > > I’m assuming we have one match-set which was not
> > > > > > > > > > > > > constrained by any of the categories we want to
> > > > > > > > > > > > > aggregate over, so it may have books by Mark Twain,
> > > > > > > > > > > > > books by American authors, and sci-fi books.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Maybe we can imagine we obtained it by searching for
> > > > > > > > > > > > > a keyword, say “Washington”, which is present in Mark
> > > > > > > > > > > > > Twain’s writing, and those of other American authors,
> > > > > > > > > > > > > and in sci-fi novels too.
> > > > > > > > > > > >
> > > > > > > > > > > > Does that make the example clearer?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Stefan
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <gsmil...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Stefan-
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Can you clarify your example a little bit? It
> > > > > > > > > > > > > > sounds like you want to facet over three different
> > > > > > > > > > > > > > match sets (one constrained by "Mark Twain" as the
> > > > > > > > > > > > > > author, one constrained by "American authors" and
> > > > > > > > > > > > > > one constrained by the "sci-fi" genre). Is that
> > > > > > > > > > > > > > correct?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > -Greg
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <stefan.vod...@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Let’s say I have an index of books, similar to
> > > > > > > > > > > > > > > the example in the facet demo [1] with a
> > > > > > > > > > > > > > > hierarchical facet field encapsulating
> > > > > > > > > > > > > > > `Genre / Author’s nationality / Author’s name`.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I might like to find the latest publish date of a
> > > > > > > > > > > > > > > book written by Mark Twain, the sum of the prices
> > > > > > > > > > > > > > > of books written by American authors, and the
> > > > > > > > > > > > > > > number of sci-fi novels.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > As far as I understand, this would require
> > > > > > > > > > > > > > > faceting 3 times over the match-set, one
> > > > > > > > > > > > > > > iteration for each aggregation of a different
> > > > > > > > > > > > > > > type (max(date), sum(price), count). That seems
> > > > > > > > > > > > > > > inefficient if we could instead compute all
> > > > > > > > > > > > > > > aggregations in one pass.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is there a way to do that?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Stefan
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >
> https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > > > > > > > > > To unsubscribe, e-mail:
> > > > > > > > java-user-unsubscr...@lucene.apache.org
> > > > > > > > > > > > > > For additional commands, e-mail:
> > > > > > > > java-user-h...@lucene.apache.org
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > > > > > To unsubscribe, e-mail:
> > > > > > java-user-unsubscr...@lucene.apache.org
> > > > > > > > > > > > For additional commands, e-mail:
> > > > > > java-user-h...@lucene.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail:
> > > > java-user-unsubscr...@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail:
> > > > java-user-h...@lucene.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail:
> > java-user-unsubscr...@lucene.apache.org
> > > > > > > > For additional commands, e-mail:
> > java-user-h...@lucene.apache.org
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > > For additional commands, e-mail:
> java-user-h...@lucene.apache.org
> > > > > >
> > > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
