Re: Computing multiple different aggregations over a match-set in one pass

Greg Miller Sun, 05 Mar 2023 11:33:26 -0800

Hi Stefan-

I cobbled together a draft PR that I _think_ is what you're looking for so
we can have something to talk about. Please let me know if this misses the
mark, or is what you had in mind. If so, we could open an issue to propose
the idea of adding something like this. I'm not totally convinced I like it
(I think the expression syntax/API is a little wonky), but that's something
we could discuss in an issue.


https://github.com/apache/lucene/pull/12184

Cheers,
-Greg

On Fri, Feb 24, 2023 at 1:57 PM Stefan Vodita <[email protected]>
wrote:

> Hi everyone,
>
> Greg and I discussed a bit offline. His assessment was right - I’m not
> looking to compute multiple values per ordinal as an end in itself. That is
> only a means to compute a single value which depends on other facet results.
> This section from the previous email explains it really well:
>
> > For example, if we're using the geonames data you have in your example,
> > maybe the value you want to associate with a given path is something like
> > `max(population) + sum(elevation)`, where `max(population)` and
> > `sum(elevation)` are the result of two independent facet associations.
> > Then, you could combine those results through some expression to derive a
> > single value for a given path.
>
> Ideally, I could facet using an expression which binds other aggregations.
> The user experience might be as simple as defining the expression and making
> a single faceting call. Has anyone worked on something similar?
>
> Best,
> Stefan
>
> On Thu, 23 Feb 2023 at 16:53, Greg Miller <[email protected]> wrote:
> >
> > Thanks for the detailed benchmarking Stefan! I think you've demonstrated
> > here that looping over the collected hits multiple times does in fact add
> > meaningful overhead. That's interesting to see!
> >
> > As for whether or not to add functionality to the facets module that
> > supports this, I'm not convinced at this point. I think what you're
> > suggesting here (but please correct me if I'm wrong) is supporting
> > association faceting where the user wants to compute multiple association
> > aggregations for the same dimensions in a single pass. Where I'm struggling
> > to connect a real-world use-case, though, is what the user is going to
> > actually do with those multiple association values. The Facets API today
> > (https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java)
> > has a pretty firm assumption built in that dimensions/paths have a single
> > value associated with them. So building some sort of association faceting
> > implementation that exposes more than one value associated with a given
> > dimension/path is a significant change to the current model, and I'm not
> > sure it supports enough real-world use to warrant the complexity.
> >
> > OK, now disclaimer: Stefan and I work together so I think I have an idea of
> > what he's doing here...
> >
> > What I think you're actually after here (and the one use-case I could
> > imagine some other users being interested in) is computing a single value
> > for each dimension/path that is actually an expression over _other_
> > aggregated values. For example, if we're using the geonames data you have
> > in your example, maybe the value you want to associate with a given path is
> > something like `max(population) + sum(elevation)`, where `max(population)`
> > and `sum(elevation)` are the result of two independent facet associations.
> > Then, you could combine those results through some expression to derive a
> > single value for a given path. That end result still fits the Facets API
> > well, but supporting something like this in Lucene requires a few other
> > primitives beyond just the ability to compute multiple associations at the
> > same time. Primarily, it needs some version of Expression + Bindings that
> > works for dimensions/paths. So I don't think the ability to compute
> > multiple associations at once is really the key missing feature here, and I
> > don't think it adds significant value on its own to warrant the complexity
> > of trying to expose it through the existing Facets API. Of course, there's
> > nothing preventing users from building this "multiple association"
> > functionality themselves.
> >
> > That's my take on this, but maybe I'm missing some other use-cases that
> > could justify adding this capability in a general way? What do you think?
> >
> > Cheers,
> > -Greg
> >
> > On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <[email protected]>
> > wrote:
> >
> > > After benchmarking my implementation against the existing one, I think
> > > there is some meaningful overhead. I built a small driver [1] that runs
> > > the two solutions over a geo data [2] index (thank you Greg for providing
> > > the indexing code!).
> > >
> > > The table below lists faceting times in milliseconds. I’ve named the
> > > current implementation serial and my proposal parallel, for lack of better
> > > names. The aggregation function is a no-op, so we’re only measuring the
> > > time spent outside aggregation. The measurements are over a match-set of
> > > 100k docs, but the number of docs does not have a large impact on the
> > > results because the aggregation function isn’t doing any work.
> > >
> > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > |------------------------|---------------------------|-----------------------------|
> > > | 2                      | 510                       | 328                         |
> > > | 5                      | 1211                      | 775                         |
> > > | 10                     | 2366                      | 1301                        |
> > >
> > > If we use a MAX aggregation over a DocValue instead, the results tell a
> > > similar story. In this case, the number of docs matters. I've attached
> > > results for 10 docs and 100k docs.
> > >
> > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > |------------------------|---------------------------|-----------------------------|
> > > | 2                      | 706                       | 505                         |
> > > | 5                      | 1618                      | 1119                        |
> > > | 10                     | 3152                      | 2018                        |
> > >
> > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > |------------------------|---------------------------|-----------------------------|
> > > | 2                      | 904                       | 655                         |
> > > | 5                      | 2122                      | 1491                        |
> > > | 10                     | 5062                      | 3317                        |
> > >
> > > With 10 aggregations, we're saving a second or more. That is significant
> > > for my use-case.
> > >
> > > I'd like to know if the test and results seem reasonable. If so, maybe we
> > > can think about providing this functionality.
> > >
> > > Thanks,
> > > Stefan
> > >
> > > [1] https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > > [2] https://download.geonames.org/export/dump/allCountries.zip
> > >
> > >
> > > On Fri, 17 Feb 2023 at 18:45, Greg Miller <[email protected]> wrote:
> > > >
> > > > Thanks for the follow up Stefan. If you find significant overhead
> > > > associated with the multiple iterations, please keep challenging the
> > > > current approach and suggest improvements. It's always good to revisit
> > > > these things!
> > > >
> > > > Cheers,
> > > > -Greg
> > > >
> > > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Greg,
> > > > >
> > > > > To better understand how much work gets duplicated, I went ahead
> > > > > and modified FloatTaxonomyFacets as an example [1]. It doesn't look
> > > > > too pretty, but it illustrates how I think multiple aggregations in
> > > > > one iteration could work.
> > > > >
> > > > > Overall, you're right, there's not as much wasted work as I had
> > > > > expected. I'll try to do a performance comparison to quantify
> > > > > precisely how much time we could save, just in case.
> > > > >
> > > > > Thank you for the help!
> > > > > Stefan
> > > > >
> > > > > [1] https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > > >
> > > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <[email protected]> wrote:
> > > > > >
> > > > > > Hi Stefan-
> > > > > >
> > > > > > > In that case, iterating twice duplicates most of the work, correct?
> > > > > >
> > > > > > I'm not sure I'd agree that it duplicates "most" of the work. This
> > > > > > is an association faceting example, which is a little bit of a
> > > > > > special case in some ways. But, to your question, there is
> > > > > > duplicated work here of re-loading the ordinals across the two
> > > > > > aggregations, but I would suspect the more expensive work is
> > > > > > actually computing the different aggregations, which is not
> > > > > > duplicated. You're right that it would likely be more efficient to
> > > > > > iterate the hits once, loading the ordinals once and computing
> > > > > > multiple aggregations in one pass. There's no facility for doing
> > > > > > that currently in Lucene's faceting module, but you could always
> > > > > > propose it! :) That said, I'm not sure how common of a case this
> > > > > > really is for the majority of users? But that's just a
> > > > > > guess/assumption.
> > > > > >
> > > > > > Cheers,
> > > > > > -Greg
> > > > > >
> > > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Greg,
> > > > > > >
> > > > > > > I see now where my example didn’t give enough info. In my mind,
> > > > > > > `Genre / Author nationality / Author name` is stored in one
> > > > > > > hierarchical facet field. The data we’re aggregating over, like
> > > > > > > publish date or price, are stored in DocValues.
> > > > > > >
> > > > > > > The demo package shows something similar [1], where the
> > > > > > > aggregation is computed across a facet field using data from a
> > > > > > > `popularity` DocValue.
> > > > > > >
> > > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`, but what
> > > > > > > if we want several other different aggregations with respect to
> > > > > > > the same facet field? Maybe we want `max(popularity)`. In that
> > > > > > > case, iterating twice duplicates most of the work, correct?
> > > > > > >
> > > > > > >
> > > > > > > Stefan
> > > > > > >
> > > > > > > [1] https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > > >
> > > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hi Stefan-
> > > > > > > >
> > > > > > > > That helps, thanks. I'm a bit confused about where you're
> > > > > > > > concerned with iterating over the match set multiple times. Is
> > > > > > > > this a situation where the ordinals you want to facet over are
> > > > > > > > stored in different index fields, so you have to create
> > > > > > > > multiple Facets instances (one per field) to compute the
> > > > > > > > aggregations? If that's the case, then yes, you have to iterate
> > > > > > > > over the match set multiple times (once per field). I'm not
> > > > > > > > sure that's such a big issue given that you're doing novel work
> > > > > > > > during each iteration, so the only repetitive cost is actually
> > > > > > > > iterating the hits. If the ordinals are "packed" into the same
> > > > > > > > field though (which is the default in Lucene if you're using
> > > > > > > > taxonomy faceting), then you should only need to do a single
> > > > > > > > iteration over that field.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > -Greg
> > > > > > > >
> > > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <[email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Greg,
> > > > > > > > >
> > > > > > > > > I’m assuming we have one match-set which was not constrained
> > > > > > > > > by any of the categories we want to aggregate over, so it may
> > > > > > > > > have books by Mark Twain, books by American authors, and
> > > > > > > > > sci-fi books.
> > > > > > > > >
> > > > > > > > > Maybe we can imagine we obtained it by searching for a
> > > > > > > > > keyword, say “Washington”, which is present in Mark Twain’s
> > > > > > > > > writing, and those of other American authors, and in sci-fi
> > > > > > > > > novels too.
> > > > > > > > >
> > > > > > > > > Does that make the example clearer?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Stefan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Stefan-
> > > > > > > > > >
> > > > > > > > > > Can you clarify your example a little bit? It sounds like
> > > > > > > > > > you want to facet over three different match sets (one
> > > > > > > > > > constrained by "Mark Twain" as the author, one constrained
> > > > > > > > > > by "American authors" and one constrained by the "sci-fi"
> > > > > > > > > > genre). Is that correct?
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > -Greg
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <[email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > Let’s say I have an index of books, similar to the example
> > > > > > > > > > > in the facet demo [1] with a hierarchical facet field
> > > > > > > > > > > encapsulating
> > > > > > > > > > > `Genre / Author’s nationality / Author’s name`.
> > > > > > > > > > >
> > > > > > > > > > > I might like to find the latest publish date of a book
> > > > > > > > > > > written by Mark Twain, the sum of the prices of books
> > > > > > > > > > > written by American authors, and the number of sci-fi
> > > > > > > > > > > novels.
> > > > > > > > > > >
> > > > > > > > > > > As far as I understand, this would require faceting 3
> > > > > > > > > > > times over the match-set, one iteration for each
> > > > > > > > > > > aggregation of a different type (max(date), sum(price),
> > > > > > > > > > > count). That seems inefficient if we could instead compute
> > > > > > > > > > > all aggregations in one pass.
> > > > > > > > > > >
> > > > > > > > > > > Is there a way to do that?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Stefan
> > > > > > > > > > >
> > > > > > > > > > > [1] https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > > > > > To unsubscribe, e-mail:
> > > > > [email protected]
> > > > > > > > > > > For additional commands, e-mail:
> > > > > [email protected]
>
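
The single-pass idea discussed in this thread can be sketched in plain Java. This is only a minimal illustration under stated assumptions: it is not the code from the draft PR, it does not use Lucene's facets or expressions modules, and every name in it (`SinglePassAggregations`, `aggregate`, `combine`) is hypothetical. Hits are modelled as parallel arrays, one facet ordinal and one input value per aggregation for each hit. Each hit's ordinal is read once and shared by all aggregations, and the per-ordinal results are then combined with an expression such as `max(population) + sum(elevation)`.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

/** Hypothetical plain-Java sketch; not Lucene's API and not the draft PR. */
final class SinglePassAggregations {

  /**
   * Visits each hit once and updates every aggregation for that hit's ordinal.
   * ordinals[i] is the facet ordinal of hit i; values[a][i] is the input of
   * aggregation a for hit i. Returns result[a][ordinal].
   */
  public static double[][] aggregate(
      int[] ordinals,
      double[][] values,
      double[] identities,
      DoubleBinaryOperator[] folds,
      int ordinalCount) {
    double[][] result = new double[folds.length][ordinalCount];
    for (int a = 0; a < folds.length; a++) {
      Arrays.fill(result[a], identities[a]); // starting value of each aggregation
    }
    for (int hit = 0; hit < ordinals.length; hit++) {
      int ord = ordinals[hit]; // read once per hit, shared by all aggregations
      for (int a = 0; a < folds.length; a++) {
        result[a][ord] = folds[a].applyAsDouble(result[a][ord], values[a][hit]);
      }
    }
    return result;
  }

  /** Combines two per-ordinal aggregates into a single value per ordinal. */
  public static double[] combine(double[] left, double[] right, DoubleBinaryOperator expression) {
    double[] out = new double[left.length];
    for (int ord = 0; ord < out.length; ord++) {
      out[ord] = expression.applyAsDouble(left[ord], right[ord]);
    }
    return out;
  }

  public static void main(String[] args) {
    // Made-up data: four hits spread over two ordinals.
    int[] ordinals = {0, 1, 0, 1};
    double[] population = {100, 50, 300, 20};
    double[] elevation = {10, 5, 7, 1};

    double[][] perOrdinal = aggregate(
        ordinals,
        new double[][] {population, elevation},
        new double[] {Double.NEGATIVE_INFINITY, 0.0}, // identities for max and sum
        new DoubleBinaryOperator[] {Math::max, Double::sum},
        2);

    // max(population) + sum(elevation), one value per ordinal
    double[] value = combine(perOrdinal[0], perOrdinal[1], Double::sum);
    System.out.println(Arrays.toString(value)); // prints [317.0, 56.0]
  }
}
```

The end result is still one value per ordinal, which matches the single-value assumption of the Facets API described earlier in the thread; what a real implementation would additionally need is a way to bind named aggregations into an expression.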
