Hi everyone,

Greg and I discussed a bit offline. His assessment was right - I’m not looking
to compute multiple values per ordinal as an end in itself. That is only a means
to compute a single value which depends on other facet results. This
section from
the previous email explains it really well:

> For example, if we're using the geonames data you have in your example,
> maybe the value you want to associate with a given path is something like
> `max(population) + sum(elevation)`, where `max(population)` and 
> `sum(elevation)`
> are the result of two independent facet associations. Then, you could combine
> those results through some expression to derive a single value for a given
> path.

Ideally, I could facet using an expression which binds other
aggregations. The user
experience might be as simple as defining the expression and making a single
faceting call. Has anyone worked on something similar?
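
To make the combination concrete, here is a toy sketch in plain Java. The maps
stand in for the results of two independent facet associations (all paths and
numbers are made up), and the loop plays the role of the expression deriving a
single value per path:

```java
import java.util.HashMap;
import java.util.Map;

public class CombineFacetResults {
  public static void main(String[] args) {
    // Hypothetical results of two independent association facet
    // aggregations, keyed by facet path.
    Map<String, Long> maxPopulation = new HashMap<>();
    maxPopulation.put("Europe/France", 2_100_000L);
    maxPopulation.put("Europe/Spain", 3_200_000L);

    Map<String, Long> sumElevation = new HashMap<>();
    sumElevation.put("Europe/France", 15_000L);
    sumElevation.put("Europe/Spain", 22_000L);

    // The combining "expression": max(population) + sum(elevation),
    // producing a single value per path.
    Map<String, Long> combined = new HashMap<>();
    for (Map.Entry<String, Long> e : maxPopulation.entrySet()) {
      combined.put(e.getKey(), e.getValue() + sumElevation.getOrDefault(e.getKey(), 0L));
    }

    System.out.println(combined.get("Europe/France")); // prints 2115000
    System.out.println(combined.get("Europe/Spain"));  // prints 3222000
  }
}
```

In the real API, the two input maps would come from two `Facets` instances (or
a single multi-aggregation pass), and the combining expression would ideally be
user-defined rather than hard-coded.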

Best,
Stefan

On Thu, 23 Feb 2023 at 16:53, Greg Miller <gsmil...@gmail.com> wrote:
>
> Thanks for the detailed benchmarking Stefan! I think you've demonstrated
> here that looping over the collected hits multiple times does in fact add
> meaningful overhead. That's interesting to see!
>
> As for whether or not to add functionality to the facets module that
> supports this, I'm not convinced at this point. I think what you're
> suggesting here—but please correct me if I'm wrong—is supporting
> association faceting where the user wants to compute multiple association
> aggregations for the same dimensions in a single pass. Where I'm struggling
> to connect a real-world use-case though is what the user is going to
> actually do with those multiple association values. The Facets API today (
> https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java)
> has a pretty firm assumption built in that dimensions/paths have a single
> value associated with them. So building some sort of association faceting
> implementation that exposes more than one value associated with a given
> dimension/path is a significant change to the current model, and I'm not
> sure it supports enough real-world use to warrant the complexity.
>
> OK, now disclaimer: Stefan and I work together so I think I have an idea of
> what he's doing here...
>
> What I think you're actually after here—and the one use-case I could
> imagine some other users being interested in—is computing a single value
> for each dimension/path that is actually an expression over _other_
> aggregated values. For example, if we're using the geonames data you have
> in your example, maybe the value you want to associate with a given path is
> something like `max(population) + sum(elevation)`, where `max(population)`
> and `sum(elevation)` are the result of two independent facet associations.
> Then, you could combine those results through some expression to derive a
> single value for a given path. That end result still fits the Facets API
> well, but supporting something like this in Lucene requires a few other
> primitives beyond just the ability to compute multiple associations at the
> same time. Primarily, it needs some version of Expression + Bindings that
> works for dimensions/paths. So I don't think the ability to compute
> multiple associations at once is really the key missing feature here, and I
> don't think it adds significant value on its own to warrant the complexity
> of trying to expose it through the existing Facets API. Of course, there's
> nothing preventing users from building this "multiple association"
> functionality themselves.
>
> That's my take on this, but maybe I'm missing some other use-cases that
> could justify adding this capability in a general way? What do you think?
>
> Cheers,
> -Greg
>
> On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <stefan.vod...@gmail.com>
> wrote:
>
> > After benchmarking my implementation against the existing one, I think
> > there is some meaningful overhead. I built a small driver [1] that runs
> > the two solutions over a geo data [2] index (thank you Greg for providing
> > the indexing code!).
> >
> > The table below lists faceting times in milliseconds. I’ve named the
> > current implementation serial and my proposal parallel, for lack of better
> > names. The aggregation function is a no-op, so we’re only measuring the
> > time spent outside aggregation. The measurements are over a match-set of
> > 100k docs, but the number of docs does not have a large impact on the
> > results because the aggregation function isn’t doing any work.
> >
> > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > |------------------------|---------------------------|-----------------------------|
> > | 2                      |                       510 |                         328 |
> > | 5                      |                      1211 |                         775 |
> > | 10                     |                      2366 |                        1301 |
> >
> > If we use a MAX aggregation over a DocValue instead, the results tell a
> > similar story. In this case, the number of docs matters. I've attached
> > results for 10 docs and 100k docs.
> >
> > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > |------------------------|---------------------------|-----------------------------|
> > | 2                      |                       706 |                         505 |
> > | 5                      |                      1618 |                        1119 |
> > | 10                     |                      3152 |                        2018 |
> >
> > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > |------------------------|---------------------------|-----------------------------|
> > | 2                      |                       904 |                         655 |
> > | 5                      |                      2122 |                        1491 |
> > | 10                     |                      5062 |                        3317 |
> >
> > With 10 aggregations, we're saving a second or more. That is significant
> > for my use-case.
> >
> > I'd like to know if the test and results seem reasonable. If so, maybe we
> > can think about providing this functionality.
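
The serial/parallel difference measured above can be sketched in plain Java,
with toy arrays standing in for Lucene's ordinal and value sources (all names
and numbers here are made up). The parallel version loads each hit's ordinal
once and updates every aggregation in the same pass:

```java
import java.util.Arrays;

public class SinglePassAggregations {
  public static void main(String[] args) {
    // Toy match-set: each hit has a facet ordinal and an associated doc value.
    int[] ordinals = {0, 1, 0, 1, 1};
    long[] values = {5, 7, 9, 2, 4};
    int numOrds = 2;

    // "Serial" faceting would run one loop like this per aggregation,
    // re-loading the ordinals each time. The "parallel" version makes a
    // single pass and updates all aggregations per hit.
    long[] max = new long[numOrds];
    long[] sum = new long[numOrds];
    long[] count = new long[numOrds];
    Arrays.fill(max, Long.MIN_VALUE);

    for (int i = 0; i < ordinals.length; i++) {
      int ord = ordinals[i]; // ordinal loaded once per hit...
      long v = values[i];
      max[ord] = Math.max(max[ord], v); // ...and shared by every aggregation
      sum[ord] += v;
      count[ord]++;
    }

    System.out.println(max[0] + " " + sum[0] + " " + count[0]); // prints "9 14 2"
    System.out.println(max[1] + " " + sum[1] + " " + count[1]); // prints "7 13 3"
  }
}
```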
> >
> > Thanks,
> > Stefan
> >
> > [1]
> > https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > [2] https://download.geonames.org/export/dump/allCountries.zip
> >
> >
> > On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmil...@gmail.com> wrote:
> > >
> > > Thanks for the follow up Stefan. If you find significant overhead
> > > associated with the multiple iterations, please keep challenging the
> > > current approach and suggest improvements. It's always good to revisit
> > > these things!
> > >
> > > Cheers,
> > > -Greg
> > >
> > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <stefan.vod...@gmail.com>
> > > wrote:
> > >
> > > > Hi Greg,
> > > >
> > > > To better understand how much work gets duplicated, I went ahead
> > > > and modified FloatTaxonomyFacets as an example [1]. It doesn't look
> > > > too pretty, but it illustrates how I think multiple aggregations in one
> > > > iteration could work.
> > > >
> > > > Overall, you're right, there's not as much wasted work as I had
> > > > expected. I'll try to do a performance comparison to quantify precisely
> > > > how much time we could save, just in case.
> > > >
> > > > Thank you for the help!
> > > > Stefan
> > > >
> > > > [1]
> > > > https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > >
> > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmil...@gmail.com> wrote:
> > > > >
> > > > > Hi Stefan-
> > > > >
> > > > > > In that case, iterating twice duplicates most of the work, correct?
> > > > >
> > > > > I'm not sure I'd agree that it duplicates "most" of the work. This
> > > > > is an association faceting example, which is a little bit of a
> > > > > special case in some ways. But, to your question, there is duplicated
> > > > > work here of re-loading the ordinals across the two aggregations, but
> > > > > I would suspect the more expensive work is actually computing the
> > > > > different aggregations, which is not duplicated. You're right that it
> > > > > would likely be more efficient to iterate the hits once, loading the
> > > > > ordinals once and computing multiple aggregations in one pass.
> > > > > There's no facility for doing that currently in Lucene's faceting
> > > > > module, but you could always propose it! :)
> > > > > That said, I'm not sure how common of a case this really is for the
> > > > > majority of users? But that's just a guess/assumption.
> > > > >
> > > > > Cheers,
> > > > > -Greg
> > > > >
> > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <stefan.vod...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Greg,
> > > > > >
> > > > > > I see now where my example didn’t give enough info. In my mind,
> > > > > > `Genre / Author nationality / Author name` is stored in one
> > > > > > hierarchical facet field. The data we’re aggregating over, like
> > > > > > publish date or price, are stored in DocValues.
> > > > > >
> > > > > > The demo package shows something similar [1], where the
> > > > > > aggregation is computed across a facet field using data from a
> > > > > > `popularity` DocValue.
> > > > > >
> > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`, but what
> > > > > > if we want several other different aggregations with respect to
> > > > > > the same facet field? Maybe we want `max(popularity)`. In that
> > > > > > case, iterating twice duplicates most of the work, correct?
> > > > > >
> > > > > >
> > > > > > Stefan
> > > > > >
> > > > > > [1]
> > > > > > https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > >
> > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <gsmil...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi Stefan-
> > > > > > >
> > > > > > > That helps, thanks. I'm a bit confused about where you're
> > > > > > > concerned with iterating over the match set multiple times. Is
> > > > > > > this a situation where the ordinals you want to facet over are
> > > > > > > stored in different index fields, so you have to create multiple
> > > > > > > Facets instances (one per field) to compute the aggregations? If
> > > > > > > that's the case, then yes—you have to iterate over the match set
> > > > > > > multiple times (once per field). I'm not sure that's such a big
> > > > > > > issue given that you're doing novel work during each iteration,
> > > > > > > so the only repetitive cost is actually iterating the hits. If
> > > > > > > the ordinals are "packed" into the same field though (which is
> > > > > > > the default in Lucene if you're using taxonomy faceting), then
> > > > > > > you should only need to do a single iteration over that field.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -Greg
> > > > > > >
> > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <stefan.vod...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Greg,
> > > > > > > >
> > > > > > > > I’m assuming we have one match-set which was not constrained
> > > > > > > > by any of the categories we want to aggregate over, so it may
> > > > > > > > have books by Mark Twain, books by American authors, and
> > > > > > > > sci-fi books.
> > > > > > > >
> > > > > > > > Maybe we can imagine we obtained it by searching for a
> > > > > > > > keyword, say “Washington”, which is present in Mark Twain’s
> > > > > > > > writings, in those of other American authors, and in sci-fi
> > > > > > > > novels too.
> > > > > > > >
> > > > > > > > Does that make the example clearer?
> > > > > > > >
> > > > > > > >
> > > > > > > > Stefan
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <gsmil...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Stefan-
> > > > > > > > >
> > > > > > > > > Can you clarify your example a little bit? It sounds like
> > > > > > > > > you want to facet over three different match sets (one
> > > > > > > > > constrained by "Mark Twain" as the author, one constrained
> > > > > > > > > by "American authors" and one constrained by the "sci-fi"
> > > > > > > > > genre). Is that correct?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > -Greg
> > > > > > > > >
> > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <stefan.vod...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Let’s say I have an index of books, similar to the example
> > > > > > > > > > in the facet demo [1], with a hierarchical facet field
> > > > > > > > > > encapsulating `Genre / Author’s nationality / Author’s
> > > > > > > > > > name`.
> > > > > > > > > >
> > > > > > > > > > I might like to find the latest publish date of a book
> > > > > > > > > > written by Mark Twain, the sum of the prices of books
> > > > > > > > > > written by American authors, and the number of sci-fi
> > > > > > > > > > novels.
> > > > > > > > > >
> > > > > > > > > > As far as I understand, this would require faceting 3
> > > > > > > > > > times over the match-set, one iteration for each
> > > > > > > > > > aggregation of a different type (max(date), sum(price),
> > > > > > > > > > count). That seems inefficient if we could instead compute
> > > > > > > > > > all aggregations in one pass.
> > > > > > > > > >
> > > > > > > > > > Is there a way to do that?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Stefan
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > > https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >
> >
