Hi Greg,

The PR looks great. I think it's a useful feature to have and it helps with the
use-case we were discussing. I left a comment with some other ideas that I'd
like to explore.

Thank you for coding this up,
Stefan

On Sun, 5 Mar 2023 at 19:33, Greg Miller <gsmil...@gmail.com> wrote:
>
> Hi Stefan-
>
> I cobbled together a draft PR that I _think_ is what you're looking for so
> we can have something to talk about. Please let me know if this misses the
> mark, or is what you had in mind. If so, we could open an issue to propose
> the idea of adding something like this. I'm not totally convinced I like it
> (I think the expression syntax/API is a little wonky), but that's something
> we could discuss in an issue.
>
> https://github.com/apache/lucene/pull/12184
>
> Cheers,
> -Greg
>
> On Fri, Feb 24, 2023 at 1:57 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > Greg and I discussed a bit offline. His assessment was right - I’m not
> > looking to compute multiple values per ordinal as an end in itself. That is
> > only a means to compute a single value which depends on other facet
> > results. This section from the previous email explains it really well:
> >
> > > For example, if we're using the geonames data you have in your example,
> > > maybe the value you want to associate with a given path is something like
> > > `max(population) + sum(elevation)`, where `max(population)` and
> > > `sum(elevation)` are the result of two independent facet associations.
> > > Then, you could combine those results through some expression to derive a
> > > single value for a given path.
> >
> > Ideally, I could facet using an expression which binds other aggregations.
> > The user experience might be as simple as defining the expression and
> > making a single faceting call. Has anyone worked on something similar?
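To make the proposed user experience concrete, here is a minimal sketch in plain Java. Every name in it (`Doc`, `PathAggregates`, the expression combiner) is hypothetical; Lucene's facet module has no such API today:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch: compute two aggregations per facet path in one pass,
// then bind them into a single expression-derived value per path.
public class ExpressionFacetSketch {

  // A document row: facet path plus the two doc values we aggregate over.
  record Doc(String path, long population, long elevation) {}

  // Per-path result of the two independent aggregations.
  static class PathAggregates {
    double maxPopulation = Double.NEGATIVE_INFINITY;
    double sumElevation = 0;
  }

  // One pass over the match set, updating both aggregations per path.
  static Map<String, PathAggregates> aggregate(List<Doc> matchSet) {
    Map<String, PathAggregates> byPath = new HashMap<>();
    for (Doc doc : matchSet) {
      PathAggregates agg = byPath.computeIfAbsent(doc.path(), p -> new PathAggregates());
      agg.maxPopulation = Math.max(agg.maxPopulation, doc.population());
      agg.sumElevation += doc.elevation();
    }
    return byPath;
  }

  // Bind the per-path aggregates into a single value via an "expression",
  // e.g. max(population) + sum(elevation).
  static Map<String, Double> evaluate(Map<String, PathAggregates> byPath,
                                      DoubleBinaryOperator expression) {
    Map<String, Double> result = new HashMap<>();
    byPath.forEach((path, agg) ->
        result.put(path, expression.applyAsDouble(agg.maxPopulation, agg.sumElevation)));
    return result;
  }
}
```

The single faceting call would then look like `evaluate(aggregate(docs), (max, sum) -> max + sum)`, with the lambda standing in for a compiled expression.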
> >
> > Best,
> > Stefan
> >
> > On Thu, 23 Feb 2023 at 16:53, Greg Miller <gsmil...@gmail.com> wrote:
> > >
> > > Thanks for the detailed benchmarking Stefan! I think you've demonstrated
> > > here that looping over the collected hits multiple times does in fact add
> > > meaningful overhead. That's interesting to see!
> > >
> > > As for whether-or-not to add functionality to the facets module that
> > > supports this, I'm not convinced at this point. I think what you're
> > > suggesting here—but please correct me if I'm wrong—is supporting
> > > association faceting where the user wants to compute multiple association
> > > aggregations for the same dimensions in a single pass. Where I'm
> > > struggling to connect a real-world use-case though is what the user is
> > > going to actually do with those multiple association values. The Facets
> > > API today (
> > > https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java
> > > ) has a pretty firm assumption built in that dimensions/paths have a
> > > single value associated with them. So building some sort of association
> > > faceting implementation that exposes more than one value associated with
> > > a given dimension/path is a significant change to the current model, and
> > > I'm not sure it supports enough real-world use to warrant the complexity.
> > >
> > > OK, now disclaimer: Stefan and I work together so I think I have an idea
> > > of what he's doing here...
> > >
> > > What I think you're actually after here—and the one use-case I could
> > > imagine some other users being interested in—is computing a single value
> > > for each dimension/path that is actually an expression over _other_
> > > aggregated values. For example, if we're using the geonames data you have
> > > in your example, maybe the value you want to associate with a given path
> > > is something like `max(population) + sum(elevation)`, where
> > > `max(population)` and `sum(elevation)` are the result of two independent
> > > facet associations. Then, you could combine those results through some
> > > expression to derive a single value for a given path. That end result
> > > still fits the Facets API well, but supporting something like this in
> > > Lucene requires a few other primitives beyond just the ability to compute
> > > multiple associations at the same time. Primarily, it needs some version
> > > of Expression + Bindings that works for dimensions/paths. So I don't
> > > think the ability to compute multiple associations at once is really the
> > > key missing feature here, and I don't think it adds significant value on
> > > its own to warrant the complexity of trying to expose it through the
> > > existing Facets API. Of course, there's nothing preventing users from
> > > building this "multiple association" functionality themselves.
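The missing primitive described above, an Expression + Bindings analogue that resolves names to per-path aggregates rather than per-document values, might be sketched like this. All names here (`PathBindings`, `PathExpression`) are hypothetical, not Lucene API; Lucene's expressions module only binds per-document value sources today:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "Expression + Bindings" at the dimension/path level:
// a binding resolves an aggregation name (e.g. "max_population") to the value
// already aggregated for a given path, and an expression combines bindings.
public class PathExpressionSketch {

  // Maps an aggregation name to its per-path results.
  static class PathBindings {
    private final Map<String, Map<String, Double>> aggregations = new HashMap<>();

    void add(String name, Map<String, Double> perPathValues) {
      aggregations.put(name, perPathValues);
    }

    double resolve(String name, String path) {
      return aggregations.get(name).getOrDefault(path, 0.0);
    }
  }

  // An expression evaluated against the bindings for one path.
  interface PathExpression {
    double evaluate(PathBindings bindings, String path);
  }

  // Example: max_population + sum_elevation, expressed over the bindings.
  static final PathExpression MAX_POP_PLUS_SUM_ELEV =
      (bindings, path) ->
          bindings.resolve("max_population", path) + bindings.resolve("sum_elevation", path);
}
```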
> > >
> > > That's my take on this, but maybe I'm missing some other use-cases that
> > > could justify adding this capability in a general way? What do you think?
> > >
> > > Cheers,
> > > -Greg
> > >
> > > On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:
> > >
> > > > After benchmarking my implementation against the existing one, I think
> > > > there is some meaningful overhead. I built a small driver [1] that runs
> > > > the two solutions over a geo-data [2] index (thank you Greg for
> > > > providing the indexing code!).
> > > >
> > > > The table below lists faceting times in milliseconds. I’ve named the
> > > > current implementation serial and my proposal parallel, for lack of
> > > > better names. The aggregation function is a no-op, so we’re only
> > > > measuring the time spent outside aggregation. The measurements are over
> > > > a match-set of 100k docs, but the number of docs does not have a large
> > > > impact on the results because the aggregation function isn’t doing any
> > > > work.
> > > >
> > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > |------------------------|---------------------------|-----------------------------|
> > > > | 2                      | 510                       | 328                         |
> > > > | 5                      | 1211                      | 775                         |
> > > > | 10                     | 2366                      | 1301                        |
> > > >
> > > > If we use a MAX aggregation over a DocValue instead, the results tell
> > > > a similar story. In this case, the number of docs matters. I've
> > > > attached results for 10 docs and 100k docs.
> > > >
> > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > |------------------------|---------------------------|-----------------------------|
> > > > | 2                      | 706                       | 505                         |
> > > > | 5                      | 1618                      | 1119                        |
> > > > | 10                     | 3152                      | 2018                        |
> > > >
> > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > |------------------------|---------------------------|-----------------------------|
> > > > | 2                      | 904                       | 655                         |
> > > > | 5                      | 2122                      | 1491                        |
> > > > | 10                     | 5062                      | 3317                        |
> > > >
> > > > With 10 aggregations, we're saving a second or more. That is
> > > > significant for my use-case.
> > > >
> > > > I'd like to know if the test and results seem reasonable. If so, maybe
> > > > we can think about providing this functionality.
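The two shapes being benchmarked can be illustrated with a toy sketch. This is not the driver from [1], just a self-contained illustration of "serial" (one pass over the hits per aggregation) versus "parallel" (one pass feeding every aggregation); the names and types are invented for the example:

```java
import java.util.List;
import java.util.function.LongBinaryOperator;

// Toy illustration of the two shapes being benchmarked: "serial" runs one
// full pass over the hit values per aggregation; "parallel" runs a single
// pass that updates every aggregation as each value is loaded.
// Accumulators start at 0, which is fine for sums and for maxima of
// non-negative values.
public class SerialVsParallelSketch {

  // Serial: k aggregations => k full iterations over the match set.
  static long[] serial(long[] docValues, List<LongBinaryOperator> aggregations) {
    long[] results = new long[aggregations.size()];
    for (int i = 0; i < aggregations.size(); i++) {
      long acc = 0;
      for (long value : docValues) {
        acc = aggregations.get(i).applyAsLong(acc, value);
      }
      results[i] = acc;
    }
    return results;
  }

  // Parallel: one iteration, each loaded value feeds all aggregations.
  static long[] parallel(long[] docValues, List<LongBinaryOperator> aggregations) {
    long[] results = new long[aggregations.size()];
    for (long value : docValues) {
      for (int i = 0; i < aggregations.size(); i++) {
        results[i] = aggregations.get(i).applyAsLong(results[i], value);
      }
    }
    return results;
  }
}
```

Both produce identical results; the benchmark above is measuring how much the extra iterations in the serial shape cost once the per-value work is cheap.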
> > > >
> > > > Thanks,
> > > > Stefan
> > > >
> > > > [1] https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > > > [2] https://download.geonames.org/export/dump/allCountries.zip
> > > >
> > > >
> > > > On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmil...@gmail.com> wrote:
> > > > >
> > > > > Thanks for the follow up Stefan. If you find significant overhead
> > > > > associated with the multiple iterations, please keep challenging the
> > > > > current approach and suggest improvements. It's always good to
> > > > > revisit these things!
> > > > >
> > > > > Cheers,
> > > > > -Greg
> > > > >
> > > > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:
> > > > >
> > > > > > Hi Greg,
> > > > > >
> > > > > > To better understand how much work gets duplicated, I went ahead
> > > > > > and modified FloatTaxonomyFacets as an example [1]. It doesn't look
> > > > > > too pretty, but it illustrates how I think multiple aggregations in
> > > > > > one iteration could work.
> > > > > >
> > > > > > Overall, you're right, there's not as much wasted work as I had
> > > > > > expected. I'll try to do a performance comparison to quantify
> > > > > > precisely how much time we could save, just in case.
> > > > > >
> > > > > > Thank you for the help!
> > > > > > Stefan
> > > > > >
> > > > > > [1] https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > > > >
> > > > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmil...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Stefan-
> > > > > > >
> > > > > > > > In that case, iterating twice duplicates most of the work, correct?
> > > > > > >
> > > > > > > I'm not sure I'd agree that it duplicates "most" of the work. This
> > > > > > > is an association faceting example, which is a little bit of a
> > > > > > > special case in some ways. But, to your question, there is
> > > > > > > duplicated work here of re-loading the ordinals across the two
> > > > > > > aggregations, but I would suspect the more expensive work is
> > > > > > > actually computing the different aggregations, which is not
> > > > > > > duplicated. You're right that it would likely be more efficient to
> > > > > > > iterate the hits once, loading the ordinals once and computing
> > > > > > > multiple aggregations in one pass. There's no facility for doing
> > > > > > > that currently in Lucene's faceting module, but you could always
> > > > > > > propose it! :)
> > > > > > > That said, I'm not sure how common of a case this really is for
> > > > > > > the majority of users? But that's just a guess/assumption.
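The "load the ordinals once, compute multiple aggregations in one pass" shape can be sketched in plain Java. The `Hit` record and the accumulator arrays are stand-ins for what a taxonomy facet implementation would read from the index; none of this is Lucene API:

```java
// Hypothetical sketch of "load ordinals once, compute multiple aggregations
// in one pass". Ordinals and per-doc values stand in for what a taxonomy
// facet implementation would read from the index.
public class MultiAggregationSketch {

  // One collected hit: its facet ordinal plus the value being aggregated.
  record Hit(int ordinal, long value) {}

  // Per-ordinal accumulators for two aggregations, updated in a single pass.
  // Returns {sum-per-ordinal, max-per-ordinal}.
  static long[][] aggregate(Hit[] hits, int ordinalCount) {
    long[] sum = new long[ordinalCount];
    long[] max = new long[ordinalCount]; // assumes non-negative values
    for (Hit hit : hits) {
      int ord = hit.ordinal();           // the ordinal is loaded once per hit...
      sum[ord] += hit.value();           // ...and feeds every aggregation
      max[ord] = Math.max(max[ord], hit.value());
    }
    return new long[][] {sum, max};
  }
}
```

Compared with running two separate facet computations, only the ordinal load and the hit iteration are shared; the per-aggregation arithmetic is the same either way, which matches the point about where the expensive work really is.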
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -Greg
> > > > > > >
> > > > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <stefan.vod...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Greg,
> > > > > > > >
> > > > > > > > I see now where my example didn’t give enough info. In my mind,
> > > > > > > > `Genre / Author nationality / Author name` is stored in one
> > > > > > > > hierarchical facet field. The data we’re aggregating over, like
> > > > > > > > publish date or price, are stored in DocValues.
> > > > > > > >
> > > > > > > > The demo package shows something similar [1], where the
> > > > > > > > aggregation is computed across a facet field using data from a
> > > > > > > > `popularity` DocValue.
> > > > > > > >
> > > > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`, but
> > > > > > > > what if we want several other different aggregations with
> > > > > > > > respect to the same facet field? Maybe we want
> > > > > > > > `max(popularity)`. In that case, iterating twice duplicates
> > > > > > > > most of the work, correct?
> > > > > > > >
> > > > > > > >
> > > > > > > > Stefan
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > > > >
> > > > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <gsmil...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Stefan-
> > > > > > > > >
> > > > > > > > > That helps, thanks. I'm a bit confused about where you're
> > > > > > > > > concerned with iterating over the match set multiple times.
> > > > > > > > > Is this a situation where the ordinals you want to facet over
> > > > > > > > > are stored in different index fields, so you have to create
> > > > > > > > > multiple Facets instances (one per field) to compute the
> > > > > > > > > aggregations? If that's the case, then yes—you have to
> > > > > > > > > iterate over the match set multiple times (once per field).
> > > > > > > > > I'm not sure that's such a big issue given that you're doing
> > > > > > > > > novel work during each iteration, so the only repetitive cost
> > > > > > > > > is actually iterating the hits. If the ordinals are "packed"
> > > > > > > > > into the same field though (which is the default in Lucene if
> > > > > > > > > you're using taxonomy faceting), then you should only need to
> > > > > > > > > do a single iteration over that field.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > -Greg
> > > > > > > > >
> > > > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <stefan.vod...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Greg,
> > > > > > > > > >
> > > > > > > > > > I’m assuming we have one match-set which was not
> > > > > > > > > > constrained by any of the categories we want to aggregate
> > > > > > > > > > over, so it may have books by Mark Twain, books by American
> > > > > > > > > > authors, and sci-fi books.
> > > > > > > > > >
> > > > > > > > > > Maybe we can imagine we obtained it by searching for a
> > > > > > > > > > keyword, say “Washington”, which is present in Mark Twain’s
> > > > > > > > > > writing, in that of other American authors, and in sci-fi
> > > > > > > > > > novels too.
> > > > > > > > > >
> > > > > > > > > > Does that make the example clearer?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Stefan
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <gsmil...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Stefan-
> > > > > > > > > > >
> > > > > > > > > > > Can you clarify your example a little bit? It sounds like
> > > > > > > > > > > you want to facet over three different match sets (one
> > > > > > > > > > > constrained by "Mark Twain" as the author, one constrained
> > > > > > > > > > > by "American authors" and one constrained by the "sci-fi"
> > > > > > > > > > > genre). Is that correct?
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > -Greg
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <stefan.vod...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > Let’s say I have an index of books, similar to the
> > > > > > > > > > > > example in the facet demo [1], with a hierarchical
> > > > > > > > > > > > facet field encapsulating `Genre / Author’s nationality
> > > > > > > > > > > > / Author’s name`.
> > > > > > > > > > > >
> > > > > > > > > > > > I might like to find the latest publish date of a book
> > > > > > > > > > > > written by Mark Twain, the sum of the prices of books
> > > > > > > > > > > > written by American authors, and the number of sci-fi
> > > > > > > > > > > > novels.
> > > > > > > > > > > >
> > > > > > > > > > > > As far as I understand, this would require faceting 3
> > > > > > > > > > > > times over the match-set, one iteration for each
> > > > > > > > > > > > aggregation of a different type (max(date), sum(price),
> > > > > > > > > > > > count). That seems inefficient if we could instead
> > > > > > > > > > > > compute all aggregations in one pass.
> > > > > > > > > > > >
> > > > > > > > > > > > Is there a way to do that?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Stefan
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
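The one-pass computation asked about in the question above can be sketched in plain Java. The books example and all names (`Book`, `GenreStats`) are invented for illustration; this is what a single iteration would compute, not Lucene facet API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy single-pass sketch for the three aggregations in the question:
// count of novels, sum of prices, and latest publish year, all computed
// per genre while iterating the match set exactly once.
public class BooksOnePassSketch {

  record Book(String genre, double price, int publishYear) {}

  record GenreStats(long count, double sumPrice, int maxPublishYear) {}

  static Map<String, GenreStats> aggregate(List<Book> matchSet) {
    Map<String, GenreStats> byGenre = new HashMap<>();
    for (Book book : matchSet) {
      // merge combines the running stats for the genre with this book's stats
      byGenre.merge(
          book.genre(),
          new GenreStats(1, book.price(), book.publishYear()),
          (a, b) -> new GenreStats(
              a.count() + b.count(),
              a.sumPrice() + b.sumPrice(),
              Math.max(a.maxPublishYear(), b.maxPublishYear())));
    }
    return byGenre;
  }
}
```

The same pattern would apply at any level of the hierarchical path (genre, nationality, or author), since each aggregation only needs the hit's path and its DocValues once per document.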

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
