Hi everyone,

I ended up using the idea of doing multiple aggregations in one go and it
was a nice improvement. Maybe we can reconsider introducing this? I've
opened an issue [1] and published a PR [2] based on the code I had
previously shared, with some extra tests and a few improvements.

Stefan

[1] https://github.com/apache/lucene/issues/12546
[2] https://github.com/apache/lucene/pull/12547

On Mon, 6 Mar 2023 at 19:46, Greg Miller <gsmil...@gmail.com> wrote:

> Hi Stefan-
>
> I opened https://github.com/apache/lucene/issues/12190 where we can
> discuss this further. Thanks for raising the idea!
>
> Cheers,
> -Greg
>
> On Mon, Mar 6, 2023 at 7:21 AM Stefan Vodita <stefan.vod...@gmail.com>
> wrote:
>
> > Hi Greg,
> >
> > The PR looks great. I think it's a useful feature to have and it helps
> > with the use-case we were discussing. I left a comment with some other
> > ideas that I'd like to explore.
> >
> > Thank you for coding this up,
> > Stefan
> >
> > On Sun, 5 Mar 2023 at 19:33, Greg Miller <gsmil...@gmail.com> wrote:
> > >
> > > Hi Stefan-
> > >
> > > I cobbled together a draft PR that I _think_ is what you're looking
> > > for so we can have something to talk about. Please let me know if
> > > this misses the mark, or is what you had in mind. If so, we could
> > > open an issue to propose the idea of adding something like this. I'm
> > > not totally convinced I like it (I think the expression syntax/API is
> > > a little wonky), but that's something we could discuss in an issue.
> > >
> > > https://github.com/apache/lucene/pull/12184
> > >
> > > Cheers,
> > > -Greg
> > >
> > > On Fri, Feb 24, 2023 at 1:57 PM Stefan Vodita
> > > <stefan.vod...@gmail.com> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Greg and I discussed a bit offline. His assessment was right - I’m
> > > > not looking to compute multiple values per ordinal as an end in
> > > > itself. That is only a means to compute a single value which depends
> > > > on other facet results.
> > > > This section from the previous email explains it really well:
> > > >
> > > > > For example, if we're using the geonames data you have in your
> > > > > example, maybe the value you want to associate with a given path
> > > > > is something like `max(population) + sum(elevation)`, where
> > > > > `max(population)` and `sum(elevation)` are the result of two
> > > > > independent facet associations. Then, you could combine those
> > > > > results through some expression to derive a single value for a
> > > > > given path.
> > > >
> > > > Ideally, I could facet using an expression which binds other
> > > > aggregations. The user experience might be as simple as defining the
> > > > expression and making a single faceting call. Has anyone worked on
> > > > something similar?
> > > >
> > > > Best,
> > > > Stefan
> > > >
> > > > On Thu, 23 Feb 2023 at 16:53, Greg Miller <gsmil...@gmail.com> wrote:
> > > > >
> > > > > Thanks for the detailed benchmarking Stefan! I think you've
> > > > > demonstrated here that looping over the collected hits multiple
> > > > > times does in fact add meaningful overhead. That's interesting to
> > > > > see!
> > > > >
> > > > > As for whether-or-not to add functionality to the facets module
> > > > > that supports this, I'm not convinced at this point. I think what
> > > > > you're suggesting here (but please correct me if I'm wrong) is
> > > > > supporting association faceting where the user wants to compute
> > > > > multiple association aggregations for the same dimensions in a
> > > > > single pass. Where I'm struggling to connect a real-world use-case
> > > > > though is what the user is going to actually do with those
> > > > > multiple association values.
> > > > > The Facets API today
> > > > > (https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java)
> > > > > has a pretty firm assumption built in that dimensions/paths have a
> > > > > single value associated with them. So building some sort of
> > > > > association faceting implementation that exposes more than one
> > > > > value associated with a given dimension/path is a significant
> > > > > change to the current model, and I'm not sure it supports enough
> > > > > real-world use to warrant the complexity.
> > > > >
> > > > > OK, now disclaimer: Stefan and I work together so I think I have an
> > > > > idea of what he's doing here...
> > > > >
> > > > > What I think you're actually after here, and the one use-case I
> > > > > could imagine some other users being interested in, is computing a
> > > > > single value for each dimension/path that is actually an expression
> > > > > over _other_ aggregated values. For example, if we're using the
> > > > > geonames data you have in your example, maybe the value you want to
> > > > > associate with a given path is something like
> > > > > `max(population) + sum(elevation)`, where `max(population)` and
> > > > > `sum(elevation)` are the result of two independent facet
> > > > > associations. Then, you could combine those results through some
> > > > > expression to derive a single value for a given path. That end
> > > > > result still fits the Facets API well, but supporting something
> > > > > like this in Lucene requires a few other primitives beyond just the
> > > > > ability to compute multiple associations at the same time.
> > > > > Primarily, it needs some version of Expression + Bindings that
> > > > > works for dimensions/paths.
> > > > > So I don't think the ability to compute multiple associations at
> > > > > once is really the key missing feature here, and I don't think it
> > > > > adds significant value on its own to warrant the complexity of
> > > > > trying to expose it through the existing Facets API. Of course,
> > > > > there's nothing preventing users from building this "multiple
> > > > > association" functionality themselves.
> > > > >
> > > > > That's my take on this, but maybe I'm missing some other use-cases
> > > > > that could justify adding this capability in a general way? What do
> > > > > you think?
> > > > >
> > > > > Cheers,
> > > > > -Greg
> > > > >
> > > > > On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita
> > > > > <stefan.vod...@gmail.com> wrote:
> > > > >
> > > > > > After benchmarking my implementation against the existing one, I
> > > > > > think there is some meaningful overhead. I built a small driver
> > > > > > [1] that runs the two solutions over a geo data [2] index (thank
> > > > > > you Greg for providing the indexing code!).
> > > > > >
> > > > > > The table below lists faceting times in milliseconds. I’ve named
> > > > > > the current implementation serial and my proposal parallel, for
> > > > > > lack of better names. The aggregation function is a no-op, so
> > > > > > we’re only measuring the time spent outside aggregation. The
> > > > > > measurements are over a match-set of 100k docs, but the number of
> > > > > > docs does not have a large impact on the results because the
> > > > > > aggregation function isn’t doing any work.
> > > > > >
> > > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > > > |------------------------|---------------------------|-----------------------------|
> > > > > > | 2                      | 510                       | 328                         |
> > > > > > | 5                      | 1211                      | 775                         |
> > > > > > | 10                     | 2366                      | 1301                        |
> > > > > >
> > > > > > If we use a MAX aggregation over a DocValue instead, the results
> > > > > > tell a similar story. In this case, the number of docs matters.
> > > > > > I've attached results for 10 docs and 100k docs.
> > > > > >
> > > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > > > |------------------------|---------------------------|-----------------------------|
> > > > > > | 2                      | 706                       | 505                         |
> > > > > > | 5                      | 1618                      | 1119                        |
> > > > > > | 10                     | 3152                      | 2018                        |
> > > > > >
> > > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > > > |------------------------|---------------------------|-----------------------------|
> > > > > > | 2                      | 904                       | 655                         |
> > > > > > | 5                      | 2122                      | 1491                        |
> > > > > > | 10                     | 5062                      | 3317                        |
> > > > > >
> > > > > > With 10 aggregations, we're saving a second or more. That is
> > > > > > significant for my use-case.
> > > > > >
> > > > > > I'd like to know if the test and results seem reasonable. If so,
> > > > > > maybe we can think about providing this functionality.
> > > > > >
> > > > > > Thanks,
> > > > > > Stefan
> > > > > >
> > > > > > [1] https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > > > > > [2] https://download.geonames.org/export/dump/allCountries.zip
> > > > > >
> > > > > > On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmil...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Thanks for the follow up Stefan. If you find significant
> > > > > > > overhead associated with the multiple iterations, please keep
> > > > > > > challenging the current approach and suggest improvements. It's
> > > > > > > always good to revisit these things!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -Greg
> > > > > > >
> > > > > > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita
> > > > > > > <stefan.vod...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Greg,
> > > > > > > >
> > > > > > > > To better understand how much work gets duplicated, I went
> > > > > > > > ahead and modified FloatTaxonomyFacets as an example [1]. It
> > > > > > > > doesn't look too pretty, but it illustrates how I think
> > > > > > > > multiple aggregations in one iteration could work.
> > > > > > > >
> > > > > > > > Overall, you're right, there's not as much wasted work as I
> > > > > > > > had expected. I'll try to do a performance comparison to
> > > > > > > > quantify precisely how much time we could save, just in case.
> > > > > > > >
> > > > > > > > Thank you for the help!
> > > > > > > > Stefan
> > > > > > > >
> > > > > > > > [1] https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > > > > > >
> > > > > > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmil...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Stefan-
> > > > > > > > >
> > > > > > > > > > In that case, iterating twice duplicates most of the work,
> > > > > > > > > > correct?
> > > > > > > > >
> > > > > > > > > I'm not sure I'd agree that it duplicates "most" of the work.
> > > > > > > > > This is an association faceting example, which is a little
> > > > > > > > > bit of a special case in some ways. But, to your question,
> > > > > > > > > there is duplicated work here of re-loading the ordinals
> > > > > > > > > across the two aggregations, but I would suspect the more
> > > > > > > > > expensive work is actually computing the different
> > > > > > > > > aggregations, which is not duplicated. You're right that it
> > > > > > > > > would likely be more efficient to iterate the hits once,
> > > > > > > > > loading the ordinals once and computing multiple aggregations
> > > > > > > > > in one pass. There's no facility for doing that currently in
> > > > > > > > > Lucene's faceting module, but you could always propose it! :)
> > > > > > > > > That said, I'm not sure how common of a case this really is
> > > > > > > > > for the majority of users? But that's just a guess/assumption.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > -Greg
> > > > > > > > >
> > > > > > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita
> > > > > > > > > <stefan.vod...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Greg,
> > > > > > > > > >
> > > > > > > > > > I see now where my example didn’t give enough info.
> > > > > > > > > > In my mind, `Genre / Author nationality / Author name` is
> > > > > > > > > > stored in one hierarchical facet field. The data we’re
> > > > > > > > > > aggregating over, like publish date or price, are stored in
> > > > > > > > > > DocValues.
> > > > > > > > > >
> > > > > > > > > > The demo package shows something similar [1], where the
> > > > > > > > > > aggregation is computed across a facet field using data
> > > > > > > > > > from a `popularity` DocValue.
> > > > > > > > > >
> > > > > > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`,
> > > > > > > > > > but what if we want several other different aggregations
> > > > > > > > > > with respect to the same facet field? Maybe we want
> > > > > > > > > > `max(popularity)`. In that case, iterating twice duplicates
> > > > > > > > > > most of the work, correct?
> > > > > > > > > >
> > > > > > > > > > Stefan
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > > > > > >
> > > > > > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller
> > > > > > > > > > <gsmil...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Stefan-
> > > > > > > > > > >
> > > > > > > > > > > That helps, thanks. I'm a bit confused about where you're
> > > > > > > > > > > concerned with iterating over the match set multiple
> > > > > > > > > > > times.
> > > > > > > > > > > Is this a situation where the ordinals you want to facet
> > > > > > > > > > > over are stored in different index fields, so you have to
> > > > > > > > > > > create multiple Facets instances (one per field) to
> > > > > > > > > > > compute the aggregations? If that's the case, then yes,
> > > > > > > > > > > you have to iterate over the match set multiple times
> > > > > > > > > > > (once per field). I'm not sure that's such a big issue
> > > > > > > > > > > given that you're doing novel work during each iteration,
> > > > > > > > > > > so the only repetitive cost is actually iterating the
> > > > > > > > > > > hits. If the ordinals are "packed" into the same field
> > > > > > > > > > > though (which is the default in Lucene if you're using
> > > > > > > > > > > taxonomy faceting), then you should only need to do a
> > > > > > > > > > > single iteration over that field.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > -Greg
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita
> > > > > > > > > > > <stefan.vod...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Greg,
> > > > > > > > > > > >
> > > > > > > > > > > > I’m assuming we have one match-set which was not
> > > > > > > > > > > > constrained by any of the categories we want to
> > > > > > > > > > > > aggregate over, so it may have books by Mark Twain,
> > > > > > > > > > > > books by American authors, and sci-fi books.
> > > > > > > > > > > >
> > > > > > > > > > > > Maybe we can imagine we obtained it by searching for a
> > > > > > > > > > > > keyword, say “Washington”, which is present in Mark
> > > > > > > > > > > > Twain’s writing, and those of other American authors,
> > > > > > > > > > > > and in sci-fi novels too.
> > > > > > > > > > > >
> > > > > > > > > > > > Does that make the example clearer?
> > > > > > > > > > > >
> > > > > > > > > > > > Stefan
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller
> > > > > > > > > > > > <gsmil...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Stefan-
> > > > > > > > > > > > >
> > > > > > > > > > > > > Can you clarify your example a little bit? It sounds
> > > > > > > > > > > > > like you want to facet over three different match
> > > > > > > > > > > > > sets (one constrained by "Mark Twain" as the author,
> > > > > > > > > > > > > one constrained by "American authors" and one
> > > > > > > > > > > > > constrained by the "sci-fi" genre). Is that correct?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > -Greg
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita
> > > > > > > > > > > > > <stefan.vod...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Let’s say I have an index of books, similar to the
> > > > > > > > > > > > > > example in the facet demo [1] with a hierarchical
> > > > > > > > > > > > > > facet field encapsulating `Genre / Author’s
> > > > > > > > > > > > > > nationality / Author’s name`.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I might like to find the latest publish date of a
> > > > > > > > > > > > > > book written by Mark Twain, the sum of the prices
> > > > > > > > > > > > > > of books written by American authors, and the
> > > > > > > > > > > > > > number of sci-fi novels.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > As far as I understand, this would require faceting
> > > > > > > > > > > > > > 3 times over the match-set, one iteration for each
> > > > > > > > > > > > > > aggregation of a different type (max(date),
> > > > > > > > > > > > > > sum(price), count). That seems inefficient if we
> > > > > > > > > > > > > > could instead compute all aggregations in one pass.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is there a way to do that?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Stefan
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > > > > > > > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
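
As an illustration of the single-pass idea discussed in this thread, here is a minimal Java sketch. It does not use Lucene's facets API and is not the code from the linked PRs: plain arrays stand in for the collected hits, the per-document ordinal, and the doc values, and every name in it (`SinglePassFacetSketch`, `Aggregation`, `aggregateOnce`, `combine`) is hypothetical. It also assumes one ordinal per document, which is simpler than real taxonomy faceting. The sketch computes `max(population)` and `sum(elevation)` per ordinal in one iteration over the hits, then combines them as in the `max(population) + sum(elevation)` example.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

/** Illustrative only: arrays stand in for Lucene hits, ordinals, and doc values. */
final class SinglePassFacetSketch {

  /** One aggregation: a start value, an update step, and the per-document input values. */
  static final class Aggregation {
    final double identity;
    final DoubleBinaryOperator step;
    final double[] docValues; // docValues[doc] is the value aggregated for that doc

    Aggregation(double identity, DoubleBinaryOperator step, double[] docValues) {
      this.identity = identity;
      this.step = step;
      this.docValues = docValues;
    }
  }

  /**
   * Visits every hit once and updates all aggregations for the hit's ordinal.
   * Returns result[aggregationIndex][ordinal]; ordinals without hits keep the identity value.
   */
  static double[][] aggregateOnce(
      int[] hits, int[] docToOrdinal, int ordinalCount, Aggregation... aggs) {
    double[][] result = new double[aggs.length][ordinalCount];
    for (int a = 0; a < aggs.length; a++) {
      Arrays.fill(result[a], aggs[a].identity);
    }
    for (int doc : hits) {
      int ord = docToOrdinal[doc]; // looked up once per hit, shared by all aggregations
      for (int a = 0; a < aggs.length; a++) {
        result[a][ord] = aggs[a].step.applyAsDouble(result[a][ord], aggs[a].docValues[doc]);
      }
    }
    return result;
  }

  /** Derives a single value per ordinal from two aggregated values. */
  static double[] combine(double[] left, double[] right, DoubleBinaryOperator expression) {
    double[] out = new double[left.length];
    for (int ord = 0; ord < left.length; ord++) {
      out[ord] = expression.applyAsDouble(left[ord], right[ord]);
    }
    return out;
  }

  public static void main(String[] args) {
    int[] hits = {0, 1, 2, 3};         // matching doc ids (toy data)
    int[] docToOrdinal = {0, 0, 1, 1}; // docs 0,1 -> ordinal 0; docs 2,3 -> ordinal 1
    double[] population = {10, 30, 5, 7};
    double[] elevation = {100, 200, 50, 25};

    double[][] r = aggregateOnce(hits, docToOrdinal, 2,
        new Aggregation(Double.NEGATIVE_INFINITY, Math::max, population), // max(population)
        new Aggregation(0.0, Double::sum, elevation));                    // sum(elevation)

    // max(population) + sum(elevation) per ordinal
    System.out.println(Arrays.toString(combine(r[0], r[1], Double::sum))); // [330.0, 82.0]
  }
}
```

In this sketch the ordinal lookup happens once per hit no matter how many aggregations are registered. A serial version would repeat the outer loop and the lookup once per aggregation, which is the kind of overhead the benchmark tables in the thread measure.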