Hi everyone,

Greg and I discussed a bit offline. His assessment was right: I'm not looking to compute multiple values per ordinal as an end in itself. That is only a means to compute a single value which depends on other facet results. This section from the previous email explains it really well:
> For example, if we're using the geonames data you have in your example,
> maybe the value you want to associate with a given path is something like
> `max(population) + sum(elevation)`, where `max(population)` and `sum(elevation)`
> are the result of two independent facet associations. Then, you could combine
> those results through some expression to derive a single value for a given path.

Ideally, I could facet using an expression which binds other aggregations. The user experience might be as simple as defining the expression and making a single faceting call. Has anyone worked on something similar?

Best,
Stefan

On Thu, 23 Feb 2023 at 16:53, Greg Miller <gsmil...@gmail.com> wrote:
>
> Thanks for the detailed benchmarking Stefan! I think you've demonstrated
> here that looping over the collected hits multiple times does in fact add
> meaningful overhead. That's interesting to see!
>
> As for whether or not to add functionality to the facets module that
> supports this, I'm not convinced at this point. I think what you're
> suggesting here (but please correct me if I'm wrong) is supporting
> association faceting where the user wants to compute multiple association
> aggregations for the same dimensions in a single pass. Where I'm struggling
> to connect a real-world use case, though, is what the user is going to
> actually do with those multiple association values. The Facets API today
> (https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java)
> has a pretty firm assumption built in that dimensions/paths have a single
> value associated with them. So building some sort of association faceting
> implementation that exposes more than one value associated with a given
> dimension/path is a significant change to the current model, and I'm not
> sure it supports enough real-world use to warrant the complexity.
>
> OK, now a disclaimer: Stefan and I work together, so I think I have an idea
> of what he's doing here...
> What I think you're actually after here (and the one use case I could
> imagine some other users being interested in) is computing a single value
> for each dimension/path that is actually an expression over _other_
> aggregated values. For example, if we're using the geonames data you have
> in your example, maybe the value you want to associate with a given path is
> something like `max(population) + sum(elevation)`, where `max(population)`
> and `sum(elevation)` are the result of two independent facet associations.
> Then, you could combine those results through some expression to derive a
> single value for a given path. That end result still fits the Facets API
> well, but supporting something like this in Lucene requires a few other
> primitives beyond just the ability to compute multiple associations at the
> same time. Primarily, it needs some version of Expression + Bindings that
> works for dimensions/paths. So I don't think the ability to compute
> multiple associations at once is really the key missing feature here, and I
> don't think it adds significant value on its own to warrant the complexity
> of trying to expose it through the existing Facets API. Of course, there's
> nothing preventing users from building this "multiple association"
> functionality themselves.
>
> That's my take on this, but maybe I'm missing some other use cases that
> could justify adding this capability in a general way? What do you think?
>
> Cheers,
> -Greg
>
> On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:
>
> > After benchmarking my implementation against the existing one, I think
> > there is some meaningful overhead. I built a small driver [1] that runs
> > the two solutions over a geo data [2] index (thank you Greg for providing
> > the indexing code!).
> >
> > The table below lists faceting times in milliseconds. I've named the
> > current implementation serial and my proposal parallel, for lack of
> > better names.
> > The aggregation function is a no-op, so we're only measuring the time
> > spent outside aggregation. The measurements are over a match-set of 100k
> > docs, but the number of docs does not have a large impact on the results
> > because the aggregation function isn't doing any work.
> >
> > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > |------------------------|---------------------------|-----------------------------|
> > | 2                      | 510                       | 328                         |
> > | 5                      | 1211                      | 775                         |
> > | 10                     | 2366                      | 1301                        |
> >
> > If we use a MAX aggregation over a DocValue instead, the results tell a
> > similar story. In this case, the number of docs matters. I've attached
> > results for 10 docs and 100k docs.
> >
> > 10 docs:
> >
> > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > |------------------------|---------------------------|-----------------------------|
> > | 2                      | 706                       | 505                         |
> > | 5                      | 1618                      | 1119                        |
> > | 10                     | 3152                      | 2018                        |
> >
> > 100k docs:
> >
> > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > |------------------------|---------------------------|-----------------------------|
> > | 2                      | 904                       | 655                         |
> > | 5                      | 2122                      | 1491                        |
> > | 10                     | 5062                      | 3317                        |
> >
> > With 10 aggregations, we're saving a second or more. That is significant
> > for my use-case.
> >
> > I'd like to know if the test and results seem reasonable. If so, maybe we
> > can think about providing this functionality.
> >
> > Thanks,
> > Stefan
> >
> > [1] https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > [2] https://download.geonames.org/export/dump/allCountries.zip
> >
> > On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmil...@gmail.com> wrote:
> > >
> > > Thanks for the follow up Stefan.
> > > If you find significant overhead associated with the multiple
> > > iterations, please keep challenging the current approach and suggest
> > > improvements. It's always good to revisit these things!
> > >
> > > Cheers,
> > > -Greg
> > >
> > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <stefan.vod...@gmail.com> wrote:
> > >
> > > > Hi Greg,
> > > >
> > > > To better understand how much work gets duplicated, I went ahead and
> > > > modified FloatTaxonomyFacets as an example [1]. It doesn't look too
> > > > pretty, but it illustrates how I think multiple aggregations in one
> > > > iteration could work.
> > > >
> > > > Overall, you're right, there's not as much wasted work as I had
> > > > expected. I'll try to do a performance comparison to quantify
> > > > precisely how much time we could save, just in case.
> > > >
> > > > Thank you for the help!
> > > > Stefan
> > > >
> > > > [1] https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > >
> > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmil...@gmail.com> wrote:
> > > > >
> > > > > Hi Stefan-
> > > > >
> > > > > > In that case, iterating twice duplicates most of the work, correct?
> > > > >
> > > > > I'm not sure I'd agree that it duplicates "most" of the work. This
> > > > > is an association faceting example, which is a little bit of a
> > > > > special case in some ways. But, to your question, there is
> > > > > duplicated work here of re-loading the ordinals across the two
> > > > > aggregations, but I would suspect the more expensive work is
> > > > > actually computing the different aggregations, which is not
> > > > > duplicated. You're right that it would likely be more efficient to
> > > > > iterate the hits once, loading the ordinals once and computing
> > > > > multiple aggregations in one pass. There's no facility for doing
> > > > > that currently in Lucene's faceting module, but you could always
> > > > > propose it!
> > > > > :) That said, I'm not sure how common of a case this really is
> > > > > for the majority of users? But that's just a guess/assumption.
> > > > >
> > > > > Cheers,
> > > > > -Greg
> > > > >
> > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <stefan.vod...@gmail.com> wrote:
> > > > >
> > > > > > Hi Greg,
> > > > > >
> > > > > > I see now where my example didn't give enough info. In my mind,
> > > > > > `Genre / Author nationality / Author name` is stored in one
> > > > > > hierarchical facet field. The data we're aggregating over, like
> > > > > > publish date or price, are stored in DocValues.
> > > > > >
> > > > > > The demo package shows something similar [1], where the
> > > > > > aggregation is computed across a facet field using data from a
> > > > > > `popularity` DocValue.
> > > > > >
> > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`, but
> > > > > > what if we want several other different aggregations with
> > > > > > respect to the same facet field? Maybe we want
> > > > > > `max(popularity)`. In that case, iterating twice duplicates
> > > > > > most of the work, correct?
> > > > > >
> > > > > > Stefan
> > > > > >
> > > > > > [1] https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > >
> > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <gsmil...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Stefan-
> > > > > > >
> > > > > > > That helps, thanks. I'm a bit confused about where you're
> > > > > > > concerned with iterating over the match set multiple times.
> > > > > > > Is this a situation where the ordinals you want to facet
> > > > > > > over are stored in different index fields, so you have to
> > > > > > > create multiple Facets instances (one per field) to compute
> > > > > > > the aggregations? If that's the case, then yes, you have to
> > > > > > > iterate over the match set multiple times (once per field).
> > > > > > > I'm not sure that's such a big issue given that you're doing
> > > > > > > novel work during each iteration, so the only repetitive
> > > > > > > cost is actually iterating the hits. If the ordinals are
> > > > > > > "packed" into the same field though (which is the default in
> > > > > > > Lucene if you're using taxonomy faceting), then you should
> > > > > > > only need to do a single iteration over that field.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -Greg
> > > > > > >
> > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <stefan.vod...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Greg,
> > > > > > > >
> > > > > > > > I'm assuming we have one match-set which was not
> > > > > > > > constrained by any of the categories we want to aggregate
> > > > > > > > over, so it may have books by Mark Twain, books by
> > > > > > > > American authors, and sci-fi books.
> > > > > > > >
> > > > > > > > Maybe we can imagine we obtained it by searching for a
> > > > > > > > keyword, say "Washington", which is present in Mark
> > > > > > > > Twain's writing, in the writing of other American
> > > > > > > > authors, and in sci-fi novels too.
> > > > > > > >
> > > > > > > > Does that make the example clearer?
> > > > > > > >
> > > > > > > > Stefan
> > > > > > > >
> > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <gsmil...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Stefan-
> > > > > > > > >
> > > > > > > > > Can you clarify your example a little bit? It sounds
> > > > > > > > > like you want to facet over three different match sets
> > > > > > > > > (one constrained by "Mark Twain" as the author, one
> > > > > > > > > constrained by "American authors" and one constrained
> > > > > > > > > by the "sci-fi" genre). Is that correct?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > -Greg
> > > > > > > > >
> > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <stefan.vod...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Let's say I have an index of books, similar to the
> > > > > > > > > > example in the facet demo [1], with a hierarchical
> > > > > > > > > > facet field encapsulating `Genre / Author's
> > > > > > > > > > nationality / Author's name`.
> > > > > > > > > >
> > > > > > > > > > I might like to find the latest publish date of a
> > > > > > > > > > book written by Mark Twain, the sum of the prices of
> > > > > > > > > > books written by American authors, and the number of
> > > > > > > > > > sci-fi novels.
> > > > > > > > > >
> > > > > > > > > > As far as I understand, this would require faceting
> > > > > > > > > > 3 times over the match-set, one iteration for each
> > > > > > > > > > aggregation of a different type (max(date),
> > > > > > > > > > sum(price), count). That seems inefficient if we
> > > > > > > > > > could instead compute all aggregations in one pass.
> > > > > > > > > >
> > > > > > > > > > Is there a way to do that?
> > > > > > > > > >
> > > > > > > > > > Stefan
> > > > > > > > > >
> > > > > > > > > > [1] https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
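To make the idea from the thread concrete, here is a plain-Java sketch of the single-pass approach: both aggregations (`max(population)` and `sum(elevation)`) are updated per hit in one loop over the match set, and the per-path results are then combined through an expression into the single value per path that the Facets API expects. Note this deliberately does not use the Lucene Facets API; the `Hit` record, the sample paths, and the values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class MultiAggregationSketch {

    // Stand-in for one collected hit: its facet path plus the two
    // doc values the aggregations read. Purely illustrative.
    record Hit(String path, long population, long elevation) {}

    static final Hit[] SAMPLE_HITS = {
        new Hit("Europe/France", 2_100_000, 35),
        new Hit("Europe/France",   850_000, 170),
        new Hit("Europe/Spain",  3_200_000, 650),
    };

    // Single pass over the hits: every aggregation is updated per hit,
    // so the match set is iterated once instead of once per aggregation.
    // The final step combines the independent per-path aggregation
    // results into one value: max(population) + sum(elevation).
    static Map<String, Long> facetWithExpression(Hit[] hits) {
        Map<String, Long> maxPopulation = new HashMap<>();
        Map<String, Long> sumElevation = new HashMap<>();
        for (Hit hit : hits) {
            maxPopulation.merge(hit.path(), hit.population(), Math::max);
            sumElevation.merge(hit.path(), hit.elevation(), Long::sum);
        }
        // "Expression" step: derive the single value per path.
        Map<String, Long> combined = new HashMap<>();
        for (Map.Entry<String, Long> e : maxPopulation.entrySet()) {
            combined.put(e.getKey(), e.getValue() + sumElevation.get(e.getKey()));
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Long> result = facetWithExpression(SAMPLE_HITS);
        // Europe/France: max(2100000, 850000) + (35 + 170) = 2100205
        System.out.println(result.get("Europe/France")); // prints 2100205
        System.out.println(result.get("Europe/Spain"));  // prints 3200650
    }
}
```

In real Lucene this would correspond to a Facets implementation that loads each hit's ordinals once and feeds them to several aggregations, plus some Expression + Bindings analogue over dimension/path values, as Greg describes above.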