Re: Richer Aggregations in Lucene

Adrien Grand Tue, 20 Jun 2023 14:03:26 -0700

Hey Shradha,

Such a contribution would be welcome. There is no good reason not to
support richer aggregations in Lucene. One thing that I have found
interesting with faceting/aggregations is that every implementation seems
to make different trade-offs, e.g.
 - Lucene's faceting historically required adding side-car data, but we
seem to want to make it work more and more with regular doc values instead
of the side-car index?
 - Both Lucene's faceting module and Solr (I think) load the set of matches
into a bitset first and then compute facets against this bitset, whereas
Elasticsearch computes aggregations within the collector.
 - Both Elasticsearch and Solr have composable aggregations, e.g. break
down by category, and then within each category by brand, but Lucene's
facets don't support this.
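
To make the "composable" point concrete, here is a minimal sketch of a two-level breakdown (category, then brand within each category) in plain Java. It does not use any Lucene, Solr, or Elasticsearch API: the class `NestedBreakdown`, its `count` method, and the `categories`/`brands` arrays (stand-ins for per-document field values indexed by doc ID) are all hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a nested (category -> brand -> count) aggregation. */
public class NestedBreakdown {

    /**
     * categories[doc] and brands[doc] are assumed stand-ins for per-document
     * field values; matches holds the doc IDs that matched the query.
     */
    public static Map<String, Map<String, Integer>> count(
            String[] categories, String[] brands, int[] matches) {
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        for (int doc : matches) {
            // Outer level: one bucket per category.
            // Inner level: one counter per brand within that category.
            result.computeIfAbsent(categories[doc], k -> new TreeMap<>())
                  .merge(brands[doc], 1, Integer::sum);
        }
        return result;
    }
}
```

The inner map could hold any sub-aggregation instead of a count, which is roughly what composability buys you.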


If you're going to build a new one, I have some suggestions:
 - Let's avoid dependencies on side-car indexes?
 - I don't think we should load matches into an int[] or BitSet. It takes
too much memory. However, it's also true that collecting docs one-by-one
makes some things slower. Maybe we should look into doing
something in-between, like batching computation of aggregations? This could
still allow taking advantage of e.g. vectorization if computing, say, the
average of a field.
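
As a rough illustration of the batching idea, here is a minimal sketch in plain Java. It is not Lucene code: the class `BatchedAverage`, the `collect`/`flush`/`average` methods, the batch size, and the `values` array (a stand-in for a numeric doc-values column indexed by doc ID) are all assumptions made for the example.

```java
/** Hypothetical sketch: buffer matching doc IDs, then aggregate one batch at a time. */
public class BatchedAverage {
    private static final int BATCH_SIZE = 4; // deliberately tiny for illustration

    private final long[] values;                       // assumed per-doc numeric values
    private final int[] docBuffer = new int[BATCH_SIZE];
    private int buffered = 0;
    private long sum = 0;
    private long count = 0;

    public BatchedAverage(long[] values) {
        this.values = values;
    }

    /** Called once per matching doc, the way a collector would be. */
    public void collect(int doc) {
        docBuffer[buffered++] = doc;
        if (buffered == BATCH_SIZE) {
            flush();
        }
    }

    /** Aggregates the buffered docs in one tight loop, then empties the buffer. */
    private void flush() {
        long batchSum = 0;
        for (int i = 0; i < buffered; i++) {
            batchSum += values[docBuffer[i]];
        }
        sum += batchSum;
        count += buffered;
        buffered = 0;
    }

    /** Flushes any remaining docs and returns the average (NaN if nothing matched). */
    public double average() {
        flush();
        return count == 0 ? Double.NaN : (double) sum / count;
    }
}
```

Memory stays bounded by the batch size rather than by the number of matches, while the loop in `flush` is simple enough that the JIT has a chance to optimize it.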


On Fri, Jun 16, 2023 at 4:14 PM Shradha Shankar <[email protected]>
wrote:

> Hi Lucene devs,
>
> I work on product search at Amazon, where we use Lucene faceting
> to compute aggregations. There are a few features I'm missing with
> faceting. For example, faceting will always aggregate all the way up to the
> dimension and it can't compute multiple aggregations in one pass of the
> match-set.
>
> Lucene-based search engines (like Elastic or OpenSearch) have feature-rich
> aggregation engines which allow different collection modes and give the user
> more control over the granularity of the scopes for which aggregations are
> computed.
>
> Are there historical reasons not to have this type of aggregation engine
> directly in Lucene? If it seems like a worthwhile idea to pursue, I've
> experimented a bit with how we could fulfill these needs in Lucene and I can
> open an issue/PR.
>
> Thanks,
> Shradha
>


-- 
Adrien
