Binning/Grouping large result sets efficiently

Matthias Mueller Tue, 21 Feb 2023 07:46:33 -0800

Hi,

I am still learning about the performance implications of Lucene's APIs
when aggregating large result sets. It seems that some cases require a
deeper understanding of Lucene's internals and the use of
not-so-front-facing APIs.


For some time I have been struggling with poor grouping/aggregation
performance on the following dataset:

* A sample of 600k locations (points) worldwide, pretty random
  distribution --> LatLon / Long Term
* A location type (restaurant, cinema, ...) --> String Term
* A few more properties for each location, mostly used for Filter
  queries --> various terms

Producing frequencies of location types ([restaurant: 23451],
[cinema: 853], ...) is pretty fast when using GroupSearch() and TopDocs
(around 200ms).

Frequencies of aggregated locations are trickier: in order to produce the
grids, I have tried GroupSearch() with a custom ValueSource that
translates the location field into a GeoTile / GeoHash ID, so that
GroupSearch can aggregate the locations to the desired grid level.

[cell=6/8/47, frequency=66], [cell=6/8/48, frequency=114],
[cell=6/8/49, frequency=120], [cell=6/8/50, frequency=120], ...
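
For illustration, the translation from a point to a grid cell could look
roughly like the sketch below. It assumes the cell IDs above follow the
common zoom/x/y Web Mercator tiling scheme, which the post does not
confirm; the class and method names are hypothetical and are not part of
Lucene.

```java
final class GeoTile {
    // Assumption: cells are zoom/x/y tiles in the Web Mercator scheme.
    static String tileId(double lat, double lon, int zoom) {
        int n = 1 << zoom; // number of tiles per axis at this zoom level
        int x = (int) Math.floor((lon + 180.0) / 360.0 * n);
        double latRad = Math.toRadians(lat);
        double merc = Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad));
        int y = (int) Math.floor((1.0 - merc / Math.PI) / 2.0 * n);
        // Clamp so that the poles and lon = 180 stay inside the grid.
        x = Math.max(0, Math.min(n - 1, x));
        y = Math.max(0, Math.min(n - 1, y));
        return zoom + "/" + x + "/" + y;
    }
}
```

With this scheme, GeoTile.tileId(0.0, 0.0, 6) yields "6/32/32".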

Unfortunately, this aggregation is pretty slow (it takes 4 seconds with
3.8k bins). When profiling, I can see that Lucene spends most of the time
in org.apache.lucene.util.PriorityQueue.
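
Since only a frequency per bin is needed, one alternative worth testing
is a plain counting pass that never ranks anything. The sketch below is
library-independent Java, not a Lucene API: it assumes the cell key of
every matching document is already available as a long (for example, read
from a numeric per-document field), and the packing layout is a
hypothetical choice made for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class BinCounter {
    // Hypothetical layout: 5 bits of zoom, 29 bits each for x and y.
    // Assumes zoom < 32 and x, y < 2^29.
    static long pack(int zoom, int x, int y) {
        return ((long) zoom << 58) | ((long) x << 29) | (long) y;
    }

    // One pass over the cell keys of all hits; no sorting, no queues.
    static Map<Long, Integer> count(long[] cellKeys) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long key : cellKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

The cost is one hash update per hit, independent of the number of bins.
Whether this beats the grouping approach on the dataset above would have
to be measured.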

So I am looking for ways to speed this up. From what I have seen in the
tests and examples, Lucene's spatial indices (i.e. implementations of
SpatialPrefixTree) already use GeoHash and Quadtree encoding / prefix
codes. Is there a way to leverage those for my task?
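
To show why prefix codes are attractive for this kind of binning, here is
a minimal quadtree key sketch in which every coarser cell is a string
prefix of its finer cells. This is the widely used "quadkey" scheme,
given only as an illustration; it is not claimed to be the exact encoding
that Lucene's SpatialPrefixTree implementations use, and the class name
is hypothetical.

```java
final class QuadKey {
    // Interleave the bits of x and y, most significant first,
    // producing one base-4 digit per zoom level.
    static String of(int x, int y, int zoom) {
        StringBuilder sb = new StringBuilder(zoom);
        for (int i = zoom; i > 0; i--) {
            int mask = 1 << (i - 1);
            int digit = 0;
            if ((x & mask) != 0) digit += 1;
            if ((y & mask) != 0) digit += 2;
            sb.append((char) ('0' + digit));
        }
        return sb.toString();
    }

    // Moving up the grid is just dropping trailing digits.
    static String parent(String quadKey, int levels) {
        return quadKey.substring(0, quadKey.length() - levels);
    }
}
```

With such keys, aggregating to a coarser grid level amounts to counting
by a shorter prefix of the key.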

Is there related documentation in the Lucene ecosystem that I can study?


I am also interested in learning how to efficiently produce combined
aggregations on cell and location type, e.g.:
[cell=6/8/47, type=restaurant, frequency=12],
[cell=6/8/47, type=cinema, frequency=2], ...

Since sorting by two or more dimensions is possible, it should be
possible to stream this efficiently out of the indices, provided Lucene
offers APIs to do this. Right now, I am resorting to LeafReader, but that
is probably the slowest of all options.
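
The streaming idea can be sketched independently of Lucene: if the
(cell, type) pairs arrive sorted by cell and then by type, the combined
frequencies fall out of a single pass that counts runs of equal pairs,
with no hash map or queue. The code below assumes the pairs are already
sorted and represents the type as a numeric ordinal; both are assumptions
made for this example, and how to obtain hits in that order from an index
is left open.

```java
import java.util.ArrayList;
import java.util.List;

final class SortedRunCounter {
    // Each input row is {cellKey, typeOrdinal}, sorted by cell, then type.
    // Each output row is {cellKey, typeOrdinal, frequency}.
    static List<long[]> countRuns(long[][] sortedRows) {
        List<long[]> out = new ArrayList<>();
        int i = 0;
        while (i < sortedRows.length) {
            long cell = sortedRows[i][0];
            long type = sortedRows[i][1];
            int j = i;
            // Advance to the end of the run of identical (cell, type) pairs.
            while (j < sortedRows.length
                    && sortedRows[j][0] == cell
                    && sortedRows[j][1] == type) {
                j++;
            }
            out.add(new long[] {cell, type, j - i});
            i = j;
        }
        return out;
    }
}
```

Memory use stays constant apart from the output, because only the current
run has to be tracked.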



- Matthias
