Hi, I am still learning about the performance implications of Lucene's APIs when aggregating large result sets. It seems that some cases require a deeper understanding of Lucene's internals and the use of not-so-front-facing APIs.
For some time I have been struggling with poor grouping / aggregation performance on the following dataset:

* A sample of 600k locations (points) worldwide, pretty random distribution --> LatLon / Long Term
* A location type (restaurant, cinema, ...) --> String Term
* A few more properties for each location, mostly used for Filter queries --> various terms

Producing frequencies of location types ([restaurant: 23451], [cinema: 853], ...) is pretty fast when using GroupSearch() and TopDocs (around 200ms).

Frequencies of aggregated locations are more tricky. In order to produce the grids, I have tried GroupSearch() with a custom ValueSource that translates the location field into a GeoTile / GeoHash ID, so that GroupSearch can aggregate them to the desired grid level:

[cell=6/8/47, frequency=66],
[cell=6/8/48, frequency=114],
[cell=6/8/49, frequency=120],
[cell=6/8/50, frequency=120],
...

Unfortunately, this aggregation is pretty slow (it takes 4 seconds with 3.8k bins). When profiling, I can see that Lucene spends most of the time in lucene.util.PriorityQueue, so I am looking for ways to speed this up.

From what I have seen in the tests and examples, Lucene's spatial indices (i.e. implementations of SpatialPrefixTree) already use GeoHash and Quadtree encoding / prefix codes. Is there a way to leverage those for my task? Is there related documentation in the Lucene ecosystem that I can study?

I am also interested in learning how to efficiently produce combined aggregations on cell and location type, e.g.:

[cell=6/8/47, type=restaurant, frequency=12],
[cell=6/8/47, type=cinema, frequency=2],
...

Since sorting by two or more dimensions is possible, it should be possible to stream this efficiently out of the indices, provided Lucene offers APIs to do this. Right now, I am resorting to LeafReader, but that is probably the slowest of all options.
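
To make the desired output concrete, here is a minimal plain-Java sketch of the two aggregations, outside of Lucene. It makes several assumptions: the cell IDs follow the common zoom/x/y Web Mercator tile scheme, and the class GridCount and its methods tileKey, countByCell and countByCellAndType are hypothetical names used only for illustration. It counts in a single pass into a hash map, with no sorting or priority queue involved.

```java
import java.util.HashMap;
import java.util.Map;

public class GridCount {

    // Assumption: cells are Web Mercator tiles written as "zoom/x/y".
    static String tileKey(double lat, double lon, int zoom) {
        int n = 1 << zoom; // number of tiles per axis at this zoom level
        int x = (int) Math.floor((lon + 180.0) / 360.0 * n);
        double latRad = Math.toRadians(lat);
        double merc = Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad));
        int y = (int) Math.floor((1.0 - merc / Math.PI) / 2.0 * n);
        // Clamp so that points on the map edge still fall into a valid tile.
        x = Math.max(0, Math.min(n - 1, x));
        y = Math.max(0, Math.min(n - 1, y));
        return zoom + "/" + x + "/" + y;
    }

    // One pass over all points: cell -> frequency.
    static Map<String, Integer> countByCell(double[][] latLon, int zoom) {
        Map<String, Integer> counts = new HashMap<>();
        for (double[] p : latLon) {
            counts.merge(tileKey(p[0], p[1], zoom), 1, Integer::sum);
        }
        return counts;
    }

    // One pass over all points: "cell|type" -> frequency.
    static Map<String, Integer> countByCellAndType(double[][] latLon, String[] types, int zoom) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < latLon.length; i++) {
            String key = tileKey(latLon[i][0], latLon[i][1], zoom) + "|" + types[i];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[][] points = { {0.0, 0.0}, {-0.1, 0.1}, {-0.2, 0.2}, {48.85, 2.35} };
        String[] types = { "restaurant", "restaurant", "cinema", "restaurant" };
        System.out.println(countByCell(points, 6));
        System.out.println(countByCellAndType(points, types, 6));
    }
}
```

What I am looking for is the Lucene equivalent of these loops, i.e. a way to do this kind of counting per matching document without paying for sorting or for priority queue maintenance.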
- Matthias