[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574980#comment-16574980
 ] 

Nicholas Knize commented on LUCENE-8452:
----------------------------------------

Thanks for putting this together [~ivera]. Its a great start to benchmarking 
BKD with data higher than two dimensions. In fact, I don't think we've ever 
fully benchmarked past two dimensions, and the current POINTS implementation 
supports up to eight.  For this particular geo usecase it would be nice to put 
together some shape benchmarks with more representative real world data, along 
with different relation queries, and add to the [nightly geo 
benchmarks|https://home.apache.org/~mikemccand/geobench.html]. A good 
representative set I've been looking at for some time is the [osm 
data|http://example.com] which contain a fair number of linestrings, closed 
linestrings, and polygons with number of vertices much higher than 4.

This first benchmarking step certainly highlights the deficiency of the BKD 
split and merge algorithms for high dimension data. Its a side effect of their 
naivety; which was chosen simply for ease of implementation. And indexing 
{{LatLonShape}} with only point and bounding box data is going to highlight 
those deficiencies quite spectacularly - if only due to the dimensional 
expansion issue (especially with point shapes). There are a lot of 
optimizations in the literature that could (and should) be applied for well 
past two dimensions. For example 
[x-tree|https://pdfs.semanticscholar.org/d1fd/9d524ca177e16b0909def97025feaeffac62.pdf]
 has some nice heuristics for avoiding the sliver problem that occurs much more 
frequently in high dimension data. And we could also investigate different 
encoding methods for {{LatLonShape}} 's tessellated triangles. For now, it 
would be good to capture performance across various datasets and carefully note 
in the javadocs what tradeoffs should be considered for certain datasets (i.e., 
don't use {{LatLonShape}} if you're only indexing/searching points). In the 
meantime, we can investigate best approaches for strengthening the BKD 
split/merge algorithms for the general high dimension use-case (similar to what 
was done for the [1d 
usecase|https://issues.apache.org/jira/browse/LUCENE-7396]) and possibly 
revisit the idea of a [new 
codec|https://issues.apache.org/jira/browse/LUCENE-7110] derived from BKD that 
is intended for the higher dimension usecase.

> BKD-based shape indexing benchmarks
> -----------------------------------
>
>                 Key: LUCENE-8452
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8452
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/sandbox
>            Reporter: Ignacio Vera
>            Priority: Major
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.5238002222031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to