[
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574980#comment-16574980
]
Nicholas Knize commented on LUCENE-8452:
----------------------------------------
Thanks for putting this together [~ivera]. Its a great start to benchmarking
BKD with data higher than two dimensions. In fact, I don't think we've ever
fully benchmarked past two dimensions, and the current POINTS implementation
supports up to eight. For this particular geo usecase it would be nice to put
together some shape benchmarks with more representative real world data, along
with different relation queries, and add to the [nightly geo
benchmarks|https://home.apache.org/~mikemccand/geobench.html]. A good
representative set I've been looking at for some time is the [osm
data|http://example.com] which contain a fair number of linestrings, closed
linestrings, and polygons with number of vertices much higher than 4.
This first benchmarking step certainly highlights the deficiency of the BKD
split and merge algorithms for high dimension data. Its a side effect of their
naivety; which was chosen simply for ease of implementation. And indexing
{{LatLonShape}} with only point and bounding box data is going to highlight
those deficiencies quite spectacularly - if only due to the dimensional
expansion issue (especially with point shapes). There are a lot of
optimizations in the literature that could (and should) be applied for well
past two dimensions. For example
[x-tree|https://pdfs.semanticscholar.org/d1fd/9d524ca177e16b0909def97025feaeffac62.pdf]
has some nice heuristics for avoiding the sliver problem that occurs much more
frequently in high dimension data. And we could also investigate different
encoding methods for {{LatLonShape}} 's tessellated triangles. For now, it
would be good to capture performance across various datasets and carefully note
in the javadocs what tradeoffs should be considered for certain datasets (i.e.,
don't use {{LatLonShape}} if you're only indexing/searching points). In the
meantime, we can investigate best approaches for strengthening the BKD
split/merge algorithms for the general high dimension use-case (similar to what
was done for the [1d
usecase|https://issues.apache.org/jira/browse/LUCENE-7396]) and possibly
revisit the idea of a [new
codec|https://issues.apache.org/jira/browse/LUCENE-7110] derived from BKD that
is intended for the higher dimension usecase.
> BKD-based shape indexing benchmarks
> -----------------------------------
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/sandbox
> Reporter: Ignacio Vera
> Priority: Major
>
> Initial benchmarking of the new BKD-based shape indexing suggest that
> searches can be somewhat under-performing. I open this ticket to share the
> findings and to open a discussion how to speed up the solution.
>
> The first benchmark is done by using the current benchmark in luceneutils for
> indexing points and search by bounding box. We would expect {{LatLonShape}}
> to be slower that {{LatLonPoint}} but still having a good performance. The
> results of running such benchmark in my computer looks like:
>
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311
>
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>
> A second benchmark has been performed indexing around 10 million 4-side
> polygons and around 3 million points. Searches are performed using bounding
> boxes. The results are compared with spatial trees alternatives. Spatial
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.5238002222031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]