[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575338#comment-16575338 ]
Nicholas Knize commented on LUCENE-8452: ---------------------------------------- bq. the dimensions are not independent. That's what I mean by "...due to the dimensional expansion issue (especially with point shapes)." For point shapes, {{LatLonShape}} does the opposite of Principal Component Analysis. Instead of reducing dimensionality it expands it from two to six and subsequently perfectly correlates dimensions 1, 3, 5 and 2, 4, 6. It is not at all optimal and not at all expected to perform anywhere close to a structure optimized for true two dimension point space. In fact it goes against every concept of a optimal structure. bq. to see how the BKD tree behaves for different dimensions As I mentioned, the split/merge algorithms for BKD are naive for ease of implementation. Given the lack of heuristics along dimensions I'd expect QPS/HPS to drop as more dimensions are added. Add a high level of correlation between dimensions and I'd expect it to get even worse. Certainly, for high dimension indexing, work will need to be done to optimize the BKD structure. Alternatively, we can work on a separate codec (possibly a R*/X/PR-tree hybrid) derived from the current BKD implementation that is designed for higher dimension space - since the kd structure itself was not designed for such applications. And subsequently restrict BKD to two dimensions only. > BKD-based shape indexing benchmarks > ----------------------------------- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox > Reporter: Ignacio Vera > Priority: Major > Attachments: BKDperf.pdf > > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.5238002222031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org