[
https://issues.apache.org/jira/browse/LUCENE-8928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936519#comment-16936519
]
Ignacio Vera commented on LUCENE-8928:
--------------------------------------
I have played a bit more with this idea and wondered whether we need to compute
exact bounds for every split. I modified [~jpountz]'s patch so that, instead of
computing the bounds for every split, it computes them every N splits. This is
controlled by a static property called {{SPLITS_BEFORE_EXACT_BOUNDS}}.
The patch can be found here:
https://github.com/iverase/lucene-solr/commit/e63f8c73a86c46ec406143fcd0cb31a8371dfe63
My tests show that setting this value to 4 (compute exact bounds every 4 splits)
reduces the indexing overhead to around 10% and keeps almost the same
performance as the previous approach. Maybe we can find a better heuristic to
set this value.
In addition, this patch does not apply when the number of dimensions is <= 2;
in that case the split algorithm is reverted to the original one.
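To make the idea concrete, here is a minimal, self-contained sketch of
"recompute exact bounds every N splits". It is not the code from the patch:
the class {{PeriodicBoundsSketch}}, its methods, the int[][] point layout and
the median split are all assumptions made for illustration. Only the constant
name {{SPLITS_BEFORE_EXACT_BOUNDS}}, the value 4, and the "more than 2
dimensions" condition come from the description above.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative only: a toy recursive splitter, not Lucene's BKDWriter.
class PeriodicBoundsSketch {
    // Name taken from the comment above; 4 is the value reported in the tests.
    static final int SPLITS_BEFORE_EXACT_BOUNDS = 4;

    // Counts how often the (expensive) exact bounds pass runs.
    static int exactBoundsComputations;

    /** Recomputes the true per-dimension min and max of points[from, to). */
    static void computeExactBounds(int[][] points, int from, int to, int[] min, int[] max) {
        exactBoundsComputations++;
        Arrays.fill(min, Integer.MAX_VALUE);
        Arrays.fill(max, Integer.MIN_VALUE);
        for (int i = from; i < to; i++) {
            for (int d = 0; d < min.length; d++) {
                min[d] = Math.min(min[d], points[i][d]);
                max[d] = Math.max(max[d], points[i][d]);
            }
        }
    }

    /** Picks the dimension with the widest range according to the current bounds. */
    static int widestDim(int[] min, int[] max) {
        int best = 0;
        for (int d = 1; d < min.length; d++) {
            if ((long) max[d] - min[d] > (long) max[best] - min[best]) {
                best = d;
            }
        }
        return best;
    }

    static void build(int[][] points, int from, int to, int[] min, int[] max,
                      int splitsSinceExact, int leafSize) {
        if (to - from <= leafSize) {
            return; // leaf: nothing left to split
        }
        // With <= 2 dimensions we keep the original behaviour and never refresh.
        if (min.length > 2 && splitsSinceExact >= SPLITS_BEFORE_EXACT_BOUNDS) {
            computeExactBounds(points, from, to, min, max);
            splitsSinceExact = 0;
        }
        final int dim = widestDim(min, max);
        Arrays.sort(points, from, to, Comparator.comparingInt(p -> p[dim]));
        int mid = (from + to) >>> 1;
        int splitValue = points[mid][dim];
        // Between refreshes, children inherit the parent's bounds and only the
        // split dimension is tightened, so the other dimensions may be stale.
        int[] leftMax = max.clone();
        leftMax[dim] = splitValue;
        int[] rightMin = min.clone();
        rightMin[dim] = splitValue;
        build(points, from, mid, min.clone(), leftMax, splitsSinceExact + 1, leafSize);
        build(points, mid, to, rightMin, max.clone(), splitsSinceExact + 1, leafSize);
    }

    /** Builds the toy tree and returns how many exact bounds passes were needed. */
    static int buildTree(int[][] points, int leafSize) {
        exactBoundsComputations = 0;
        int numDims = points[0].length;
        int[] min = new int[numDims];
        int[] max = new int[numDims];
        computeExactBounds(points, 0, points.length, min, max); // root is always exact
        build(points, 0, points.length, min, max, 0, leafSize);
        return exactBoundsComputations;
    }
}
```

Calling {{PeriodicBoundsSketch.buildTree(points, leafSize)}} on the same data
with different values of the constant gives a rough feel for the trade-off:
a larger N means fewer full passes over the points, at the price of choosing
the split dimension from bounds that may be up to N levels stale.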
> BKDWriter could make splitting decisions based on the actual range of values
> ----------------------------------------------------------------------------
>
> Key: LUCENE-8928
> URL: https://issues.apache.org/jira/browse/LUCENE-8928
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
>
> Currently BKDWriter assumes that splitting on one dimension has no effect on
> values in other dimensions. While this may be ok for geo points, this is
> usually not true for ranges (or geo shapes, which are ranges too). Maybe we
> could get better indexing by re-computing the range of values on each
> dimension before making the choice of the split dimension?