[
https://issues.apache.org/jira/browse/SOLR-17373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867017#comment-17867017
]
David Smiley commented on SOLR-17373:
-------------------------------------
Also, or separately: I think the computed prefix histogram should be filtered
to ensure that each prefix has at least one non-deleted doc. This should be
fairly cheap and simple, and it addresses the particularly egregious scenario
we encountered.
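
For illustration, a minimal sketch of such a filter, checking per segment
whether any indexed term under a route prefix still has a live (non-deleted)
doc. This is not Solr's actual SplitByPrefixUtil code; the class and method
names are hypothetical, and it assumes the route prefix is a byte prefix of
the terms in the uniqueKey field.

{code:java}
// Hedged sketch only: PrefixLiveDocCheck / prefixHasLiveDoc are hypothetical
// names, not part of Solr. Tests whether any term starting with 'prefix'
// still has at least one non-deleted doc in the given segment.
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.StringHelper;

public final class PrefixLiveDocCheck {

  /** True if some term starting with {@code prefix} has at least one live doc. */
  static boolean prefixHasLiveDoc(LeafReader reader, String field, BytesRef prefix)
      throws IOException {
    Terms terms = reader.terms(field);
    if (terms == null) {
      return false;
    }
    Bits liveDocs = reader.getLiveDocs(); // null means this segment has no deletions
    TermsEnum te = terms.iterator();
    if (te.seekCeil(prefix) == TermsEnum.SeekStatus.END) {
      return false;
    }
    PostingsEnum postings = null;
    do {
      if (!StringHelper.startsWith(te.term(), prefix)) {
        break; // walked past the prefix range
      }
      if (liveDocs == null) {
        return true; // the term exists and nothing in this segment is deleted
      }
      postings = te.postings(postings, PostingsEnum.NONE);
      for (int doc = postings.nextDoc();
          doc != DocIdSetIterator.NO_MORE_DOCS;
          doc = postings.nextDoc()) {
        if (liveDocs.get(doc)) {
          return true; // found a surviving doc for this prefix
        }
      }
    } while (te.next() != null);
    return false; // every doc under this prefix is deleted (or absent)
  }
}
{code}

A prefix for which this returns false in every segment would then be dropped
from the histogram before the split point is chosen.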
> Shard splitByPrefix should not do so if it would be too imbalanced/inefficient
> ------------------------------------------------------------------------------
>
> Key: SOLR-17373
> URL: https://issues.apache.org/jira/browse/SOLR-17373
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Reporter: David Smiley
> Priority: Major
>
> Shard split "splitByPrefix" exists to reduce the number of shards that a
> typical prefix spans, thus reducing query fanout in distributed search
> (assuming the route param is used), and it can isolate indexing activity as
> well. Sometimes this can result in a very imbalanced (inefficient) shard
> split that may even quickly lead to another split back-to-back! (imagine
> splitting off less than 1% of the docs). Here we propose that if the split
> would only split off < 20% of docs or so, then it's too inefficient.
> Instead, split at the middle of the largest key prefix.
> Note: it's also been observed that a prefix might be represented by so few
> docs that they are likely all marked deleted as part of a previous shard
> split (if the "link" split method was used). Thus this inefficiency can have
> a cascading badness effect.
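
As for the 20% rule proposed in the description above, a minimal sketch,
assuming a histogram of (prefix, docCount) buckets already sorted in
hash/route order; the class name, the 0.20 constant, and the -1 fallback
convention are illustrative, not the actual proposed API.

{code:java}
// Illustrative only: not Solr's split code. Given a prefix histogram, decide
// whether a prefix-based split is balanced enough to be worthwhile.
import java.util.List;

public final class SplitByPrefixSketch {

  record PrefixCount(String prefix, long docCount) {}

  /** Minimum fraction of docs the smaller half must receive for a prefix split to be used. */
  static final double MIN_SMALLER_FRACTION = 0.20;

  /**
   * Returns the bucket index (exclusive) at which to split the histogram into two
   * shards, or -1 if even the best prefix boundary would give the smaller half
   * less than 20% of the docs.
   */
  static int chooseSplitPoint(List<PrefixCount> histogram) {
    long total = histogram.stream().mapToLong(PrefixCount::docCount).sum();
    long running = 0;
    int bestIdx = -1;
    long bestSmallerHalf = -1;
    for (int i = 0; i < histogram.size() - 1; i++) { // split between bucket i and i+1
      running += histogram.get(i).docCount();
      long smallerHalf = Math.min(running, total - running);
      if (smallerHalf > bestSmallerHalf) {
        bestSmallerHalf = smallerHalf;
        bestIdx = i + 1;
      }
    }
    // Reject the prefix split entirely if even the best boundary is too lopsided.
    if (total == 0 || (double) bestSmallerHalf / total < MIN_SMALLER_FRACTION) {
      return -1;
    }
    return bestIdx;
  }
}
{code}

A -1 result would mean "don't split by prefix here"; the caller would instead
split the largest prefix down the middle of its hash range, per the proposal.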