[ 
https://issues.apache.org/jira/browse/SOLR-17373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867017#comment-17867017
 ] 

David Smiley commented on SOLR-17373:
-------------------------------------

Additionally, or as a separate change, I think the computed prefix histogram 
should be filtered so that each prefix has at least one non-deleted doc.  This 
should be fairly cheap and simple, and it addresses the particularly egregious 
scenario we encountered.
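
To make the suggestion concrete, here is a minimal sketch of that filtering 
step.  It is not taken from Solr's code: the class name, the map-based 
histogram (prefix to doc count), and the hasLiveDoc predicate are all 
hypothetical stand-ins for however the real histogram and live-doc check are 
represented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper, not part of Solr: illustrates filtering only.
final class PrefixHistogramFilter {

  /**
   * Returns a copy of the histogram that keeps only prefixes for which
   * hasLiveDoc reports at least one non-deleted doc. Iteration order of
   * the input map is preserved.
   */
  static Map<String, Integer> filterLive(
      Map<String, Integer> histogram, Predicate<String> hasLiveDoc) {
    Map<String, Integer> filtered = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> entry : histogram.entrySet()) {
      // Assumption: the predicate answers "does this prefix still have
      // any doc that is not marked deleted?"
      if (hasLiveDoc.test(entry.getKey())) {
        filtered.put(entry.getKey(), entry.getValue());
      }
    }
    return filtered;
  }
}
```

A prefix whose docs are all marked deleted then never becomes a split 
candidate, no matter how many deleted docs it still counts.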

> Shard splitByPrefix should not do so if it would be too imbalanced/inefficient
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-17373
>                 URL: https://issues.apache.org/jira/browse/SOLR-17373
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: David Smiley
>            Priority: Major
>
> Shard split "splitByPrefix" exists to reduce the number of shards that a 
> typical prefix is in, thus reducing query fanout in distributed search 
> (assuming the route param is used), and it can isolate indexing activity as 
> well.  Sometimes this can result in a very imbalanced (inefficient) shard 
> split that may even quickly lead to another split back-to-back (imagine 
> splitting off less than 1% of docs).  Here we propose that if the split would 
> only split off less than roughly 20% of docs, then it's too inefficient.  
> Instead, split the middle of the largest key prefix.
> Note: it's also been observed that a prefix might be so poorly represented 
> that it's likely those docs are marked deleted as part of a previous shard 
> split (if the "link" split method was used).  Thus this inefficiency can have 
> a cascading effect.
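
The balance check proposed in the description above could be sketched roughly 
as follows.  This is an illustration under stated assumptions, not Solr's 
implementation: the class and method names are hypothetical, the histogram is 
taken to be an array of per-prefix doc counts in hash order, and 0.20 is the 
"20% or so" figure from the description.

```java
// Hypothetical helper, not part of Solr: illustrates the balance check only.
final class SplitBalanceCheck {

  /** Assumed threshold: the "20% of docs or so" from the proposal. */
  static final double MIN_FRACTION = 0.20;

  /**
   * Picks the prefix boundary whose smaller side is largest. Returns an
   * index i meaning "prefixes [0, i) go to the first sub-shard", or -1
   * if even the best boundary would split off less than minFraction of
   * all docs (or no boundary exists).
   */
  static int bestBoundary(long[] counts, double minFraction) {
    long total = 0;
    for (long c : counts) {
      total += c;
    }
    if (total == 0) {
      return -1; // nothing to balance
    }
    long bestSmaller = -1;
    int bestIdx = -1;
    long left = 0;
    for (int i = 1; i < counts.length; i++) {
      left += counts[i - 1];
      long smaller = Math.min(left, total - left);
      if (smaller > bestSmaller) {
        bestSmaller = smaller;
        bestIdx = i;
      }
    }
    if (bestIdx < 0 || bestSmaller < minFraction * total) {
      return -1; // too imbalanced: caller should use the fallback
    }
    return bestIdx;
  }

  /** Index of the prefix with the most docs (the fallback split target). */
  static int largestPrefix(long[] counts) {
    int idx = 0;
    for (int i = 1; i < counts.length; i++) {
      if (counts[i] > counts[idx]) {
        idx = i;
      }
    }
    return idx;
  }
}
```

When bestBoundary returns -1, the caller would split the hash range of the 
prefix found by largestPrefix in the middle instead of splitting on a prefix 
boundary.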



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
