[ https://issues.apache.org/jira/browse/SOLR-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104615#comment-17104615 ]
Ilan Ginzburg commented on SOLR-14347:
--------------------------------------

[~ab], there is something I don't understand in the fix to this bug.

In PolicyHelper.getReplicaLocations(), placement computation is done on a new session:

{{Policy.Session session = new Policy.Session(delegatingManager, origSession.policy, origSession.transaction);}}

That session does not contain the previous placement decisions captured in origSession; it gets its state from the cluster state provider (i.e. ZooKeeper). Placement decisions are therefore made on the current ZooKeeper state, not including changes made by commands that haven't finished yet (which I believe is the _raison d'être_ of having Sessions cached and reused).

At the end of the method, we return the original session to the SessionWrapper, not the one in which we computed the new placement decisions (not that it really matters, given we're not using them anyway, if I understood that right).

The above would mean that placement decisions are not visible until they get persisted in ZooKeeper and can be read back from there. This would totally explain why creating collections under heavy load leads to completely unbalanced clusters, and why doing things slowly works well.

[~noble.paul] FYI, this is related to observations made in SOLR-14462.

> Autoscaling placement wrong when concurrent replica placements are calculated
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-14347
>                 URL: https://issues.apache.org/jira/browse/SOLR-14347
>             Project: Solr
>          Issue Type: Bug
>   Security Level: Public (Default Security Level. Issues are Public)
>       Components: AutoScaling
>    Affects Versions: 8.5
>         Reporter: Andrzej Bialecki
>         Assignee: Andrzej Bialecki
>         Priority: Major
>             Fix For: 8.6
>
>      Attachments: SOLR-14347.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Steps to reproduce:
> * create a cluster of a few nodes (tested with 7 nodes)
> * define per-collection policies that distribute replicas exclusively on different nodes per policy
> * concurrently create a few collections, each using a different policy
> * the resulting replica placement will be seriously wrong, causing many policy violations
>
> Running the same scenario but creating the collections sequentially results in no violations.
>
> I suspect this is caused by an incorrect locking level for all collection operations (as defined in {{CollectionParams.CollectionAction}}) that create new replica placements - i.e. CREATE, ADDREPLICA, MOVEREPLICA, DELETENODE, REPLACENODE, SPLITSHARD, RESTORE, REINDEXCOLLECTION. All of these operations use the policy engine to create new replica placements, and as a result they change the cluster state. However, these operations are currently locked (in {{OverseerCollectionMessageHandler.lockTask}}) using {{LockLevel.COLLECTION}}. In practice this means the lock is held only for the particular collection being modified.
>
> A straightforward fix for this issue is to change the locking level to CLUSTER (and I confirm this fixes the scenario described above). However, this effectively serializes all of the collection operations listed above, which will result in a general slow-down of all collection operations.
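To make the visibility problem described in the comment concrete, here is a minimal, self-contained sketch. The classes and fields are hypothetical (not Solr's actual {{PolicyHelper}} / {{Policy.Session}} API): it only illustrates why reusing one session lets each placement see the decisions made before it, while building a fresh session from the persisted state for every request makes all concurrent computations pick from the same stale view.

{code:java}
// Hypothetical sketch, not Solr code: shows how placement decisions computed on a
// fresh per-request view of persisted state are invisible to each other, while a
// reused (cached) session accumulates them.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SessionVisibilitySketch {

    // Persisted state, i.e. what the cluster state provider (ZooKeeper) would return:
    // replica counts per node.
    static Map<String, Integer> persistedReplicasPerNode = new HashMap<>();

    // A "session": a view of the cluster state that records placement decisions locally.
    static class Session {
        final Map<String, Integer> view;
        Session(Map<String, Integer> fromPersistedState) {
            this.view = new HashMap<>(fromPersistedState);
        }
        // Pick the least-loaded node and record the decision in this session only.
        String placeReplica() {
            String best = null;
            for (Map.Entry<String, Integer> e : view.entrySet()) {
                if (best == null || e.getValue() < view.get(best)) best = e.getKey();
            }
            view.merge(best, 1, Integer::sum);
            return best;
        }
    }

    public static void main(String[] args) {
        for (String node : List.of("node1", "node2", "node3")) persistedReplicasPerNode.put(node, 0);

        // Reusing one cached session: each placement sees the previous ones -> balanced spread.
        Session cached = new Session(persistedReplicasPerNode);
        List<String> reused = new ArrayList<>();
        for (int i = 0; i < 6; i++) reused.add(cached.placeReplica());
        System.out.println("reused session : " + reused);

        // A fresh session per request, before anything is persisted: every computation
        // sees the same stale persisted state -> the same node is picked every time.
        List<String> fresh = new ArrayList<>();
        for (int i = 0; i < 6; i++) fresh.add(new Session(persistedReplicasPerNode).placeReplica());
        System.out.println("fresh sessions : " + fresh);
    }
}
{code}

Running the sketch prints a balanced spread for the reused session and the same node repeated for the fresh sessions, which mirrors the unbalanced clusters seen under concurrent collection creation versus the clean result when creating collections slowly.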
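Similarly, a small sketch of the locking trade-off described in the issue (hypothetical names, not {{OverseerCollectionMessageHandler}} itself): a per-collection lock, in the spirit of {{LockLevel.COLLECTION}}, lets placement computations for different collections overlap, which is exactly when they all read the same persisted state, while a single cluster-wide lock in the spirit of {{LockLevel.CLUSTER}} removes the overlap but serializes every such operation.

{code:java}
// Hypothetical sketch, not Solr code: contrasts per-collection locking with a single
// cluster-wide lock for operations that compute replica placements.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class LockLevelSketch {

    // COLLECTION-level analogue: one lock per collection name.
    static final Map<String, ReentrantLock> collectionLocks = new ConcurrentHashMap<>();

    // CLUSTER-level analogue: one lock shared by every collection operation.
    static final ReentrantLock clusterLock = new ReentrantLock();

    static void createWithCollectionLock(String collection) {
        ReentrantLock lock = collectionLocks.computeIfAbsent(collection, c -> new ReentrantLock());
        lock.lock();
        try {
            computePlacements(collection); // operations on different collections can be here at once
        } finally {
            lock.unlock();
        }
    }

    static void createWithClusterLock(String collection) {
        clusterLock.lock();
        try {
            computePlacements(collection); // only one collection operation at a time, cluster-wide
        } finally {
            clusterLock.unlock();
        }
    }

    static void computePlacements(String collection) {
        System.out.println(Thread.currentThread().getName() + " placing replicas for " + collection);
    }

    public static void main(String[] args) {
        // With per-collection locks these two calls can overlap; swapping in
        // createWithClusterLock would force them to run strictly one after the other.
        new Thread(() -> createWithCollectionLock("collA"), "t1").start();
        new Thread(() -> createWithCollectionLock("collB"), "t2").start();
    }
}
{code}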