[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-8744:
-----------------------------
    Description: 
SplitShard creates a mutex over the whole collection, but in practice this is a 
big scaling problem.  Multiple split-shard operations could happen at the same 
time, as long as different shards are being split.  In practice, those shards 
often reside on different machines, so there's no I/O bottleneck in those 
cases, just the mutex in the Overseer forcing the operations to run serially.

Given that a single split can take many minutes on a large collection, this is 
a bottleneck at scale.

Here is the proposed new design:

There are various Collection operations performed at the Overseer, and they may 
need exclusive access at various levels. Each operation must therefore define 
the access level at which it requires a lock. The access level is an enum (a 
minimal sketch follows the list below):

CLUSTER(0)
COLLECTION(1)
SHARD(2)
REPLICA(3)
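
A minimal sketch of what that enum could look like in Java (the class and 
member names here are illustrative assumptions, not a committed API):
{code}
// Illustrative access-level enum; a smaller height means a coarser lock.
public enum LockLevel {
  CLUSTER(0),
  COLLECTION(1),
  SHARD(2),
  REPLICA(3);

  public final int height;

  LockLevel(int height) {
    this.height = height;
  }

  // The next finer level in the hierarchy, or null at the leaf.
  public LockLevel child() {
    return this == REPLICA ? null : values()[height + 1];
  }
}
{code}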

The Overseer node maintains a tree of these locks, which can be built lazily as 
and when tasks come up. The lock tree would look as follows:
{code}
Legend: 
C1, C2 -> Collections
S1, S2 -> Shards 
R1,R2,R3,R4 -> Replicas


                 Cluster
                /       \
               /         \         
              C1          C2
             / \         /   \     
            /   \       /     \      
           S1   S2      S1     S2
         R1,R2   R3,R4  R1,R2   R3,R4
{code}

When the Overseer receives a message, it tries to acquire the appropriate lock 
from the tree. For example, if an operation needs a Collection-level lock on 
collection C1, the node C1 and all child nodes of C1 must be free (and, because 
the walk starts at the root, no ancestor of C1 may be locked either). One way 
to model such a tree node is sketched below.
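
A sketch of such a node, with lazily created children and the "node plus entire 
subtree free" check described above (assumed names, not Solr's actual 
implementation; it also assumes acquisition is driven by the single scheduler 
thread, so no extra synchronization is shown):
{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative lock-tree node. Children are created lazily as tasks
// referencing them arrive.
class LockNode {
  private final Map<String, LockNode> children = new HashMap<>();
  private boolean locked = false;

  LockNode child(String name) {
    return children.computeIfAbsent(name, n -> new LockNode());
  }

  // A node can be locked only if it and its whole subtree are free.
  boolean isSubtreeFree() {
    if (locked) return false;
    for (LockNode c : children.values()) {
      if (!c.isSubtreeFree()) return false;
    }
    return true;
  }

  boolean tryLock() {
    if (!isSubtreeFree()) return false;
    locked = true;
    return true;
  }

  void unlock() {
    locked = false;
  }
}
{code}
For a Collection-level operation on C1, the walk from the root would first 
check that the Cluster node itself is not locked, then call {{tryLock()}} on 
the C1 node, which succeeds only if C1, its shards S1 and S2, and all their 
replicas are free.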

h2. Lock acquisition logic

Each operation starts from the root of the tree (level 0 -> Cluster) and walks 
down according to its declared access level. After it reaches the right node, 
it checks that the node and all of its children are free of locks. If it fails 
to acquire the lock, the task remains in the work queue. A scheduler thread 
waits for notification from the currently running tasks: every task does a 
{{notify()}} on the scheduler thread's monitor when it completes. The scheduler 
then starts from the head of the queue and checks each queued task to see 
whether it can now acquire its lock; if it can, the task is executed, and if 
not, it is left in the work queue.

When a new task arrives in the work queue, the scheduler thread also wakes up 
and tries to schedule that task.
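
A rough sketch of that scheduler loop, using a shared monitor for the 
{{notify()}} hand-off (all names here are assumptions for illustration):
{code}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative scheduler: runs every queued task whose lock can be
// acquired; tasks that cannot lock yet stay in the work queue.
class OverseerTaskScheduler {
  private final Object monitor = new Object();
  private final Deque<Task> workQueue = new ArrayDeque<>();
  private final ExecutorService executor = Executors.newCachedThreadPool();
  private volatile boolean closed = false;

  void submit(Task task) {
    synchronized (monitor) {
      workQueue.addLast(task);
      monitor.notify(); // a new arrival wakes the scheduler
    }
  }

  void run() throws InterruptedException {
    while (!closed) {
      synchronized (monitor) {
        // Head-first pass over the queue.
        for (Iterator<Task> it = workQueue.iterator(); it.hasNext(); ) {
          Task task = it.next();
          if (task.tryAcquireLock()) { // walks the lock tree top-down
            it.remove();
            executor.submit(() -> runAndRelease(task));
          } // else: the task stays queued for the next pass
        }
        monitor.wait(); // until a task completes or a new one arrives
      }
    }
  }

  private void runAndRelease(Task task) {
    try {
      task.run();
    } finally {
      task.releaseLock();
      synchronized (monitor) {
        monitor.notify(); // a completed task wakes the scheduler
      }
    }
  }

  interface Task extends Runnable {
    boolean tryAcquireLock();
    void releaseLock();
  }
}
{code}
A single monitor keeps the wake-up logic simple: a completed task and a new 
arrival both trigger one head-first rescan of the queue.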



> SplitShard needs finer-grained mutual exclusion
> -----------------------------------------------
>
>                 Key: SOLR-8744
>                 URL: https://issues.apache.org/jira/browse/SOLR-8744
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>    Affects Versions: 5.4.1
>            Reporter: Scott Blum
>            Assignee: Noble Paul
>              Labels: sharding, solrcloud
>


