[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328028#comment-15328028 ]

ASF GitHub Bot commented on SOLR-8744:
--------------------------------------

Github user dragonsinth closed the pull request at:

    https://github.com/apache/lucene-solr/pull/42

> Overseer operations need more fine grained mutual exclusion
> -----------------------------------------------------------
>
>                 Key: SOLR-8744
>                 URL: https://issues.apache.org/jira/browse/SOLR-8744
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>    Affects Versions: 5.4.1
>            Reporter: Scott Blum
>            Assignee: Noble Paul
>            Priority: Blocker
>              Labels: sharding, solrcloud
>             Fix For: 6.1
>
>         Attachments: SOLR-8744.patch, SOLR-8744.patch, SOLR-8744.patch, SOLR-8744.patch, SOLR-8744.patch, SOLR-8744.patch, SOLR-8744.patch, SOLR-8744.patch, SOLR-8744.patch, SmileyLockTree.java, SmileyLockTree.java
>
> SplitShard creates a mutex over the whole collection, but in practice this is a big scaling problem. Multiple split-shard operations could happen at the same time, as long as different shards are being split. In practice, those shards often reside on different machines, so there is no I/O bottleneck in those cases, just the mutex in the Overseer forcing the operations to be done serially.
> Given that a single split can take many minutes on a large collection, this is a bottleneck at scale.
> Here is the proposed new design:
> There are various collection operations performed at the Overseer. They may need exclusive access at various levels. Each operation must define the access level at which it requires exclusion. The access level is an enum:
> CLUSTER(0)
> COLLECTION(1)
> SHARD(2)
> REPLICA(3)
> The Overseer node maintains a tree of these locks. The lock tree would look as follows. The tree can be created lazily as and when tasks come up.
> {code}
> Legend:
> C1, C2 -> Collections
> S1, S2 -> Shards
> R1, R2, R3, R4 -> Replicas
>
>                  Cluster
>                 /       \
>               C1         C2
>              /  \       /  \
>            S1    S2   S1    S2
>         R1,R2  R3,R4 R1,R2  R3,R4
> {code}
> When the overseer receives a message, it tries to acquire the appropriate lock from the tree. For example, if an operation needs a lock at the Collection level and it needs to operate on collection C1, the node C1 and all child nodes of C1 must be free.
> h2. Lock acquiring logic
> Each operation would start from the root of the tree (Level 0 -> Cluster) and move down depending upon the operation. After it reaches the right node, it checks whether all the children are free of locks. If it fails to acquire a lock, it remains in the work queue. A scheduler thread waits for notification from the current set of tasks. Every task would do a {{notify()}} on the monitor of the scheduler thread. The thread would start from the head of the queue and check all tasks to see whether each is able to acquire the right lock. If yes, it is executed; if not, the task is left in the work queue.
> When a new task arrives in the work queue, the scheduler thread wakes and just tries to schedule that task.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
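The access-level hierarchy and lock-acquisition rule described above can be sketched in Java. This is a hypothetical illustration of the proposed design only; the class and method names below are invented for the example and are not the actual Solr implementation (the attached SmileyLockTree.java is the real prototype).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed hierarchical lock tree.
// All names here are illustrative, not Solr's actual API.
public class LockTreeSketch {

  // Access levels from the proposal: CLUSTER(0) .. REPLICA(3).
  enum Level { CLUSTER, COLLECTION, SHARD, REPLICA }

  static class Node {
    final String name;
    boolean locked;
    final Map<String, Node> children = new HashMap<>();

    Node(String name) { this.name = name; }

    // Children are created lazily, as the design allows.
    Node child(String childName) {
      return children.computeIfAbsent(childName, Node::new);
    }

    // A node is lockable only if neither it nor any descendant is locked.
    boolean subtreeFree() {
      if (locked) return false;
      for (Node c : children.values()) {
        if (!c.subtreeFree()) return false;
      }
      return true;
    }
  }

  private final Node root = new Node("cluster");

  // Walk from the root toward the target node (e.g. "C1", "S1").
  // Fail if any ancestor already holds a lock; then lock the target
  // only if its entire subtree is free.
  synchronized boolean tryLock(String... path) {
    Node cur = root;
    for (String p : path) {
      if (cur.locked) return false;       // an ancestor holds the lock
      cur = cur.child(p);
    }
    if (!cur.subtreeFree()) return false; // target or a descendant is locked
    cur.locked = true;
    return true;
  }

  synchronized void unlock(String... path) {
    Node cur = root;
    for (String p : path) cur = cur.child(p);
    cur.locked = false;
  }

  public static void main(String[] args) {
    LockTreeSketch tree = new LockTreeSketch();
    System.out.println(tree.tryLock("C1", "S1")); // true: shard-level lock on C1/S1
    System.out.println(tree.tryLock("C1", "S2")); // true: sibling shard, no conflict
    System.out.println(tree.tryLock("C1"));       // false: C1's children are locked
    System.out.println(tree.tryLock("C2"));       // true: independent collection
  }
}
```

With this structure, two SplitShard operations on different shards of the same collection can proceed concurrently, while a collection-level operation on C1 must wait until both of C1's shard locks are released.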
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326290#comment-15326290 ]

ASF subversion and git services commented on SOLR-8744:
-------------------------------------------------------

Commit 4726c5b2d2efa9ba160b608d46a977d0a6b83f94 in lucene-solr's branch refs/heads/branch_6_1 from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4726c5b ]

SOLR-8744: Minimize the impact on ZK when there are a lot of blocked tasks
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326247#comment-15326247 ]

ASF subversion and git services commented on SOLR-8744:
-------------------------------------------------------

Commit ff6475c3a7229a27f7e97057fbbfadd2600032dd in lucene-solr's branch refs/heads/branch_6x from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ff6475c ]

SOLR-8744: Minimize the impact on ZK when there are a lot of blocked tasks
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326246#comment-15326246 ]

ASF subversion and git services commented on SOLR-8744:
-------------------------------------------------------

Commit 232b44e283dfd01f9ec01b4e68b09b3755a1b17a in lucene-solr's branch refs/heads/master from [~noble.paul]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=232b44e ]

SOLR-8744: Minimize the impact on ZK when there are a lot of blocked tasks
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325119#comment-15325119 ]

Scott Blum commented on SOLR-8744:
----------------------------------

That said, if you'd feel more comfortable committing your patch as-is, I can toss my tweaks down after.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324928#comment-15324928 ]

Scott Blum commented on SOLR-8744:
----------------------------------

Attached a slightly modified patch file with the changes I suggested. Otherwise LGTM.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324896#comment-15324896 ]

Scott Blum commented on SOLR-8744:
----------------------------------

Also, add blockedTasks to printTrackingMaps()?
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324891#comment-15324891 ]

Scott Blum commented on SOLR-8744:
----------------------------------

LG. I only have one suggestion left, to formulate the "fetch" section like this:

{code}
ArrayList heads = new ArrayList<>(blockedTasks.size() + MAX_PARALLEL_TASKS);
heads.addAll(blockedTasks.values());

// If we have enough items in the blocked tasks already, it makes
// no sense to read more items from the work queue. It makes sense
// to clear out at least a few items in the queue before we read more items.
if (heads.size() < MAX_BLOCKED_TASKS) {
  // Instead of reading MAX_PARALLEL_TASKS items always, we should only fetch as much as we can execute.
  int toFetch = Math.min(MAX_BLOCKED_TASKS - heads.size(), MAX_PARALLEL_TASKS - runningTasks.size());
  List newTasks = workQueue.peekTopN(toFetch, excludedTasks, 2000L);
  log.debug("Got {} tasks from work-queue : [{}]", newTasks.size(), newTasks.toString());
  heads.addAll(newTasks);
} else {
  Thread.sleep(1000);
}

if (isClosed) break;
if (heads.isEmpty()) {
  continue;
}
blockedTasks.clear(); // clear it now; may get refilled below.
{code}

This prevents two problems:

1) The log.debug message "Got {} tasks from work-queue" won't keep reporting blockedTasks as if they were freshly fetched.
2) When the blockedTasks map gets completely full, the Thread.sleep() will prevent a free-spin (the only other real pause in the loop is peekTopN).
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324026#comment-15324026 ]

Noble Paul commented on SOLR-8744:
----------------------------------

That is true, they may not necessarily be blocked. They are essentially items read from the work queue which we don't want to refetch.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324024#comment-15324024 ]

Noble Paul commented on SOLR-8744:
----------------------------------

I will have to beast this test and see if I can reproduce this.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324022#comment-15324022 ]

Noble Paul commented on SOLR-8744:

Yes, but I'm not sure if it's worth it. The other counts and assumptions will still be valid if I do this.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323589#comment-15323589 ]

ASF GitHub Bot commented on SOLR-8744:

Github user dragonsinth commented on the issue:

    https://github.com/apache/lucene-solr/pull/42

    FYI: both of these SHAs passed the test suite.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323478#comment-15323478 ]

Scott Blum commented on SOLR-8744:

See the second commit in https://github.com/apache/lucene-solr/pull/42
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323476#comment-15323476 ]

ASF GitHub Bot commented on SOLR-8744:

GitHub user dragonsinth opened a pull request:

    https://github.com/apache/lucene-solr/pull/42

    SOLR-8744 blockedTasks

    @noblepaul

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fullstorydev/lucene-solr SOLR-8744

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/lucene-solr/pull/42.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #42

commit 668d426633fe5551b24ab38036e14c15e7ed4cdf
Author: Scott Blum
Date: 2016-06-09T21:28:07Z

    WIP: SOLR-8744 blockedTasks

commit 4b2512abcc5e55c653c923eba9762988cf1faae8
Author: Scott Blum
Date: 2016-06-09T22:11:26Z

    Simplifications
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323447#comment-15323447 ]

Scott Blum commented on SOLR-8744:

One other comment:

{code}
// We are breaking out if we have already reached the number of parallel tasks running.
// By doing so we may end up discarding the old list of blocked tasks. But we have
// no means to know if they would still be blocked after some of the items ahead
// were cleared.
if (runningTasks.size() >= MAX_PARALLEL_TASKS) break;
{code}

In this case, can't we just shove all the remaining tasks into blockedTasks? It's a slight redefinition: it would mean either "tasks that are blocked because of locks" or "tasks blocked because too many are running".
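Scott's suggestion above (moving the still-queued tasks into blockedTasks instead of discarding them once the parallel-task cap is hit) can be sketched as follows. This is a hypothetical reduction, not the actual Overseer code: the class, the method, and the string-valued tasks are invented for illustration, and only the names MAX_PARALLEL_TASKS and blockedTasks come from the quoted snippet.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the scheduler-loop change discussed above:
// instead of silently discarding the queue tail when the parallel-task
// cap is hit, move the remaining tasks into blockedTasks so they are
// reconsidered on the next scheduling pass.
public class SchedulerSketch {
    static final int MAX_PARALLEL_TASKS = 2;

    public static Map<String, String> drain(Deque<String> workQueue, int running) {
        Map<String, String> blockedTasks = new LinkedHashMap<>();
        while (!workQueue.isEmpty()) {
            if (running >= MAX_PARALLEL_TASKS) {
                // Redefinition: "blocked" now also means "blocked because
                // too many tasks are running", not only "blocked on a lock".
                while (!workQueue.isEmpty()) {
                    String t = workQueue.poll();
                    blockedTasks.put(t, t);
                }
                break;
            }
            running++;          // pretend we executed the head task
            workQueue.poll();
        }
        return blockedTasks;
    }

    public static void main(String[] args) {
        Deque<String> q = new ArrayDeque<>();
        q.add("a"); q.add("b"); q.add("c"); q.add("d");
        // With a cap of 2, "a" and "b" run; "c" and "d" are kept as blocked.
        System.out.println(drain(q, 0).keySet());
    }
}
```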
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323436#comment-15323436 ]

Scott Blum commented on SOLR-8744:

Actually, I may have a fix. You need a Thread.sleep() in the final loop, or you can burn through all 500 iterations on a fast machine. Here's a patch.
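The fix Scott describes above is the standard throttled-poll pattern: without a sleep, a fast machine can spin through all 500 iterations before the asynchronous work has had a chance to complete. The sketch below is hypothetical (the Check interface and waitFor helper are invented for illustration and are not the actual test code); only the Thread.sleep() and the iteration bound reflect the comment.

```java
// Hypothetical sketch of the test-loop fix described above: poll a
// condition up to maxIters times, sleeping between checks so the loop
// spans real wall-clock time instead of burning iterations instantly.
public class RetryLoopSketch {
    interface Check { boolean done(); }

    public static boolean waitFor(Check check, int maxIters, long sleepMs)
            throws InterruptedException {
        for (int i = 0; i < maxIters; i++) {
            if (check.done()) return true;
            Thread.sleep(sleepMs); // the missing throttle
        }
        return false; // condition never observed within the budget
    }

    public static void main(String[] args) throws InterruptedException {
        long deadline = System.currentTimeMillis() + 50;
        boolean ok = waitFor(() -> System.currentTimeMillis() >= deadline, 500, 10);
        System.out.println(ok); // prints "true"
    }
}
```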
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323393#comment-15323393 ]

Scott Blum commented on SOLR-8744:

I got one test failure patching this into master; MultiThreadedOCPTest fails reliably:

{code}
java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([7F3C69CF3DAFFDD5:F76856159353902D]:0)
	at org.junit.Assert.fail(Assert.java:92)
	at org.junit.Assert.assertTrue(Assert.java:43)
	at org.junit.Assert.assertTrue(Assert.java:54)
	at org.apache.solr.cloud.MultiThreadedOCPTest.testFillWorkQueue(MultiThreadedOCPTest.java:106)
{code}
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323207#comment-15323207 ]

Scott Blum commented on SOLR-8744:

BTW, I landed SOLR-9191 in master and 6x, so you should be good to go on that front.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323205#comment-15323205 ]

Scott Blum commented on SOLR-8744:

Mostly looks good. One completely minor comment:

{code}
+ if (blockedTasks.size() < MAX_BLOCKED_TASKS) {
+   blockedTasks.put(head.getId(), head);
+ }
{code}

Maybe unnecessary? It doesn't actually matter if blockedTasks.size() gets to 1100 items; the check at the top of the loop will keep it from growing endlessly. If you drop these tasks on the floor, you'll just end up re-fetching the bytes from ZK later.
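The trade-off Scott describes can be sketched as follows. This is a hypothetical reduction, not the real Overseer code: only the names MAX_BLOCKED_TASKS and blockedTasks come from the quoted diff, while the class, the offer method, and the string payloads are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the capped blockedTasks map from the quoted diff.
// Dropping an entry when the cap is hit is safe for correctness: the task
// still exists in ZooKeeper and will simply be re-fetched on a later pass;
// the cap only bounds how much is cached in memory.
public class BlockedTasksSketch {
    static final int MAX_BLOCKED_TASKS = 3;

    public static Map<String, String> offer(Map<String, String> blockedTasks,
                                            String id, String task) {
        if (blockedTasks.size() < MAX_BLOCKED_TASKS) {
            blockedTasks.put(id, task); // cache it for the next scheduling pass
        } // else: drop it on the floor and re-fetch the bytes from ZK later
        return blockedTasks;
    }

    public static void main(String[] args) {
        Map<String, String> blocked = new LinkedHashMap<>();
        for (int i = 0; i < 5; i++) {
            offer(blocked, "task" + i, "payload" + i);
        }
        System.out.println(blocked.size()); // capped at 3, not 5
    }
}
```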
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319849#comment-15319849 ] Noble Paul commented on SOLR-8744: -- Sure, Scott. Let's commit that first, and let's make these two a blocker for the 6.1.0 release.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319289#comment-15319289 ] Scott Blum commented on SOLR-8744: -- BTW: mind if I land SOLR-9191 first? I will potentially need to backport that change to a bunch of branches which don't have the LockTree stuff, so it'll be a little cleaner port if I land it first.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319026#comment-15319026 ] Scott Blum commented on SOLR-8744: -- Yeah, 1000 seems totally reasonable. Beyond that it seems kind of pointless; if you have 1000 tasks blocked, chances are that subsequent tasks aren't going to be able to run either.
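The cap being discussed can be sketched as a bounded buffer for blocked tasks. The constant value (1000) comes from the discussion; the class and method names here are hypothetical, not the patch's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a bounded blocked-task buffer: once MAX_BLOCKED_TASKS entries
// are parked, further blocked tasks are dropped on the floor and simply
// re-fetched from ZK on a later pass, so memory stays bounded even when a
// cluster-level task blocks everything behind it.
class BlockedTasks {
    static final int MAX_BLOCKED_TASKS = 1000;
    final Map<String, String> blocked = new LinkedHashMap<>(); // id -> task payload

    // Returns true if the task was parked, false if it was dropped.
    boolean park(String id, String task) {
        if (blocked.size() >= MAX_BLOCKED_TASKS) return false;
        blocked.put(id, task);
        return true;
    }
}
```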
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318967#comment-15318967 ] Noble Paul commented on SOLR-8744: -- bq. I think we need a cap on the total size of blockedTasks; otherwise in the face of a cluster level task, you'll read everything in ZK into memory continually and grow blockedTasks without bound.
It's a tricky situation. It is very cheap to hold a few thousand tasks in memory, while we have only a limited no. of threads to process them. It still makes sense to read newer tasks, hoping that they can be processed without getting blocked. I shall address these issues (as much as possible) and post a fresh patch.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318816#comment-15318816 ] Scott Blum commented on SOLR-8744: -- Some quick feedback:
- I think we need a cap on the total size of blockedTasks; otherwise, in the face of a cluster-level task, you'll continually read everything in ZK into memory and grow blockedTasks without bound.
  1) "while (runningTasks.size() > MAX_PARALLEL_TASKS) { ...wait... }" should account for blockedTasks.size() as well, or else maybe there should be a separate sleep check on blockedTasks being too full.
  2) "workQueue.peekTopN(MAX_PARALLEL_TASKS" could also peek for a smaller number of tasks when blockedTasks is almost full. Something like: workQueue.peekTopN(Math.min(MAX_PARALLEL_TASKS - runningTasks.size(), MAX_BLOCKED_TASKS - blockedTasks.size(), ...
- excludedTasks could be a Sets.union() of the other two if you want to automatically retain things like toString() and size() instead of a Function.
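The sizing suggestion in points 1) and 2) above amounts to never asking ZK for more tasks than there is room to either run or park. A minimal sketch of that computation; the constant values and the class name are hypothetical, not the committed patch:

```java
// Sketch of the suggested peek sizing: bound the number of tasks fetched
// from ZK by both the free executor slots and the free blocked-task slots,
// so neither runningTasks nor blockedTasks can overflow.
class PeekSizing {
    static final int MAX_PARALLEL_TASKS = 100;   // illustrative value
    static final int MAX_BLOCKED_TASKS = 1000;   // illustrative value

    static int tasksToPeek(int runningTasks, int blockedTasks) {
        int runSlots = MAX_PARALLEL_TASKS - runningTasks;   // room to execute now
        int parkSlots = MAX_BLOCKED_TASKS - blockedTasks;   // room to hold if blocked
        return Math.max(0, Math.min(runSlots, parkSlots));
    }
}
```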
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15317982#comment-15317982 ] Noble Paul commented on SOLR-8744: -- We have another problem because of tasks which cannot be run because they can't acquire a lock. Imagine we have around 2000 items (the current limit) in the work queue, and new tasks are added to it. The call to {{peekTopN()}} would return only the same set of 2000 items; newly added items do not get fetched. This leads to two problems:
# New tasks which could be processed do not get a chance
# ZK is unnecessarily hammered with requests to fetch the same set of items again and again
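One fix consistent with the discussion above is to let the peek skip ids that are already known to be blocked, so newly queued tasks can surface. A sketch using an in-memory stand-in for the ZK-backed work queue; the class and the peekTopN signature are hypothetical, not Solr's actual DistributedQueue API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory stand-in for the ZK-backed work queue, illustrating an
// exclusion-aware peek: return the first n items whose ids are NOT in the
// exclusion set, without removing anything from the queue. Excluding the
// already-blocked heads avoids re-fetching the same 2000 items forever.
class WorkQueue {
    final Map<String, String> items = new LinkedHashMap<>(); // id -> payload, insertion order

    List<String> peekTopN(int n, Set<String> exclude) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : items.entrySet()) {
            if (out.size() >= n) break;
            if (!exclude.contains(e.getKey())) out.add(e.getKey());
        }
        return out;
    }
}
```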
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312017#comment-15312017 ] ASF subversion and git services commented on SOLR-8744: --- Commit 734dcb99fcdd557ef046166193ae804824023db3 in lucene-solr's branch refs/heads/branch_6x from [~noble.paul] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=734dcb9 ] SOLR-8744: Overseer operations performed with fine grained mutual exclusion
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312007#comment-15312007 ] ASF subversion and git services commented on SOLR-8744: --- Commit 459a9c77a6a9b807deb98c58225d4d0ec1f75bac in lucene-solr's branch refs/heads/master from [~noble.paul] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=459a9c7 ] SOLR-8744: Overseer operations performed with fine grained mutual exclusion
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311583#comment-15311583 ] Noble Paul commented on SOLR-8744: -- I'm planning to commit this soon.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311044#comment-15311044 ] Scott Blum commented on SOLR-8744: -- Hi, any update on this? It seemed like it was really close.

> Overseer operations need more fine grained mutual exclusion
>
> Key: SOLR-8744
> URL: https://issues.apache.org/jira/browse/SOLR-8744
> Project: Solr
> Issue Type: Improvement
> Components: SolrCloud
> Affects Versions: 5.4.1
> Reporter: Scott Blum
> Assignee: Noble Paul
> Labels: sharding, solrcloud
> Attachments: SOLR-8744.patch, SOLR-8744.patch, SOLR-8744.patch, SOLR-8744.patch, SmileyLockTree.java, SmileyLockTree.java
>
> SplitShard creates a mutex over the whole collection, but, in practice, this is a big scaling problem. Multiple split-shard operations could happen at the same time, as long as different shards are being split. In practice, those shards often reside on different machines, so there's no I/O bottleneck in those cases, just the mutex in Overseer forcing the operations to be done serially.
> Given that a single split can take many minutes on a large collection, this is a bottleneck at scale.
> Here is the proposed new design:
> There are various Collection operations performed at the Overseer. They may need exclusive access at various levels. Each operation must define the access level at which it requires exclusion. The access level is an enum:
> CLUSTER(0)
> COLLECTION(1)
> SHARD(2)
> REPLICA(3)
> The Overseer node maintains a tree of these locks. The lock tree would look as follows; it can be created lazily as and when tasks come up.
> {code}
> Legend:
> C1, C2 -> Collections
> S1, S2 -> Shards
> R1, R2, R3, R4 -> Replicas
>
>              Cluster
>             /       \
>           C1         C2
>          /  \       /  \
>        S1    S2   S1    S2
>      R1,R2 R3,R4 R1,R2 R3,R4
> {code}
> When the overseer receives a message, it tries to acquire the appropriate lock from the tree. For example, if an operation needs a lock at the Collection level and it needs to operate on Collection C1, the node C1 and all child nodes of C1 must be free.
> h2. Lock acquiring logic
> Each operation starts from the root of the tree (Level 0 -> Cluster) and moves down depending on the operation. After it reaches the right node, it checks that all the children are free of locks. If it fails to acquire a lock, it remains in the work queue. A scheduler thread waits for notification from the current set of tasks: every task does a {{notify()}} on the monitor of the scheduler thread. The thread then starts from the head of the queue and checks each task to see whether it can acquire the right lock. If yes, the task is executed; if not, it is left in the work queue.
> When a new task arrives in the work queue, the scheduler thread wakes and just tries to schedule that task.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
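The tree-locking rule described above (a node can be locked only if neither it, any ancestor, nor any descendant is locked, with nodes created lazily) can be sketched roughly as follows. Class and method names are illustrative, not Solr's actual LockTree:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal single-threaded-safe sketch of the proposed lock tree (hypothetical names).
// A path like ["C1"] locks collection C1; ["C1", "S1"] locks shard S1 of C1.
// A node may be locked only if no ancestor, itself, or any descendant holds a lock.
public class LockTreeSketch {
    private static class Node {
        boolean locked;
        final Map<String, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    /** Try to acquire the lock for the given path; returns false on any conflict. */
    public synchronized boolean tryLock(List<String> path) {
        Node node = root;
        if (node.locked) return false;              // cluster-level lock held
        for (String part : path) {
            node = node.children.computeIfAbsent(part, k -> new Node()); // lazy creation
            if (node.locked) return false;          // an ancestor (or the target) is locked
        }
        if (anyDescendantLocked(node)) return false;
        node.locked = true;
        return true;
    }

    /** Release a previously acquired path (precondition: tryLock(path) returned true). */
    public synchronized void unlock(List<String> path) {
        Node node = root;
        for (String part : path) node = node.children.get(part);
        node.locked = false;
    }

    private boolean anyDescendantLocked(Node node) {
        for (Node child : node.children.values()) {
            if (child.locked || anyDescendantLocked(child)) return true;
        }
        return false;
    }
}
```

With this shape, two split-shard operations on different shards of the same collection both succeed, while a collection-level operation is excluded until they finish.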
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300417#comment-15300417 ] Scott Blum commented on SOLR-8744: -- Good stuff, new LockTree looking good. When does TaskBatch.batchId get assigned? It looks like it's always 0 right now.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299479#comment-15299479 ] Noble Paul commented on SOLR-8744: -- bq. One more big question for me: why does the LockTree need a clusterStateProvider?
Exactly. I could create a lock blindly and not bother whether that entity actually exists.
bq. One more comment, instead of making LockTree.LockObject public
True
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298841#comment-15298841 ] Scott Blum commented on SOLR-8744: -- Relatedly, lockTask should probably ensure that the appropriate number of path elements are non-null for the given lock level. E.g. a shard lock level should ensure that message.getStr(ZkStateReader.SHARD_ID_PROP) is non-null.
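The validation Scott suggests could look roughly like this, with hypothetical property keys standing in for the ZkStateReader constants:

```java
import java.util.Map;

// Hypothetical sketch: a lock at level N requires the first N path properties to be
// present in the message. Key names are illustrative, not Solr's actual constants.
public class LockLevelValidator {
    public enum Level { CLUSTER, COLLECTION, SHARD, REPLICA }

    // Property keys in order of tree depth (stand-ins for ZkStateReader constants).
    private static final String[] KEYS = {"collection", "shard", "replica"};

    /** Throws if the message lacks a path element required by the lock level. */
    public static void validate(Level level, Map<String, String> message) {
        for (int i = 0; i < level.ordinal(); i++) {
            if (message.get(KEYS[i]) == null) {
                throw new IllegalArgumentException(
                    level + " lock requires non-null '" + KEYS[i] + "' in message");
            }
        }
    }
}
```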
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298839#comment-15298839 ] Scott Blum commented on SOLR-8744: -- [~noble.paul] One more big question for me: why does the LockTree need a clusterStateProvider? I don't follow the need to call findChildren in order to prepopulate locks at a given level. In many cases, you might need to lock something that doesn't even exist yet. For example, create collection, create shard, or create replica should acquire the lock for the thing they are about to create, even if the thing doesn't exist yet. This code:
{code}
if (child == null) {
  LOG.info("locktree_Unable to get path for " + currentAction + " path " + path);
  return FREELOCK; // no such entity, so no need to lock
}
{code}
It seems to me that this basically isn't true. I don't think the LockTree needs knowledge of the actual elements in the real cluster state tree. I'm also not sure the Level actually matters beyond OCMH.lockTask, where we really just need to translate the lock level into the right-length path. Does that make any sense?
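The "translate the lock level into the right-length path" idea can be sketched as a pure truncation, with no cluster-state lookup involved. Names are illustrative, not Solr's OCMH.lockTask:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: the lock level only determines how much of the (collection, shard, replica)
// path to keep. CLUSTER(0) locks the root; REPLICA(3) keeps all three elements.
// The tree itself never needs to check whether the entity exists.
public class LockPathSketch {
    public enum Level { CLUSTER, COLLECTION, SHARD, REPLICA }

    /** Truncate (collection, shard, replica) to the prefix the level requires. */
    public static List<String> lockPath(Level level, String collection, String shard, String replica) {
        return Arrays.asList(collection, shard, replica).subList(0, level.ordinal());
    }
}
```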
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298822#comment-15298822 ] Scott Blum commented on SOLR-8744: -- Or make "OverseerLock" a top-level interface type in the package.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298821#comment-15298821 ] Scott Blum commented on SOLR-8744: -- [~noblepaul] One more comment: instead of making LockTree.LockObject public, I think you could just have LockObject implement the Lock interface and return the interface. Then you don't have to do the "return lock == null ? null : lock::unlock" weirdness.
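A minimal sketch of the shape Scott suggests here -- returning the java.util.concurrent.locks.Lock interface rather than exposing a concrete inner class or a method reference for unlock. This is illustrative, not the actual patch:

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: the tree hands callers a Lock handle (or null on contention),
// so callers just call unlock() on it -- no "lock == null ? null : lock::unlock" dance.
public class LockHandleSketch {
    private final ReentrantLock inner = new ReentrantLock();

    /** Returns a Lock handle on success, or null if the lock is contended. */
    public Lock tryAcquire() {
        return inner.tryLock() ? inner : null;
    }
}
```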
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298781#comment-15298781 ] David Smiley commented on SOLR-8744: Maybe my ideas to simplify should really be their own issue, because there's apparently more to it than the locking -- it's interrelated with OverseerTaskProcessor itself. I think SmileyLockTree.java could be part of a solution (assuming I'm not missing requirements the OTP has overall), but by itself, given OTP's present design, it is perhaps not enough / not right.
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298756#comment-15298756 ] David Smiley commented on SOLR-8744: bq. (Noble) David, it works. But it is not yet complete. But I miss the point. What are we trying to solve? Is the current implementation buggy? We just need one correct implementation.
I am merely offering an alternative implementation of part of the task that I feel is simpler / more elegant (as I stated). In part I'm doing this because it's fun/interesting :-) Other than the complexity of it and the related code already in OverseerTaskProcessor, I'm not sure what bugs may or may not be in your patch or the existing code.
bq. (Scott) Have you run your impl against the test Noble Paul wrote? Curious if it passes.
No; I'd like to do that. I suggest [~noble.paul] commit to a branch and push, and I'll commit on top (assuming tests pass). Of course it will be reverted if you all don't like it.
bq. (Scott) LayeredLock.tryLock() doesn't really implement markBusy() because it doesn't retain readLocks on the parent chain if you fail to lock the child. To implement markBusy you'd need to leave them locked (and have a way to later completely reset all the read locks).
Thanks for clarifying on IRC why the Overseer would want to use tryLock() instead of lock() -- something that confused me from your question/statement. I _had thought_ OverseerTaskProcessor was going to spawn off a task/thread (more likely an Executor impl) that would simply call lock() in its own thread, waiting as long as needed to acquire the lock and do its task. I am quite ready to admit I don't know what's actually going on here ;-P which is apparently the case. Let me ask you this then... why *doesn't* it work in this manner? It sounds simple enough. I do see an executor (the tpe field), but there's so much surrounding code that it confuses my ability to grok it, and thus why tryLock vs. lock might be wanted.
In such a design (in my simple conceptual model), fairness=true should be set on the ReentrantReadWriteLocks in SmileyLockTree to prevent starvation and get FIFO behavior on the write lock. If it's too complicated to explain, maybe let me have at it on a branch. And again, I may very well not appreciate (yet) why OverseerTaskProcessor, with its proposed LockTree, needs to be so complicated. So please don't take any of my questions or suggestions as a slight.
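The layered read/write locking being discussed can be sketched as follows (in the spirit of the attached SmileyLockTree, not a copy of it): each node holds a fair ReentrantReadWriteLock; an operation read-locks every ancestor and write-locks its target, so a collection-level writer excludes shard-level writers beneath it while sibling shards proceed in parallel. Note that the fairness flag orders blocking lock() calls; tryLock() is documented to barge past the fair queue.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of layered read/write locking on a tree of nodes.
public class LayeredLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair
    private final LayeredLockSketch parent;

    public LayeredLockSketch(LayeredLockSketch parent) { this.parent = parent; }

    /** Read-lock every ancestor, then write-lock this node; roll back fully on failure
     *  (cf. the markBusy caveat: nothing stays held if acquisition fails). */
    public boolean tryLock() {
        if (parent != null && !parent.tryReadChain()) return false;
        if (lock.writeLock().tryLock()) return true;
        if (parent != null) parent.unlockReadChain();
        return false;
    }

    public void unlock() {
        lock.writeLock().unlock();
        if (parent != null) parent.unlockReadChain();
    }

    private boolean tryReadChain() {
        if (parent != null && !parent.tryReadChain()) return false;
        if (lock.readLock().tryLock()) return true;
        if (parent != null) parent.unlockReadChain();
        return false;
    }

    private void unlockReadChain() {
        lock.readLock().unlock();
        if (parent != null) parent.unlockReadChain();
    }
}
```

Because a write lock cannot be taken while readers exist, a collection node read-locked by in-flight shard operations rejects a collection-level writer, which is exactly the exclusion the lock tree needs.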
[jira] [Commented] (SOLR-8744) Overseer operations need more fine grained mutual exclusion
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298486#comment-15298486 ] Noble Paul commented on SOLR-8744: -- David, it works. But it is not yet complete. But I miss the point: what are we trying to solve? Is the current implementation buggy? We just need one correct implementation. I have no reason to reimplement everything unless there is something wrong with the current implementation. I'm mostly done with the current patch and I plan to commit it as soon as I clean up the cruft.
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298482#comment-15298482 ] Scott Blum commented on SOLR-8744:

Have you run your impl against the test Noble Paul wrote? Curious whether it passes. I think I grok it, but I found a few problems on inspection:
- LayeredLock.tryLock() doesn't really implement markBusy(), because it doesn't retain readLocks on the parent chain if you fail to lock the child. To implement markBusy you'd need to leave them locked (and have a way to later completely reset all the read locks).
- I still think implementing the j.u.c. Lock interface is silly here; tryLock() is the only operation that makes sense in context.
- You can move the synchronized(locks) from the private recursive method into the public one and acquire it once.
- You can move the ArrayList copy from the private recursive method into the public one and make a single copy that you later take sublists of.
- Just as a note: you'd need to occasionally throw away the entire tree, as it only grows and never shrinks.
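The first bullet above can be illustrated with a small sketch. This is a hypothetical class (its name and shape are assumptions, not the LayeredLock code under review): tryLock takes read locks on the ancestor chain from the root down, attempts the write lock on the target, and unwinds the read locks on failure; on success the ancestor read locks stay held until unlock(), which is the "retain readLocks on the parent chain" behavior Scott describes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch (names are assumptions): a tree node whose tryLock
// read-locks the ancestor chain and write-locks itself, unwinding on failure.
public class LayeredTryLock {
    final LayeredTryLock parent;
    final ReentrantReadWriteLock rw = new ReentrantReadWriteLock(true); // fair

    public LayeredTryLock(LayeredTryLock parent) { this.parent = parent; }

    private List<LayeredTryLock> ancestorsTopDown() {
        List<LayeredTryLock> chain = new ArrayList<>();
        for (LayeredTryLock n = parent; n != null; n = n.parent) chain.add(n);
        Collections.reverse(chain); // root first, so locking order is top-down
        return chain;
    }

    public boolean tryLock() {
        List<LayeredTryLock> held = new ArrayList<>();
        for (LayeredTryLock n : ancestorsTopDown()) {
            if (!n.rw.readLock().tryLock()) { unwind(held); return false; }
            held.add(n);
        }
        if (!rw.writeLock().tryLock()) { unwind(held); return false; }
        return true; // success: ancestor read locks stay held until unlock()
    }

    public void unlock() {
        rw.writeLock().unlock();
        // release the retained ancestor read locks
        for (LayeredTryLock n = parent; n != null; n = n.parent) {
            n.rw.readLock().unlock();
        }
    }

    private static void unwind(List<LayeredTryLock> held) {
        for (int i = held.size() - 1; i >= 0; i--) {
            held.get(i).rw.readLock().unlock();
        }
    }
}
```

Because a locked shard keeps a read lock on its collection and on the cluster root, a write-lock attempt on either ancestor fails until the shard is released, and the failure path leaks no locks.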
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297771#comment-15297771 ] Noble Paul commented on SOLR-8744:

Thanks [~dsmiley]. The implementation is *wrong*: it is not enough to ensure that your parent is not locked; none of your children may be locked either. For example, {{C1}} cannot be locked if {{C1/S1/R1}} or {{C1/S2/R1}} is locked.

The flat-map-and-substring approach is also expensive in terms of resources: every substring() call creates a new String, and it does not have to be that expensive when you can achieve the same with a map lookup. It may look simple, but I would say it's inefficient and lazy.

Using the Java Lock is contrived because we don't need any of that functionality. We just need mark()/unmark() functionality; there are no real locks required here.
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297713#comment-15297713 ] Noble Paul commented on SOLR-8744:

bq. Node would keep track of its parent only, not children.
I do not know how to traverse down a tree without keeping track of the children.

bq. No Session or LockObject.
The session is created for a purpose: to abstract out the starvation logic.

I'm not sure I understand your design. Maybe an alternate patch with an implementation of LockTree would help.
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297710#comment-15297710 ] Noble Paul commented on SOLR-8744:

bq. In OverseerTaskProcessor, handing off the lock object (up to and including the tpe.execute call) should probably move inside the try block; on successfully executing the task, null out the lock local, and then put an "if lock != null unlock" into a finally block.
The runner is always executed in a different thread, so the finally block will always evaluate {{lock != null}} to true. OverseerTaskProcessor should unlock only in the case of an exception.

bq. Do we need some kind of complete reset in the event stuff really blows up? Should there be a session-level clear that just unlocks everything?
Yes, we need it, but not at the session level. If I clear up everything for each session, it defeats the purpose: a session is valid for only one batch, while the locks should survive across multiple batches. A better solution is to periodically check whether the set of running tasks is empty and, if so, just create a new {{LockTree}}. I was planning to do it anyway.

bq. I think MigrateStateFormat could be collection level rather than cluster level?
I have set the lock levels pessimistically. We should re-evaluate them.
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297100#comment-15297100 ] Scott Blum commented on SOLR-8744:

Digging into the LockTree now to understand it better. Some preliminary notes:
- In OverseerTaskProcessor, handing off the lock object (up to and including the tpe.execute call) should probably move inside the try block; on successfully executing the task, null out the lock local, and then put an "if lock != null unlock" into a finally block.
- Do we need some kind of complete reset in the event stuff really blows up? Should there be a session-level clear that just unlocks everything?
- I think MigrateStateFormat could be collection level rather than cluster level?
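The first note above can be sketched as follows. This is an illustrative model, not the actual OverseerTaskProcessor code; `submit` and its parameters are hypothetical names, and a Semaphore stands in for the tree lock because, unlike a ReentrantLock, it may be released from a thread other than the one that acquired it, which the hand-off requires.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;

// Illustrative model of the suggested hand-off pattern (hypothetical names):
// the caller releases the lock only if the task was never handed to the
// executor; otherwise the runner releases it in its own finally block.
public class LockHandoff {
    static void submit(ExecutorService tpe, Semaphore lockHeld, Runnable task) {
        Semaphore lock = lockHeld; // local to null out on successful hand-off
        try {
            tpe.execute(() -> {
                try {
                    task.run();
                } finally {
                    lockHeld.release(); // the runner releases when it finishes
                }
            });
            lock = null; // hand-off succeeded: the caller must not release
        } finally {
            if (lock != null) lock.release(); // execute() threw before hand-off
        }
    }
}
```

Either path releases the lock exactly once: on success the runner thread does it after the task finishes, and if execute() throws (e.g. RejectedExecutionException), the caller's finally block does.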
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297086#comment-15297086 ] David Smiley commented on SOLR-8744:

The main purpose of my suggestion is simplicity: LockTree (public) and internally LockTree.Node (private) -- that's it; no new abstractions, and it'd have very little code. Node would keep track of its parent only, not children. No Session or LockObject. Returning a JDK Lock helps in that regard (it is an existing abstraction, and if we wanted to, it'd be easy to support all its methods), but it's not critical -- my suggested LockTree.Node could be public and we don't implement Lock, if you feel that's better. I disagree with you on the semantics being different vs. the same, but whatever. It just so happens that my suggestion might yield something general-purpose, but that's not my objective; it's simplicity. Locking on fewer objects is nice too, but perhaps it won't become a real issue with your proposal.

Perhaps I'm not fully appreciating the task at hand -- I admit I glossed over the starvation issue and the code here concerned with marking busy/not busy. I hoped the "fairness" boolean configuration to the JDK's ReentrantReadWriteLock would address that, but maybe not?

bq. (me) Getting the key is a simple matter of stringKey.substring(0, stringKey.lastIndexOf('/'))
Just a minor implementation detail to get a parent key instead of working with a List -- never mind.
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296569#comment-15296569 ] Noble Paul commented on SOLR-8744:

Thanks [~dsmiley] for your comments.

bq. ...in which we only obtain a fixed small number of locks, which is for each parent.
I missed that. Only one operation should be able to lock at a level.

bq. LockTree maintains a cache/map of string keys to weakly reference-able LockNode.
It's an optimisation that could be done. Here we are just talking about a few hundred (maybe a few thousand?) extremely lightweight objects.

bq. Getting the key is a simple matter of stringKey.substring(0, stringKey.lastIndexOf('/'))
I didn't get that one.

bq. LockNode implements the JDK Lock interface
The Java {{Lock}} interface makes me implement too many things; in this design we don't even expose the {{lock()}} method. The semantics are much different here, so implementing/using the Java {{Lock}} interface sets the wrong expectations. {{LockTree}} is not a general-purpose utility; I wrote it as an independent class because it is complex enough functionality that it should be tested independently.
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296426#comment-15296426 ] David Smiley commented on SOLR-8744:

I looked over the patch just a little bit, but the suggestion I'm about to give is mostly based on the issue description and comments. This is an interesting/fun programming problem. The main thing I don't like about the current design (and this is not a blocker; I'm not vetoing) is that we obtain locks recursively on down to every object at lower depths (ultimately to replicas). Instead, I propose we consider an alternative design (to follow) in which we only obtain a fixed small number of locks: one for each parent. This lowers the number of locks and also allows the actual lock nodes to be much more dynamic, being GC'ed when the locks aren't in use.

Proposal:
* LockTree maintains a cache/map of string keys to weakly reference-able LockNodes. That's right; they'll get GC'ed when not in use. The root node will be strongly referenced apart from the map, so it's always there. It's just a string key, with the expectation that we'll use some separator to denote hierarchy. When looking up a lock, we simply synchronize, as obtaining locks will be infrequent. When looking up a Lock, it will recursively look up the parent, creating it first if not there. Getting the parent key is a simple matter of stringKey.substring(0, stringKey.lastIndexOf('/')), assuming all keys start with a leading "/" (thus the root key is the empty string).
* LockNode implements the JDK Lock interface, but only implements {{lock()}} and {{unlock()}}. This makes the API easy for consumers -- it's a known abstraction for code using this utility.
* LockNode's fields consist of a parent LockNode and a JDK ReentrantReadWriteLock. The ReentrantReadWriteLock is instantiated with fairness=true.
* LockNode.lock: call parent.readLock() (see below), then call readWriteLock.writeLock().lock().
* LockNode.unlock: call readWriteLock.writeLock().unlock(), then parent.readUnlock() (see below).
* LockNode.readLock (private method): call readLock on the parent first (this is recursive), and then call readWriteLock.readLock().lock(). Thus we lock from the top down (from the root) in effect.
* LockNode.readUnlock (private method): call readWriteLock.readLock().unlock(), then recursively call parent.readUnlock().

What do you think?
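A minimal sketch of the proposal above, under its stated assumptions (parent-only LockNode, fair ReentrantReadWriteLock, top-down read-locking); field visibility is relaxed here purely so the behavior can be demonstrated. Note one consequence worth checking against the children-free requirement: because a write-locked descendant retains read locks on its whole ancestor chain, a write lock on C1 cannot be acquired while C1/S1/R1 is held.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the LockNode described above (illustrative, not a final design).
public class LockNode {
    final LockNode parent;
    final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock(true); // fairness=true

    public LockNode(LockNode parent) { this.parent = parent; }

    // Write-lock this node after read-locking every ancestor, top-down.
    public void lock() {
        if (parent != null) parent.readLock();
        rwLock.writeLock().lock();
    }

    public void unlock() {
        rwLock.writeLock().unlock();
        if (parent != null) parent.readUnlock();
    }

    private void readLock() {
        if (parent != null) parent.readLock(); // recurse to the root first
        rwLock.readLock().lock();
    }

    private void readUnlock() {
        rwLock.readLock().unlock();
        if (parent != null) parent.readUnlock();
    }
}
```

One caveat of building on ReentrantReadWriteLock: read-to-write upgrade is not supported, so a thread that already holds a read lock on a node (via a descendant's lock()) will never succeed in write-locking that same node; the tryLock paths below return false rather than deadlocking.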
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295648#comment-15295648 ] Noble Paul commented on SOLR-8744: -- Patch with tests. More testing is required, but please review the class {{LockTree}} to see the logic implemented.
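The acquisition rule the issue describes (a node can be locked only if no ancestor on its path is locked and its entire subtree is free, with nodes created lazily) can be sketched roughly as follows. Class and method names here ({{LockTreeSketch}}, {{tryLock}}, {{unlock}}) are illustrative, not the actual API of the patch's {{LockTree}}.

```java
import java.util.*;

// Minimal sketch of the hierarchical lock tree described in the issue.
public class LockTreeSketch {

    static class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean locked;

        // A node is free only if neither it nor any descendant holds a lock.
        boolean subtreeFree() {
            if (locked) return false;
            for (Node child : children.values()) {
                if (!child.subtreeFree()) return false;
            }
            return true;
        }
    }

    static class LockTree {
        final Node root = new Node(); // Level 0: Cluster

        // path examples: ["C1"] for a collection lock, ["C1", "S1"] for a shard lock,
        // an empty list for a cluster-level lock. Returns true if acquired.
        synchronized boolean tryLock(List<String> path) {
            Node node = root;
            for (String part : path) {
                if (node.locked) return false; // an ancestor already holds a lock
                node = node.children.computeIfAbsent(part, k -> new Node()); // lazy creation
            }
            if (!node.subtreeFree()) return false; // the node or a descendant is busy
            node.locked = true;
            return true;
        }

        synchronized void unlock(List<String> path) {
            Node node = root;
            for (String part : path) {
                node = node.children.get(part);
                if (node == null) return; // path was never locked
            }
            node.locked = false;
        }
    }

    public static void main(String[] args) {
        LockTree tree = new LockTree();
        // Two splits on different shards of C1 can run concurrently...
        System.out.println(tree.tryLock(List.of("C1", "S1"))); // true
        System.out.println(tree.tryLock(List.of("C1", "S2"))); // true
        // ...but a collection-level operation on C1 is blocked until both finish.
        System.out.println(tree.tryLock(List.of("C1"))); // false
        tree.unlock(List.of("C1", "S1"));
        tree.unlock(List.of("C1", "S2"));
        System.out.println(tree.tryLock(List.of("C1"))); // true
    }
}
```

This captures why independent shard splits no longer serialize: they lock sibling subtrees, and only an operation whose path overlaps a held lock is forced to wait.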
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281187#comment-15281187 ] Scott Blum commented on SOLR-8744: -- Good plan!
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281175#comment-15281175 ] Noble Paul commented on SOLR-8744: -- Good question. Starvation is indeed a problem, and we should make a best effort to run tasks on a first-come, first-served basis. We can address it by making the scheduling itself smarter, as follows. As the scheduler thread tries to acquire locks, it builds a local tree of its own in which the nodes needed by previously unassignable tasks are marked 'busy'. For example, if task {{T1}} needs a lock on collection {{C1}} and is unable to acquire it, the local tree would look as follows:
{code}
Cluster
  /
 /
C1
{code}
When task {{T2}} is taken up, which requires a lock on {{Cluster -> C1 -> S1}}, it would not even try to acquire a lock from the lock tree, because the path {{Cluster -> C1}} is already marked as 'busy' in the local tree. So {{T2}} would remain in the work queue until {{T1}} is completed.
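One way to sketch that fairness rule: within a single scheduling pass over the work queue, a task that cannot run marks its lock path 'busy' locally, so any later task whose path overlaps it is skipped rather than allowed to overtake. The names here ({{SchedulerSketch}}, {{schedulePass}}, {{Task}}) are hypothetical, and the held locks are modeled as a plain list of paths rather than the full lock tree.

```java
import java.util.*;

// Sketch of the starvation-avoiding scheduling pass from the comment above.
public class SchedulerSketch {

    record Task(String name, List<String> path) {}

    // Two lock paths conflict if one is a prefix of the other (ancestor/descendant).
    static boolean conflicts(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        return a.subList(0, n).equals(b.subList(0, n));
    }

    // held = paths currently locked by running tasks. Returns the names of tasks
    // that may start now, in FIFO order. A blocked task marks its own path busy,
    // so later conflicting tasks stay queued behind it instead of starving it.
    static List<String> schedulePass(List<Task> queue, List<List<String>> held) {
        List<List<String>> busy = new ArrayList<>(held); // local per-pass busy tree
        List<String> runnable = new ArrayList<>();
        for (Task t : queue) {
            boolean blocked = false;
            for (List<String> p : busy) {
                if (conflicts(p, t.path())) { blocked = true; break; }
            }
            busy.add(t.path()); // running or waiting, this path now blocks later tasks
            if (!blocked) runnable.add(t.name());
        }
        return runnable;
    }

    public static void main(String[] args) {
        // T1 wants all of C1; T2 wants C1/S1; T3 wants C2/S1.
        List<Task> queue = List.of(
                new Task("T1", List.of("C1")),
                new Task("T2", List.of("C1", "S1")),
                new Task("T3", List.of("C2", "S1")));
        // A running split holds C1/S2, so T1 cannot lock C1 yet. The local busy
        // marking makes T2 wait behind T1 rather than jump ahead, while the
        // unrelated T3 is free to run.
        System.out.println(schedulePass(queue, List.of(List.of("C1", "S2")))); // [T3]
    }
}
```

The key design point is that the busy markings are local to one pass: they impose ordering among queued tasks without holding any real locks, so a waiting cluster-level task eventually drains the subtree beneath it.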
[ https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280534#comment-15280534 ] Scott Blum commented on SOLR-8744: -- Design looks good to me. One thing to talk through (it may not be a problem per se): I could imagine starvation becoming an issue. If a task is trying to lock the cluster, it may never get to run if smaller tasks keep acquiring pieces of the lock tree. This may not be an issue in practice.