[ 
https://issues.apache.org/jira/browse/SOLR-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296426#comment-15296426
 ] 

David Smiley commented on SOLR-8744:
------------------------------------

I looked over the patch just a little bit, but the suggestion I'm about to give 
is mostly based on the issue description & comments.  This is an 
interesting/fun programming problem.  The main thing I don't like about the 
current design (and this is not a blocker; I'm not vetoing) is that we obtain 
locks recursively down to every object at lower depths (ultimately to 
replicas).  Instead, I propose we consider an alternative design (to follow) in 
which we only obtain a small, fixed number of locks: one for each ancestor.  
This lowers the number of locks, and also allows the actual lock nodes to be 
much more dynamic, being GC'ed when the locks aren't in use.
Proposal (a rough code sketch follows the bullets):
* LockTree maintains a cache/map of string keys to weakly referenced 
LockNodes.  That's right; they'll get GC'ed when not in use.  The root node 
will be strongly referenced apart from the map so it's always there.  The key 
is just a string, with the expectation that we'll use some separator to denote 
hierarchy.  When looking up a lock, we simply synchronize, as obtaining locks 
will be infrequent.  Looking up a LockNode will recursively look up its 
parent, creating it first if it's not there.  Getting the parent key is a 
simple matter of stringKey.substring(0, stringKey.lastIndexOf('/')), assuming 
all keys start with a leading "/" (thus the root key is the empty string).
* LockNode implements the JDK Lock interface, but only {{lock()}} and 
{{unlock()}} are actually implemented.  This makes the API easy for consumers 
-- it's a known abstraction for code using this utility.
* LockNode's fields consist of a parent LockNode and a JDK 
ReentrantReadWriteLock.  The ReentrantReadWriteLock is instantiated with 
fairness=true.
* LockNode.lock: call parent.readLock() (see below) then call 
readWriteLock.writeLock().lock().
* LockNode.unlock: call readWriteLock.writeLock().unlock(); then call 
parent.readUnlock() (see below).
* LockNode.readLock (private method): call readLock on the parent first (this 
is recursive), and then call readWriteLock.readLock().lock().  Thus, in 
effect, we lock top-down (from the root down).
* LockNode.readUnlock (private method): call readWriteLock.readLock().unlock(), 
then recursively call parent.readUnlock().
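
Roughly, in code (an untested sketch just to make the bullets concrete; the 
WeakReference bookkeeping details and the stubbed-out Lock methods are my 
guesses, not anything from the attached patch):
{code}
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hierarchical lock tree.  Keys look like "/C1/S2/R1"; the root key is "". */
public class LockTree {
  // Weak values so idle LockNodes can be GC'ed; a real impl would also purge
  // stale map entries (e.g. via a ReferenceQueue).  The root is held strongly.
  private final Map<String, WeakReference<LockNode>> nodes = new HashMap<>();
  private final LockNode root = new LockNode(null);

  /** Plain synchronization is fine here; lock lookups are infrequent. */
  public synchronized LockNode getLock(String key) {
    if (key.isEmpty()) {
      return root;
    }
    WeakReference<LockNode> ref = nodes.get(key);
    LockNode node = (ref == null) ? null : ref.get();
    if (node == null) {
      // Parent key is everything before the last '/'; "/C1" -> "" (the root).
      LockNode parent = getLock(key.substring(0, key.lastIndexOf('/')));
      node = new LockNode(parent);
      nodes.put(key, new WeakReference<>(node));
    }
    return node;
  }

  /** Only lock() and unlock() of the JDK Lock contract are implemented. */
  public static class LockNode implements Lock {
    private final LockNode parent;                // null only for the root
    private final ReentrantReadWriteLock rwLock =
        new ReentrantReadWriteLock(true);         // fairness = true

    LockNode(LockNode parent) {
      this.parent = parent;
    }

    @Override
    public void lock() {
      if (parent != null) {
        parent.readLock();       // read-lock every ancestor, root first
      }
      rwLock.writeLock().lock(); // then write-lock this node
    }

    @Override
    public void unlock() {
      rwLock.writeLock().unlock();
      if (parent != null) {
        parent.readUnlock();
      }
    }

    // Recursion bottoms out at the root, so ancestors get locked top-down.
    private void readLock() {
      if (parent != null) {
        parent.readLock();
      }
      rwLock.readLock().lock();
    }

    private void readUnlock() {
      rwLock.readLock().unlock();
      if (parent != null) {
        parent.readUnlock();
      }
    }

    // Not needed for this sketch.
    @Override public void lockInterruptibly() { throw new UnsupportedOperationException(); }
    @Override public boolean tryLock() { throw new UnsupportedOperationException(); }
    @Override public boolean tryLock(long time, TimeUnit unit) { throw new UnsupportedOperationException(); }
    @Override public Condition newCondition() { throw new UnsupportedOperationException(); }
  }
}
{code}
Usage would be something like lockTree.getLock("/C1/S2").lock() for a 
shard-level operation: it read-locks "" (the root) and "/C1" and write-locks 
"/C1/S2", so a concurrent lock() on "/C1/S1" succeeds while a lock() on "/C1" 
blocks until the shard operation finishes.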

What do you think?

> Overseer operations need more fine grained mutual exclusion
> -----------------------------------------------------------
>
>                 Key: SOLR-8744
>                 URL: https://issues.apache.org/jira/browse/SOLR-8744
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>    Affects Versions: 5.4.1
>            Reporter: Scott Blum
>            Assignee: Noble Paul
>              Labels: sharding, solrcloud
>         Attachments: SOLR-8744.patch
>
>
> SplitShard creates a mutex over the whole collection, but, in practice, this 
> is a big scaling problem.  Multiple split shard operations could happen at 
> the same time, as long as different shards are being split.  In practice, 
> those shards often reside on different machines, so there's no I/O bottleneck 
> in those cases, just the mutex in Overseer forcing the operations to be done 
> serially.
> Given that a single split can take many minutes on a large collection, this 
> is a bottleneck at scale.
> Here is the proposed new design:
> There are various Collection operations performed at the Overseer. They may 
> need exclusive access at various levels. Each operation must define the 
> access level at which it requires exclusive access. Access level is an enum:
> CLUSTER(0)
> COLLECTION(1)
> SHARD(2)
> REPLICA(3)
> The Overseer node maintains a tree of these locks. The lock tree would look 
> as follows. The tree can be created lazily as and when tasks come up.
> {code}
> Legend: 
> C1, C2 -> Collections
> S1, S2 -> Shards 
> R1,R2,R3,R4 -> Replicas
>                  Cluster
>                 /       \
>                /         \         
>               C1          C2
>              / \         /   \     
>             /   \       /     \      
>            S1   S2      S1     S2
>         R1,R2   R3,R4   R1,R2   R3,R4
> {code}
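> As a rough illustration only (class and method names below are made up for 
> this sketch, not taken from the attached patch), the levels and the lock 
> path an operation would request could look like this:
> {code}
> // Illustrative sketch: the four access levels and how an operation's lock
> // path through the tree could be derived.
> enum AccessLevel {
>   CLUSTER(0), COLLECTION(1), SHARD(2), REPLICA(3);
> 
>   final int depth;
>   AccessLevel(int depth) { this.depth = depth; }
> 
>   /** e.g. SHARD.lockPath("C1", "S2", null) -> "/C1/S2"; CLUSTER -> "" (root) */
>   String lockPath(String collection, String shard, String replica) {
>     StringBuilder path = new StringBuilder();
>     if (depth >= COLLECTION.depth) path.append('/').append(collection);
>     if (depth >= SHARD.depth)      path.append('/').append(shard);
>     if (depth >= REPLICA.depth)    path.append('/').append(replica);
>     return path.toString();
>   }
> }
> {code}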
> When the overseer receives a message, it tries to acquire the appropriate 
> lock from the tree. For example, if an operation needs a lock at a Collection 
> level and it needs to operate on Collection C1, the node C1 and all child 
> nodes of C1 must be free. 
> h2. Lock acquiring logic
> Each operation would start from the root of the tree (Level 0 -> Cluster) and 
> move down depending upon the operation. After it reaches the right node, it 
> checks whether all the children are free of locks.  If it fails to acquire a 
> lock, it remains in the work queue. A scheduler thread waits for notification 
> from the current set of tasks. Every task would do a {{notify()}} on the 
> monitor of the scheduler thread. The thread would start from the head of the 
> queue and check each task to see whether it is able to acquire the right 
> lock. If yes, it is executed; if not, the task is left in the work queue.  
> When a new task arrives in the work queue, the scheduler thread wakes up and 
> just tries to schedule that task.
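> A minimal sketch of that scheduler loop (illustrative only; the class, the 
> Task interface, and tryAcquireLock() are made-up names, not from the patch):
> {code}
> import java.util.ArrayDeque;
> import java.util.Deque;
> import java.util.Iterator;
> import java.util.concurrent.ExecutorService;
> 
> class OverseerTaskScheduler implements Runnable {
>   interface Task extends Runnable {
>     boolean tryAcquireLock();   // non-blocking attempt on the lock tree
>   }
> 
>   private final Object monitor = new Object();
>   private final Deque<Task> workQueue = new ArrayDeque<>();
>   private final ExecutorService executor;
> 
>   OverseerTaskScheduler(ExecutorService executor) { this.executor = executor; }
> 
>   /** New task arrival wakes the scheduler so it can try to schedule it. */
>   void submit(Task task) {
>     synchronized (monitor) {
>       workQueue.addLast(task);
>       monitor.notify();
>     }
>   }
> 
>   /** Every task calls this when it finishes, so the queue is re-scanned. */
>   void taskCompleted() {
>     synchronized (monitor) {
>       monitor.notify();
>     }
>   }
> 
>   @Override public void run() {
>     synchronized (monitor) {
>       while (!Thread.currentThread().isInterrupted()) {
>         // Walk from the head of the queue; run whatever can get its lock,
>         // leave the rest in place.
>         for (Iterator<Task> it = workQueue.iterator(); it.hasNext(); ) {
>           Task task = it.next();
>           if (task.tryAcquireLock()) {
>             it.remove();
>             executor.execute(task);
>           }
>         }
>         try {
>           monitor.wait();       // releases the monitor until notified
>         } catch (InterruptedException e) {
>           Thread.currentThread().interrupt();
>         }
>       }
>     }
>   }
> }
> {code}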



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
