[ https://issues.apache.org/jira/browse/SOLR-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810203#comment-16810203 ]

jefferyyuan commented on SOLR-12833:
------------------------------------

[~ab]

vinfo.lockForUpdate() uses readLock().lock(), so multiple threads can still 
execute versionAdd and versionDelete simultaneously.

The read-write lock in VersionInfo is there to make sure no updates come in 
while Solr is doing recovery, switching tlogs, etc.
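
For context, here is a minimal sketch of that locking pattern, assuming a 
plain ReentrantReadWriteLock (the class and method bodies below are 
illustrative, not the actual VersionInfo code): updates share the read lock, 
while recovery / tlog switching takes the write lock.

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified illustration only: many update threads can hold the read lock at
// the same time, so versionAdd/versionDelete calls do not serialize each other
// here; recovery and tlog switching take the write lock to block all updates.
public class VersionInfoSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

  // wrapped around each versionAdd / versionDelete
  public void lockForUpdate()   { lock.readLock().lock(); }
  public void unlockForUpdate() { lock.readLock().unlock(); }

  // taken before recovery or a tlog switch; waits for in-flight updates to
  // finish and blocks new ones until unblockUpdates() is called
  public void blockUpdates()    { lock.writeLock().lock(); }
  public void unblockUpdates()  { lock.writeLock().unlock(); }
}
{code}

The waiting described below happens on the per-bucket VersionBucket lock, not 
on this read-write lock.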

 

The problem we are trying to solve here is that when users update documents 
that fall into the same bucket and an update takes a long time, only the first 
update gets processed; all other updates on the same bucket have to wait, the 
waiting threads pile up, and eventually Solr hits OOM or cannot handle other 
requests because all threads are used up.

 

This is even worse when clients retry updates (for example in a cross-DC 
environment, where the consumer re-executes the commands multiple times if 
they fail).

This feature is disabled by default. If a customer hits OOM and finds that a 
lot of threads are waiting for the lock on a VersionBucket, they can enable it 
to make the Solr cluster more stable: fail fast.

We added a test at 
[https://github.com/apache/lucene-solr/pull/463/files#diff-7b816a919f7a0caf8119a684a3e71c84], 
but to make the method testable we need to change the code: the 
tryLockElseThrow method in DistributedUpdateProcessor. We can definitely 
re-add the test.
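
For reference, a rough sketch of what the timed-out acquisition could look 
like, assuming a java.util.concurrent.locks.Lock per version bucket and a 
configured timeout in milliseconds (the names and the exception type here are 
illustrative; the actual patch may differ):

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

// Sketch only: try to get the version bucket lock within the configured
// timeout and fail fast instead of waiting forever.
class TimedBucketLock {
  static void tryLockElseThrow(Lock bucketLock, long timeoutMs) {
    if (timeoutMs < 0) {  // default -1: old behavior, block until acquired
      bucketLock.lock();
      return;
    }
    try {
      if (!bucketLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
        throw new RuntimeException(
            "Unable to get version bucket lock in " + timeoutMs + " ms");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }
}
{code}

Factoring the acquisition out into a small method like this is also what makes 
the timeout path easy to unit-test in isolation.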

 

> Use timed-out lock in DistributedUpdateProcessor
> ------------------------------------------------
>
>                 Key: SOLR-12833
>                 URL: https://issues.apache.org/jira/browse/SOLR-12833
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: update, UpdateRequestProcessors
>    Affects Versions: 7.5, 8.0
>            Reporter: jefferyyuan
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 7.7, 8.0
>
>         Attachments: SOLR-12833-noint.patch, SOLR-12833.patch, 
> SOLR-12833.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> There is a synchronized block that blocks other update requests whose IDs fall 
> into the same hash bucket. An update waits forever until it gets the lock at 
> the synchronized block, which can be a problem in some cases.
>  
> Some add/update requests (for example updates with spatial/shape analysis) 
> may take a long time (30+ seconds or even more), causing the request to time 
> out and fail.
> Clients may retry the same requests multiple times or for several minutes, 
> which makes things worse.
> The server receives all the update requests, but all except one can do 
> nothing and have to wait. This wastes precious memory and CPU resources.
> We have seen cases where 2000+ threads were blocked at the synchronized lock 
> while only a few updates were making progress. Each thread takes 3+ MB of 
> memory, which causes OOM.
> Also, if an update can't get the lock within the expected time range, it's 
> better to fail fast.
>  
> We can add one configuration in solrconfig.xml, 
> updateHandler/versionLock/timeInMill, so users can specify how long they want 
> to wait for the version bucket lock.
> The default value can be -1, so the behavior stays the same: wait forever 
> until the lock is acquired.


