[ https://issues.apache.org/jira/browse/SOLR-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413391#comment-17413391 ]

Michael Gibney commented on SOLR-15045:
---------------------------------------

Thanks for taking a look, [~markrmiller], and sincere apologies for the 
(extremely) delayed response.

Regarding the {{blockAndDoRetries()}}, tbh the "no-brainer" reason I kept it 
was that I wasn't 100% sure what it was doing, and wanted to keep this initial 
PR scoped to illustrating and fixing the immediate "commit latency problem". 
I've checked again, and fwiw I do think that this PR doesn't functionally 
change the cases in which those blocks of code would be called (not that you 
were necessarily saying otherwise).

That said, it looks to me as though the external {{blockAndDoRetries()}} calls 
do serve a purpose: to block on flushing the distributed _commits_ (nothing to 
do with previous updates, which as you say would be flushed by the "internal" 
{{blockAndDoRetries()}} calls). This presumably serves to ensure that 
synchronous "top-level" commits don't return status to the client before the 
commits and associated operations (e.g., warming) have completed on all the 
replicas, which seems like a reasonable requirement given the purpose of a 
synchronous commit. Going with that line of reasoning, this PR essentially just 
defers the {{blockAndDoRetries()}} call until _after_ it issues its local 
blocking commit, allowing all the commits to execute in parallel and checking 
in with the remote requests once it knows the local request is complete.
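
To make the ordering change concrete, here's a minimal sketch of the two 
orderings. This is not the actual Solr code path; {{sendCommitToOtherLeaders()}} 
and {{commitLocally()}} are hypothetical stand-ins for the distributed and local 
commit steps:
{code:java}
import java.util.concurrent.CompletableFuture;

class CommitOrderingSketch {

  // Current (serial) shape: wait for the remote commits before starting the
  // local one, so total latency is roughly remoteTime + localTime.
  void commitSerial() {
    CompletableFuture<Void> remote = sendCommitToOtherLeaders();
    remote.join();   // analogous to the "external" blockAndDoRetries()
    commitLocally();
  }

  // Shape after the PR: kick off the remote commits, run the local commit
  // while they're in flight, then wait -- latency is roughly
  // max(remoteTime, localTime).
  void commitDeferred() {
    CompletableFuture<Void> remote = sendCommitToOtherLeaders();
    commitLocally();
    remote.join();   // the deferred blockAndDoRetries() equivalent
  }

  // Hypothetical placeholders, for illustration only.
  CompletableFuture<Void> sendCommitToOtherLeaders() {
    return CompletableFuture.runAsync(() -> { /* distributed commit */ });
  }

  void commitLocally() { /* local blocking commit */ }
}
{code}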

{quote}I can't think of why you would make either of these blockAndDoRetries 
calls here. It's going to wait for everything outstanding on finish.
{quote}
Then IIUC both "external" {{blockAndDoRetries()}} calls in the current code 
base (the ones I jumped through hoops to preserve/defer assuming they served a 
purpose) are in fact completely redundant and could just be removed. If that's 
the case, then so much the better!

{quote}TOLEADER and FROMLEADER distrib commits
{quote}
It wasn't clear to me, while initially orienting myself to this code, whether 
the class was specific to a SolrCore. In fact that was a source of some 
confusion for me, because in some senses the class seemed to be associated with 
a SolrCore, but it also (right?) distributes commits to other 
shards/nodes/cores. (I was thinking that, since it seems to have at least some 
information about more than one core, it might be distributing TOLEADER 
requests to other replicas, and FROMLEADER requests to replicas for which it is 
itself the leader.) Assuming you still think I'm just misunderstanding that 
situation, it's probably not worth worrying too much about. If indeed TOLEADER 
and FROMLEADER are mutually exclusive here, then that entire last paragraph of 
mine is moot, and so much the better, because it'd be one less case to be 
concerned with.

I'm curious whether you're able to replicate the issue as described by the 
tests in the PR. Whatever else I may be confused about here, I'm as near as I 
can be to 100% certain that synchronous commits currently take twice as long as 
they need to, and that this PR fixes the issue and includes a test that 
demonstrates the problem. If the fix is simpler than what I proposed -- i.e., 
we can get away with simply removing those {{blockAndDoRetries()}} calls -- 
then great!
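
For what it's worth, the doubling is straightforward to observe from the client 
side too. This is not the PR's test, just a rough SolrJ timing sketch; the base 
URL and collection name are placeholders taken from the issue description:
{code:java}
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CommitLatencyCheck {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://solr_254:8389/solr").build()) {
      long start = System.nanoTime();
      // Synchronous commit: waitFlush=true, waitSearcher=true.
      client.commit("my_collection", true, true);
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      // On the unpatched code this comes out at roughly 2x the per-shard
      // commit time, since the local commit only starts after the
      // distributed commits have finished.
      System.out.println("synchronous commit took " + elapsedMs + " ms");
    }
  }
}
{code}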

> 2x latency of synchronous commits due to serial execution on local and 
> distributed leaders
> ------------------------------------------------------------------------------------------
>
>                 Key: SOLR-15045
>                 URL: https://issues.apache.org/jira/browse/SOLR-15045
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 8.5.2
>         Environment: Operating system: Linux (centos 7.7.1908)
>            Reporter: Raj Yadav
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hi All,
> When we issue a commit through a curl command, not all the shards get the 
> `start commit` request at the same time.
> *Solr Setup Details (running in SolrCloud mode):*
>  The collection has 6 shards, and each shard has only one replica (which is
>  also the leader); the replica type is NRT.
>  Each shard is hosted on a separate physical host.
> ZooKeeper => We are using an external ZooKeeper ensemble (a separate 3-node
>  cluster).
> *Shard and Host name*
>  shard1_0=>solr_199
>  shard1_1=>solr_200
>  shard2_0=> solr_254
>  shard2_1=> solr_132
>  shard3_0=>solr_133
>  shard3_1=>solr_198
> *Request rate on the system is currently zero, and only hourly indexing is running on it.*
> We are using a curl command to issue the commit.
> {code}
> curl "http://solr_254:8389/solr/my_collection/update?openSearcher=true&commit=true&wt=json"
> {code}
> (using host solr_254 to issue the commit)
> On using the above command, all the shards start processing the commit (i.e.,
>  they get the `start commit` request) except the one used in the curl command
>  (i.e., shard2_0, which is hosted on solr_254). Individually, each shard takes
>  around 10 to 12 minutes to process a hard commit (most of this time is spent
>  reloading external files).
>  As per the logs, shard2_0 gets the `start commit` request only after about
>  10 minutes. This leads to the following timeout error.
> {code:java}
> 2020-12-06 18:47:47.013 ERROR
> org.apache.solr.client.solrj.SolrServerException: Timeout occured while
> waiting response from server at:
> http://solr_132:9744/solr/my_collection_shard2_1_replica_n21/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fsolr_254%3A9744%2Fsolr%2Fmy_collection_shard2_0_replica_n11%2F
>       at
> org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:407)
>       at
> org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:753)
>       at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:369)
>       at
> org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
>       at
> org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:344)
>       at
> org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:333)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
>       at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
>       at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
>     Caused by: java.util.concurrent.TimeoutException
>       at
> org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:216)
>       at
> org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:398)
>       ... 13 more{code}
> The above timeout error is between solr_254 and solr_132. Similar errors occur
>  between solr_254 and the other 4 shards.
> Since the query load is zero, CPU utilization is mostly around 3%.
>  After issuing the curl commit command, CPU goes up to 14% on all shards except
>  shard2_0 (host: solr_254, the one used in the curl command).
>  And after 10 minutes (i.e., after getting the `start commit` request), CPU on
>  shard2_0 also goes up to 14%.
> As I mentioned earlier, each shard takes around 10-12 minutes to process a
>  commit, and due to the delay in starting the commit process on one shard
>  (shard2_0) our overall commit time is now doubled (approx. 22-24 minutes).
> *We are observing this delay in both hard and soft commits.*
> In our Solr 5.4.0 setup (with a similar configuration), we use a similar curl 
> command to issue commits, and there all the shards get the `start commit` 
> request at the same time, including the one used in the curl command.
>  
> *Impact after deleting external files:*
> To nullify the impact of the external files, I deleted the external files
> from all the shards and issued a commit through the curl command. The commit
> operation completed in 3 seconds. Individual shards took 1.5 seconds to
> complete the commit operation, but there was a delay of around 1.5 seconds
> on the shard whose hostname was used to issue the commit. Hence the overall
> commit time was 3 seconds.
> During this operation, there was no timeout or any other kind of error
> (except the expected `external file not found` error).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
