[jira] [Updated] (SOLR-3180) ChaosMonkey test failures

Yonik Seeley (JIRA) Fri, 04 Jan 2013 08:06:15 -0800

     [ 
https://issues.apache.org/jira/browse/SOLR-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yonik Seeley updated SOLR-3180:
-------------------------------

    Attachment: fail.130103_193722.txt

Here's an analyzed log that I traced all the way to the end.
The issues involved are all timeout related (socket timeouts).
Timing out an update request in general is bad, since the request itself 
normally continues on and can finish at some point in the future.
We should strive to only time out requests that are truly / hopelessly hung.


There was a lot of timeout / retry activity that could cause problems for other
tests / scenarios, but this test is simpler because it waits for a response to
the add before moving on to possibly delete that add.  For this scenario, the
retry that caused the issue came from the cloud client.  It timed out its
original update and retried the update.  The retry completed.  Then the test
deleted that document.  Then the *original* update succeeded and added the doc
back.
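
The sequence above can be replayed as a toy simulation. This is only a sketch: the `index` set, the `replay` function and the event tuples are hypothetical stand-ins for illustration, not Solr code.

```python
# Toy replay of the failure sequence. `index` stands in for the set of
# document ids visible in the collection; nothing here is Solr API.

def replay(events):
    """Apply (op, doc_id) events in arrival order and return the final index."""
    index = set()
    for op, doc_id in events:
        if op == "add":
            index.add(doc_id)
        elif op == "delete":
            index.discard(doc_id)
        else:
            raise ValueError(f"unknown op: {op}")
    return index


# Order in which the operations took effect in the scenario described above.
events = [
    ("add", "doc1"),     # the retry of the timed-out add completes first
    ("delete", "doc1"),  # the test deletes the document
    ("add", "doc1"),     # the *original* add finally succeeds
]

final_index = replay(events)
print(final_index)  # the deleted document is present again
```

The test saw a successful add and a successful delete, yet the document ends up in the index because the timed-out request was never actually cancelled.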

Having the same timeouts on forwards to leaders as forwards from leaders has
turned out to be not-so-good.  Because the former happens *before* the latter,
if a replica update hangs, the to_leader update will time out and retry
*slightly* before the from_leader update times out to the replica (and maybe
succeeds by asking that replica to recover!).
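
The ordering problem can be checked with simple arithmetic. The sketch below is illustrative only: the function name, the 30 second timeouts and the 0.05 second forwarding delay are assumptions, not values taken from the log.

```python
def timeout_deadlines(start, forward_delay, to_leader_timeout, from_leader_timeout):
    """Return (to_leader_deadline, from_leader_deadline) in seconds.

    start:          time the update is sent to the leader
    forward_delay:  time that passes before the leader forwards it to the replica
    """
    to_leader_deadline = start + to_leader_timeout
    from_leader_deadline = start + forward_delay + from_leader_timeout
    return to_leader_deadline, from_leader_deadline


# Hypothetical numbers: identical 30 s timeouts, 0.05 s before the leader forwards.
to_leader, from_leader = timeout_deadlines(0.0, 0.05, 30.0, 30.0)
retry_fires_first = to_leader < from_leader
print(to_leader, from_leader, retry_fires_first)
```

With equal timeouts, the to_leader deadline always comes first by exactly the forwarding delay, so the retry starts before the leader has given up on the hung replica.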

Q) A replica receiving a forward *from* a leader - do we really need to have a 
ZK connection to accept that update?
Maybe so for defensive check reasons?

Here's how I think we need to fix this:
A) We need to figure out how long an update to a replica forwarded by the 
leader can reasonably take.  Then we need to make the socket timeout be greater 
than that.
B) We need to figure out how long an update to a leader can take (taking into 
account (A)), and make the socket timeout to the leader greater than that.
C) We need to figure out how long an update to a non-leader (which is then 
forwarded to a leader) can take, and make the socket timeout from SolrJ servers 
greater than that.
D) Make sure that the generic Jetty socket timeouts are greater than all of the 
above?
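
One way to picture (A) through (D) is as a chain in which each timeout is derived from the one behind it. This is a minimal sketch under stated assumptions: the function, its parameter names, the safety `margin` and the sample durations are all hypothetical, and the real worst-case durations would have to be measured.

```python
def layered_timeouts(replica_max, leader_local_max, forward_max, margin=1.5):
    """Derive socket timeouts (seconds) so each hop outlasts the hops behind it.

    replica_max:      longest reasonable update on a replica            (A)
    leader_local_max: longest reasonable local work on the leader       (B)
    forward_max:      longest reasonable overhead on a forwarding node  (C)
    margin:           safety factor > 1 (an assumption of this sketch)
    """
    if margin <= 1:
        raise ValueError("margin must be greater than 1")
    from_leader = replica_max * margin                     # (A)
    to_leader = (leader_local_max + from_leader) * margin  # (B), accounts for (A)
    client = (forward_max + to_leader) * margin            # (C), accounts for (B)
    jetty = client * margin                                # (D), above all others
    return {
        "from_leader": from_leader,
        "to_leader": to_leader,
        "client": client,
        "jetty": jetty,
    }


# Hypothetical worst-case durations in seconds.
timeouts = layered_timeouts(replica_max=20, leader_local_max=10, forward_max=2)
for name, seconds in timeouts.items():
    print(name, seconds)
```

Whatever the actual numbers are, the point is the strict ordering from_leader < to_leader < client < jetty, so the innermost hop is always the first to give up.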

If it's too hard to separate all these different socket timeouts now, then the
best approximation would be to try and minimize the time any update can take,
and raise all of the timeouts up high enough such that we should never see
them.

We should probably also take care to only retry in certain scenarios.  For 
instance, if we try to forward to a leader but can't reach the leader, we 
should retry on connect timeout, but never on socket timeout.
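
A retry policy along these lines might look like the following sketch. The exception classes and `forward_with_retry` are hypothetical names invented for illustration; they do not correspond to SolrJ or HttpClient classes.

```python
class ConnectTimeout(Exception):
    """Hypothetical: the connection was never established, so nothing was sent."""


class SocketTimeout(Exception):
    """Hypothetical: the request was sent but no response arrived in time."""


def forward_with_retry(send, max_attempts=3):
    """Call send() and retry only when the connection could not be made."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectTimeout:
            # Safe to retry: the update never reached the other side.
            if attempt == max_attempts:
                raise
        # SocketTimeout is deliberately not caught: the update may still be
        # running remotely, so retrying could apply it twice or out of order.


# Usage: a fake sender that fails to connect twice, then succeeds.
calls = {"n": 0}


def flaky_send():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectTimeout()
    return "ok"


result = forward_with_retry(flaky_send)
print(result, calls["n"])
```

A connect timeout means the request never left, so a retry cannot duplicate work. A socket timeout says nothing about whether the update was applied, which is why it propagates to the caller instead of triggering another attempt.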
                
> ChaosMonkey test failures
> -------------------------
>
>                 Key: SOLR-3180
>                 URL: https://issues.apache.org/jira/browse/SOLR-3180
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>         Attachments: CMSL_fail1.log, CMSL_hang_2.txt, CMSL_hang.txt, 
> fail.130101_034142.txt, fail.130102_020942.txt, fail.130103_105104.txt, 
> fail.130103_193722.txt, fail.inconsistent.txt, test_report_1.txt
>
>
> Handle intermittent failures in the ChaosMonkey tests.
