[ https://issues.apache.org/jira/browse/SOLR-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yonik Seeley updated SOLR-3180:
-------------------------------

    Attachment: fail.130103_193722.txt

Here's an analyzed log that I traced all the way to the end.

The issues involved are all timeout related (socket timeouts). Timing out an update request in general is bad, since the request itself normally continues on and can finish at some point in the future. We should strive to only time out requests that are truly / hopelessly hung.

There was a lot of timeout / retry activity that could cause problems for other tests / scenarios, but this test is simpler because it waits for a response to the add before moving on to possibly delete that add.

For this scenario, the retry that caused the issue was from the cloud client. It timed out its original update and retried the update. The retry completed. Then the test deleted that document. Then the *original* update succeeded and added the doc back.

Having the same timeouts on forwards to leaders as forwards from leaders has turned out to be not-so-good. Because the former happens *before* the latter, if a replica update hangs, the to_leader update will time out and retry *slightly* before the from_leader times out to the replica (and maybe succeeds by asking that replica to recover!).

Q) A replica receiving a forward *from* a leader - do we really need to have a ZK connection to accept that update? Maybe so for defensive check reasons?

Here's how I think we need to fix this:

A) We need to figure out how long an update to a replica forwarded by the leader can reasonably take. Then we need to make the socket timeout be greater than that.
B) We need to figure out how long an update to a leader can take (taking into account (A)), and make the socket timeout to the leader greater than that.
C) We need to figure out how long an update to a non-leader (which is then forwarded to a leader) can take, and make the socket timeout from SolrJ servers to be greater than that.
D) Make sure that the generic Jetty socket timeouts are greater than all of the above?

If it's too hard to separate all these different socket timeouts now, then the best approximation would be to try to minimize the time any update can take, and raise all of the timeouts up high enough such that we should never see them.

We should probably also take care to only retry in certain scenarios. For instance, if we try to forward to a leader but can't reach the leader, we should retry on connect timeout, but never on socket timeout.

> ChaosMonkey test failures
> -------------------------
>
>                 Key: SOLR-3180
>                 URL: https://issues.apache.org/jira/browse/SOLR-3180
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>         Attachments: CMSL_fail1.log, CMSL_hang_2.txt, CMSL_hang.txt,
>                      fail.130101_034142.txt, fail.130102_020942.txt, fail.130103_105104.txt,
>                      fail.130103_193722.txt, fail.inconsistent.txt, test_report_1.txt
>
>
> Handle intermittent failures in the ChaosMonkey tests.
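
As a rough illustration of the timeout ordering in (A) through (D) and of the retry rule in the comment above, here is a minimal Java sketch. The class `UpdateTimeouts`, its fields, and the `Failure` enum are hypothetical names invented for this example and are not part of Solr or SolrJ. The sketch assumes the caller has already classified a failure as a connect timeout or a socket (read) timeout.

```java
/**
 * Hypothetical helper, NOT part of Solr: models the four socket timeouts
 * discussed in (A)-(D) and the "retry on connect timeout only" rule.
 */
final class UpdateTimeouts {

    /** Assumed classification of how a forwarded update failed. */
    enum Failure { CONNECT_TIMEOUT, SOCKET_TIMEOUT }

    final int leaderToReplicaMs; // (A) leader -> replica forward
    final int toLeaderMs;        // (B) non-leader -> leader forward
    final int clientMs;          // (C) SolrJ client -> any node
    final int jettyMs;           // (D) generic Jetty socket timeout

    UpdateTimeouts(int leaderToReplicaMs, int toLeaderMs, int clientMs, int jettyMs) {
        this.leaderToReplicaMs = leaderToReplicaMs;
        this.toLeaderMs = toLeaderMs;
        this.clientMs = clientMs;
        this.jettyMs = jettyMs;
    }

    /**
     * True when every outer hop waits strictly longer than the hop it wraps,
     * so an inner timeout always fires before the outer one.
     */
    boolean isConsistent() {
        return leaderToReplicaMs < toLeaderMs
                && toLeaderMs < clientMs
                && clientMs < jettyMs;
    }

    /**
     * Retry only if the connection was never established. After a socket
     * timeout the original update may still complete later, so a retry
     * risks applying the same update twice.
     */
    static boolean shouldRetry(Failure failure) {
        return failure == Failure.CONNECT_TIMEOUT;
    }
}
```

With placeholder values, `new UpdateTimeouts(30000, 60000, 90000, 120000).isConsistent()` returns true, while using one shared value for all four (the situation described above) returns false. The numbers are illustrative only; the real values depend on how long each hop can reasonably take.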