[ https://issues.apache.org/jira/browse/SOLR-12923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793318#comment-16793318 ]
ASF subversion and git services commented on SOLR-12923:
--------------------------------------------------------
Commit 76babf876a49f82959cc36a1d7ef922a9c2dddff in lucene-solr's branch
refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=76babf8 ]
SOLR-12923: Fix some issues with concurrency and exception swallowing in SimClusterStateProvider/SimCloudManager
There are 3 tightly related bug fixes in these changes:
1) ConcurrentModificationExceptions were being thrown by some SimClusterStateProvider methods when creating collections/replicas, due to the use of ArrayLists in nodeReplicaMap. These ArrayLists were changed to use synchronizedList wrappers.
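Below is a minimal sketch of fix #1. It assumes a simplified node-to-replica map. NodeReplicaMapSketch and its methods are hypothetical names, not the real SimClusterStateProvider code. Collections.synchronizedList only makes individual calls such as add() thread-safe. Iterating or copying the list still requires holding the list's monitor, as the synchronizedList javadoc states.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical toy model of a node -> replica-names map; NOT the real
// SimClusterStateProvider.nodeReplicaMap.
class NodeReplicaMapSketch {
  private final Map<String, List<String>> nodeReplicaMap = new ConcurrentHashMap<>();

  // Each per-node list is a synchronizedList, so individual add() calls
  // from different threads are serialized on the list's own monitor.
  void addReplica(String node, String replica) {
    nodeReplicaMap
        .computeIfAbsent(node, n -> Collections.synchronizedList(new ArrayList<>()))
        .add(replica);
  }

  // Copying/iterating is not atomic on its own: the synchronizedList
  // javadoc requires holding the list's lock while traversing it.
  List<String> snapshot(String node) {
    List<String> replicas = nodeReplicaMap.get(node);
    if (replicas == null) {
      return new ArrayList<>();
    }
    synchronized (replicas) {
      return new ArrayList<>(replicas);
    }
  }

  // Adds replicas from several threads at once and returns how many
  // ended up in the list for "node1".
  static int concurrentAddDemo(int threads, int perThread) {
    NodeReplicaMapSketch sketch = new NodeReplicaMapSketch();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<?>> futures = new ArrayList<>();
    for (int t = 0; t < threads; t++) {
      final int id = t;
      futures.add(pool.submit(() -> {
        for (int i = 0; i < perThread; i++) {
          sketch.addReplica("node1", "replica_" + id + "_" + i);
        }
      }));
    }
    try {
      for (Future<?> f : futures) {
        f.get(); // rethrows any task failure instead of silently dropping it
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
    return sketch.snapshot("node1").size();
  }
}
```

Calling NodeReplicaMapSketch.concurrentAddDemo(4, 1000) should yield all 4000 entries. Note that the demo calls get() on every Future, which is exactly the check that item #2 says was missing.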
2) The exceptions from #1 were being swallowed/hidden by code that called SimCloudManager.submit() without checking the result of the returned Future object. (As a result, tests waiting for a particular ClusterShape would time out regardless of how long they waited.) To protect against "silent" failures like this, SimCloudManager.submit() has been updated to wrap all input Callables such that any uncaught errors will be logged and "counted." SimSolrCloudTestCase will ensure a suite-level failure if any such failures are counted.
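Below is a rough sketch of the wrap-log-count approach from #2, built on a plain ExecutorService. CountingSubmitter, backgroundFailures and assertNoBackgroundFailures() are hypothetical names for illustration. The real SimCloudManager and SimSolrCloudTestCase code may be structured differently.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of the "wrap, log, and count" idea; names are
// illustrative, not SimCloudManager's real API.
class CountingSubmitter {
  private static final Logger LOG = Logger.getLogger(CountingSubmitter.class.getName());

  private final ExecutorService executor = Executors.newFixedThreadPool(2);
  // How many submitted tasks ended with an uncaught exception or error.
  private final AtomicInteger backgroundFailures = new AtomicInteger();

  <T> Future<T> submit(Callable<T> task) {
    // The wrapper records the failure even if no caller ever invokes get().
    Callable<T> wrapped = () -> {
      try {
        return task.call();
      } catch (Exception | Error e) {
        backgroundFailures.incrementAndGet();
        LOG.log(Level.SEVERE, "Uncaught failure in background task", e);
        throw e; // still propagate to the Future for callers that do check it
      }
    };
    return executor.submit(wrapped);
  }

  int failureCount() {
    return backgroundFailures.get();
  }

  // What a suite-level teardown check could look like.
  void assertNoBackgroundFailures() {
    int failures = backgroundFailures.get();
    if (failures > 0) {
      throw new AssertionError(failures + " background task(s) failed; see log");
    }
  }

  void shutdownAndWait() throws InterruptedException {
    executor.shutdown();
    executor.awaitTermination(10, TimeUnit.SECONDS);
  }

  // Submits one good and one failing task, never calling get() on either.
  static int fireAndForgetDemo() {
    CountingSubmitter submitter = new CountingSubmitter();
    submitter.submit(() -> "ok");
    submitter.submit(() -> {
      throw new IllegalStateException("replica no longer exists");
    });
    try {
      submitter.shutdownAndWait();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return submitter.failureCount();
  }
}
```

The wrapper still rethrows, so callers that do inspect the Future see the original exception. The counter only guarantees that callers who forget to check cannot hide the failure from the test suite.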
3) The changes in #2 exposed additional concurrency problems with the Callables involved in leader election: these would frequently throw IllegalStateExceptions due to assumptions about the state/existence of replicas when the Callables were created vs. when they were later run. Notably, a Callable may have been created that held a reference to a Slice, but by the time that Callable was run, the collection (or a node, etc.) referred to by that Slice may have been deleted. While fixing this, the leader election logic was also cleaned up so that adding a replica only triggers leader election for that shard, not every shard in the collection.
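Below is a hedged sketch of the idea behind #3. It uses a toy map in place of real cluster state and needs Java 10+ for List.copyOf. The election task captures only the collection and shard names and re-resolves them when it runs. It does not hold on to a Slice-like snapshot. LeaderElectionSketch and its methods are hypothetical, not Solr's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical toy cluster state: collection -> shard -> replica names.
// Illustrative only; it does not use Solr's Slice/Replica classes.
class LeaderElectionSketch {
  final Map<String, Map<String, List<String>>> state = new ConcurrentHashMap<>();
  final Map<String, String> leaders = new ConcurrentHashMap<>(); // "collection/shard" -> leader

  // The task captures only names and re-reads the state when it runs.
  // If the collection or shard vanished meanwhile, it returns false
  // instead of throwing.
  Callable<Boolean> electLeaderTask(String collection, String shard) {
    return () -> {
      Map<String, List<String>> shards = state.get(collection);
      if (shards == null) {
        return false; // collection deleted after the task was created
      }
      List<String> replicas = shards.get(shard);
      if (replicas == null || replicas.isEmpty()) {
        return false; // shard gone or has no replicas left
      }
      leaders.putIfAbsent(collection + "/" + shard, replicas.get(0)); // keep an existing leader
      return true;
    };
  }

  // Adding a replica schedules election for that one shard only.
  Callable<Boolean> addReplica(String collection, String shard, String replica) {
    state.computeIfAbsent(collection, c -> new ConcurrentHashMap<>())
        .merge(shard, List.of(replica), (existing, added) -> {
          List<String> merged = new ArrayList<>(existing);
          merged.addAll(added);
          return List.copyOf(merged); // immutable, so readers never see a half-updated list
        });
    return electLeaderTask(collection, shard);
  }

  static boolean run(Callable<Boolean> task) {
    try {
      return task.call();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Creates an election task, deletes the collection, then runs the task.
  static boolean runAfterDeleteDemo() {
    LeaderElectionSketch sketch = new LeaderElectionSketch();
    Callable<Boolean> task = sketch.addReplica("coll1", "shard1", "core_node1");
    sketch.state.remove("coll1");
    return run(task);
  }
}
```

In runAfterDeleteDemo() the collection disappears between task creation and execution. The task then reports false rather than failing with an IllegalStateException.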
While auditing this code, all usage of SimClusterStateProvider.lock was also cleaned up to remove every risky point where an exception could be thrown after acquiring the lock but before entering the try/finally that ensured it would be unlocked.
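Below is a small sketch of the lock discipline described above, using java.util.concurrent.locks.ReentrantLock. LockDisciplineSketch and its methods are hypothetical stand-ins, not SimClusterStateProvider code. Anything that can throw goes before lock(). The call to lock() is then immediately followed by the try whose finally releases it.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the lock discipline; names are illustrative,
// not SimClusterStateProvider's real fields or methods.
class LockDisciplineSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private int replicaCount = 0;

  // Risky shape (do not copy):
  //   lock.lock();
  //   String normalized = name.trim(); // an NPE here leaks the lock forever
  //   try { ... } finally { lock.unlock(); }

  // Safe shape: fallible work happens before lock(), and lock() is the
  // last statement before the try whose finally block unlocks.
  int addReplica(String name) {
    String normalized = name.trim(); // may throw NullPointerException; lock not yet held
    lock.lock();
    try {
      if (normalized.isEmpty()) {
        throw new IllegalArgumentException("empty replica name");
      }
      return ++replicaCount;
    } finally {
      lock.unlock();
    }
  }

  // Returns false instead of propagating a RuntimeException from addReplica.
  boolean tryAdd(String name) {
    try {
      addReplica(name);
      return true;
    } catch (RuntimeException e) {
      return false;
    }
  }

  boolean isLocked() {
    return lock.isLocked();
  }

  int count() {
    lock.lock();
    try {
      return replicaCount;
    } finally {
      lock.unlock();
    }
  }
}
```

Both failure paths (a null name before locking, an empty name inside the try) leave the lock released. Without that, a later caller would block forever.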
> The new AutoScaling tests are way too flaky and need special attention.
> -----------------------------------------------------------------------
>
> Key: SOLR-12923
> URL: https://issues.apache.org/jira/browse/SOLR-12923
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public (Default Security Level. Issues are Public)
> Components: Tests
> Reporter: Mark Miller
> Priority: Major
>
> I've already done some work here (not posted yet). We need to address this;
> these tests are too new to fail so often and so easily.
> I want to add beasting to precommit (LUCENE-8545) to help prevent tests that
> fail this easily from being committed.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)