[
https://issues.apache.org/jira/browse/SOLR-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris M. Hostetter updated SOLR-16753:
--------------------------------------
Attachment: SOLR-16753.txt
Status: Open (was: Open)
I'm attaching a log file from when I was able to trigger this failure locally
(running {{gradle clean check}}) _after_ committing SOLR-16751. Unfortunately the
seed doesn't reproduce, which makes me suspicious that it's a timing-related
problem exacerbated by high CPU load (i.e. a jenkins box, or running lots of
concurrent tests).
The problem almost seems like it must be related to the ZK watchers and/or
reading stale state?
Here's the {{SliceMutator}} updating the state of the old shard and the new
split shards, which triggers {{zkCallback}} threads, which in turn trigger the
watcher set by the test, and yet the watcher still doesn't see the expected
number of active slices...
{noformat}
2> 382813 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state
invoked for collection: coll_NRT_PULL with message: {
2> "collection":"coll_NRT_PULL",
2> "shard1_1":"active",
2> "operation":"updateshardstate",
2> "shard1_0":"active",
2> "shard1":"inactive"}
2> 382813 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state
shard1_1 to active
2> 382813 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state
shard1_0 to active
2> 382813 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state
shard1 to inactive
2> 382815 INFO (zkCallback-3873-thread-1) [] o.a.s.c.c.ZkStateReader A
cluster state change: [WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/collections/coll_NRT_PULL/state.json] for
collection [coll_NRT_PULL] has occurred - updating... (live nodes size: [6])
2> 382815 INFO (zkCallback-3857-thread-2) [] o.a.s.c.c.ZkStateReader A
cluster state change: [WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/collections/coll_NRT_PULL/state.json] for
collection [coll_NRT_PULL] has occurred - updating... (live nodes size: [6])
2> 382817 INFO (zkCallback-3854-thread-1) [] o.a.s.c.c.ZkStateReader A
cluster state change: [WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/collections/coll_NRT_PULL/state.json] for
collection [coll_NRT_PULL] has occurred - updating... (live nodes size: [6])
2> 382818 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.RecoveryStrategy Finished recovery
process, successful=[true] msTimeTaken=84.0
2> 382818 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.RecoveryStrategy Finished recovery
process. recoveringAfterStartup=true msTimeTaken=85.0
2> 382818 INFO (watches-3871-thread-1) [] o.a.s.c.SolrCloudTestCase active
slice count: 1 expected: 2
{noformat}
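For context, the check behind that last "active slice count" log line boils down
to a predicate-based wait along these lines. This is only a rough sketch, not the
actual test code: it assumes the public {{ZkStateReader.waitForState}} /
{{CollectionStatePredicate}} API, and the helper name and 90 second timeout are
invented for illustration.
{code:java}
// Rough sketch (not the actual test code) of the kind of predicate-based wait
// behind the "active slice count" log line.  Assumes the public
// ZkStateReader.waitForState / CollectionStatePredicate API; the helper name
// and the 90 second timeout are invented for illustration.
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.ZkStateReader;

public class ActiveSliceWaitSketch {

  static void waitForTwoActiveSlices(ZkStateReader reader, String collection)
      throws InterruptedException, TimeoutException {
    reader.waitForState(collection, 90, TimeUnit.SECONDS,
        (Set<String> liveNodes, DocCollection state) -> {
          if (state == null) return false;
          // 'state' is whatever DocCollection the zkCallback delivered; if it
          // is stale (still pre-split), the count here is 1 even though the
          // SliceMutator has already marked shard1_0/shard1_1 active and
          // shard1 inactive in ZK.
          return state.getActiveSlices().size() == 2;
        });
  }
}
{code}
If the {{DocCollection}} handed to that predicate lags behind what the
{{SliceMutator}} just wrote, the count stays at 1 until another state.json
update arrives, which would line up with the stale-state / timing theory above.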
[~noble.paul] - can you please try to dig into this?
> SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull failures
> -----------------------------------------------------------------------
>
> Key: SOLR-16753
> URL: https://issues.apache.org/jira/browse/SOLR-16753
> Project: Solr
> Issue Type: Test
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Chris M. Hostetter
> Assignee: Noble Paul
> Priority: Major
> Attachments: SOLR-16753.txt
>
>
> {{SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull}} was
> added on 2023-03-13, but somewhere between 2023-04-02 and 2023-04-09 it
> started failing 15-20% of the time on jenkins jobs, with seeds that don't
> reliably reproduce.
> At first this seemed like it might be related to SOLR-16751, but even with
> that fix, failures are still happening.