[
https://issues.apache.org/jira/browse/SOLR-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris M. Hostetter updated SOLR-16753:
--------------------------------------
Attachment: SOLR-16753.txt
Status: Open (was: Open)
I'm attaching a log file from when I was able to trigger this failure locally
(running {{gradle clean check}}) _after_ committing SOLR-16751. Unfortunately the
seed doesn't reproduce, which makes me suspicious that it's a timing-related
problem exacerbated by high CPU load (i.e. a jenkins box, or running lots of
concurrent tests).
The problem almost seems like it must be related to the ZK watchers and/or
reading stale state?
Here's the {{SliceMutator}} updating the state of the old shard and the new
split shards, which triggers {{zkCallback}} threads, which in turn trigger the
watcher set by the test, and yet the watcher still doesn't see the expected
number of active slices...
{noformat}
2> 382813 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state
invoked for collection: coll_NRT_PULL with message: {
2> "collection":"coll_NRT_PULL",
2> "shard1_1":"active",
2> "operation":"updateshardstate",
2> "shard1_0":"active",
2> "shard1":"inactive"}
2> 382813 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state
shard1_1 to active
2> 382813 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state
shard1_0 to active
2> 382813 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.o.SliceMutator Update shard state
shard1 to inactive
2> 382815 INFO (zkCallback-3873-thread-1) [] o.a.s.c.c.ZkStateReader A
cluster state change: [WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/collections/coll_NRT_PULL/state.json] for
collection [coll_NRT_PULL] has occurred - updating... (live nodes size: [6])
2> 382815 INFO (zkCallback-3857-thread-2) [] o.a.s.c.c.ZkStateReader A
cluster state change: [WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/collections/coll_NRT_PULL/state.json] for
collection [coll_NRT_PULL] has occurred - updating... (live nodes size: [6])
2> 382817 INFO (zkCallback-3854-thread-1) [] o.a.s.c.c.ZkStateReader A
cluster state change: [WatchedEvent state:SyncConnected
type:NodeChildrenChanged path:/collections/coll_NRT_PULL/state.json] for
collection [coll_NRT_PULL] has occurred - updating... (live nodes size: [6])
2> 382818 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.RecoveryStrategy Finished recovery
process, successful=[true] msTimeTaken=84.0
2> 382818 INFO
(recoveryExecutor-3853-thread-1-processing-127.0.0.1:38817_solr
coll_NRT_PULL_shard1_1_replica_p1 coll_NRT_PULL shard1_1 core_node10)
[n:127.0.0.1:38817_solr c:coll_NRT_PULL s:shard1_1 r:core_node10
x:coll_NRT_PULL_shard1_1_replica_p1] o.a.s.c.RecoveryStrategy Finished recovery
process. recoveringAfterStartup=true msTimeTaken=85.0
2> 382818 INFO (watches-3871-thread-1) [] o.a.s.c.SolrCloudTestCase active
slice count: 1 expected: 2
{noformat}
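For context, the check behind that last "active slice count" log line boils down
to a predicate-based wait along these lines. This is only a rough sketch, not the
actual test code: it assumes the public {{ZkStateReader.waitForState}} /
{{CollectionStatePredicate}} API, and the helper name and 90 second timeout are
invented for illustration.
{code:java}
// Rough sketch (not the actual test code) of the kind of predicate-based wait
// behind the "active slice count" log line.  Assumes the public
// ZkStateReader.waitForState / CollectionStatePredicate API; the helper name
// and the 90 second timeout are invented for illustration.
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.ZkStateReader;

public class ActiveSliceWaitSketch {

  static void waitForTwoActiveSlices(ZkStateReader reader, String collection)
      throws InterruptedException, TimeoutException {
    reader.waitForState(collection, 90, TimeUnit.SECONDS,
        (Set<String> liveNodes, DocCollection state) -> {
          if (state == null) return false;
          // 'state' is whatever DocCollection the zkCallback delivered; if it
          // is stale (still pre-split), the count here is 1 even though the
          // SliceMutator has already marked shard1_0/shard1_1 active and
          // shard1 inactive in ZK.
          return state.getActiveSlices().size() == 2;
        });
  }
}
{code}
If the {{DocCollection}} handed to that predicate lags behind what the
{{SliceMutator}} just wrote, the count stays at 1 until another state.json
update arrives, which would line up with the stale-state / timing theory above.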
[~noble.paul] - can you please try to dig into this?
> SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull failures
> -----------------------------------------------------------------------
>
> Key: SOLR-16753
> URL: https://issues.apache.org/jira/browse/SOLR-16753
> Project: Solr
> Issue Type: Test
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Chris M. Hostetter
> Assignee: Noble Paul
> Priority: Major
> Attachments: SOLR-16753.txt
>
>
> {{SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull}} was
> added on 2023-03-13, but somewhere between 2023-04-02 and 2023-04-09 it
> started failing 15-20% of the time on jenkins jobs, with seeds that don't
> reliably reproduce.
> At first this seemed like it might be related to SOLR-16751, but even with
> that fix, failures are still happening.