Re: [JENKINS] Lucene-Solr-Tests-trunk-java7 - Build # 3917 - Failure

Steve Rowe Mon, 22 Apr 2013 11:41:24 -0700

TestCloudManagedSchemaAddField is part of SOLR-3251, which I committed to trunk 
earlier today.


This test starts up one collection with 8 replicas, then adds 25 new fields.  
After each add field request, which is sent to a random replica, all replicas' 
schemas are queried via the schema REST API to ensure that they have the new 
field.

The failure here, AFAICT, is timing: after a new field is added, there is a 
delay before all replicas know about it.  On my MacBook Pro, it was always 
sufficient to simply retry schema REST API requests, with no waiting.  This 
never took more than a few retries (and only the leader ever needed them), so 
I capped it at 10 retries, thinking that this would be sufficient.

But Jenkins is laggier than my Mac (surprise, surprise).  This failure occurs 
with the first added field.  I can see that all replicas get notified of a 
newly persisted schema within one millisecond (all other replicas' log entries 
had the same time stamp):

-----
[junit4:junit4]   1> INFO  - 2013-04-22 16:26:40.661; 
org.apache.solr.schema.ManagedIndexSchema; Persisted managed schema at 
/configs/conf1/managed-schema
[junit4:junit4]   1> INFO  - 2013-04-22 16:26:40.661; 
org.apache.solr.schema.ZkIndexSchemaReader$1; A schema change: WatchedEvent 
state:SyncConnected type:NodeDataChanged path:/configs/conf1/managed-schema, 
has occurred - updating schema from ZooKeeper …
-----

The first replica retrieved the new schema within a few milliseconds, but the 
last one took a lot longer, and when all of the schema REST API request retries 
were finished, at least one replica had not yet retrieved and refreshed its 
schema.  In this case, as on my Mac, it's the leader that needed to be retried:

-----
[junit4:junit4]   1> INFO  - 2013-04-22 16:26:40.664; 
org.apache.solr.schema.ZkIndexSchemaReader; Retrieved schema from ZooKeeper
[…]
[junit4:junit4]   2> ASYNC  NEW_CORE C6 name=collection1 
org.apache.solr.core.SolrCore@68f6cbbd url=http://127.0.0.1:18511/collection1 
node=127.0.0.1:18511_ C6_STATE=coll:collection1 core:collection1 
props:{shard=shard1, state=active, core=collection1, collection=collection1, 
node_name=127.0.0.1:18511_, base_url=http://127.0.0.1:18511, leader=true}
[junit4:junit4]   2> 37051 T2029 C6 P18511 orel.LogFilter.afterHandle 
2013-04-22        19:26:40        140.211.11.196  -       140.211.11.196  18511 
  GET     /schema/fields/newfield1        wt=xml  404     -       0       6     
  http://127.0.0.1:18511  Java/1.7.0_17   -
[…]
[junit4:junit4]   1> INFO  - 2013-04-22 16:26:40.765; 
org.apache.solr.schema.ZkIndexSchemaReader; Retrieved schema from ZooKeeper
[…]
[junit4:junit4]   1> INFO  - 2013-04-22 16:26:40.780; 
org.apache.solr.schema.ZkIndexSchemaReader; Finished refreshing schema in 17 ms
-----

So for some reason it took a long time (max 119ms in this case) for the schema 
change to be available in the slowest replica.

I can see a ChaosMonkey shard stop occurring in there at around the same time, 
and two different leaders reported, so I'm thinking the delay here is an 
intentional thwack that we're supposed to be able to survive.

So I think the fix is to add a small wait (maybe 10ms) between retries (this 
retry logic is only in test code), and bump up the number of allowed retries to 
20.
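
A minimal sketch of that kind of retry loop, for illustration only and not the 
actual test code: the class name RetryWithWait, the retry() helper, and the 
check passed to it are all hypothetical, and it uses Java 8 lambdas for 
brevity.  The check stands in for "the schema REST API request for the new 
field succeeded on this replica".

```java
import java.util.function.BooleanSupplier;

public class RetryWithWait {
    /**
     * Calls check up to maxRetries times, sleeping waitMillis between
     * attempts. Returns true as soon as check succeeds, and false if all
     * attempts fail or the thread is interrupted while waiting.
     */
    public static boolean retry(BooleanSupplier check, int maxRetries, long waitMillis) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (check.getAsBoolean()) {
                return true;
            }
            if (attempt < maxRetries - 1) { // no point sleeping after the last attempt
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        final long start = System.currentTimeMillis();
        // Hypothetical stand-in for a replica that only sees the new field
        // 119 ms after the add-field request.
        boolean visible = retry(() -> System.currentTimeMillis() - start >= 119, 20, 10);
        System.out.println(visible ? "field visible" : "gave up");
    }
}
```

With 20 attempts and a 10ms wait there are 19 waits, so the loop tolerates 
roughly 190ms of lag plus request time, which covers the 119ms seen above.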

I'll make that change on trunk.

Steve


