TestCloudManagedSchemaAddField is part of SOLR-3251, which I committed to trunk earlier today.
This test starts up one collection with 8 replicas, then adds 25 new fields. After each add-field request, which is sent to a random replica, all replicas' schemas are queried via the schema REST API to ensure that they have the new field. The failure here, AFAICT, is timing: after a new field is added, there is a delay before all replicas know about it. On my MacBook Pro, it was always sufficient to simply retry schema REST API requests, with no waiting - this never took more than a few retries (and retries were only ever required on the leader), so I capped it at 10 retries, thinking that this would be sufficient. But Jenkins is laggier than my Mac (surprise, surprise). This failure occurs with the first added field. I can see that all replicas get notified of a newly persisted schema within one millisecond (all other replicas' log entries had the same timestamp):

-----
[junit4:junit4] 1> INFO - 2013-04-22 16:26:40.661; org.apache.solr.schema.ManagedIndexSchema; Persisted managed schema at /configs/conf1/managed-schema
[junit4:junit4] 1> INFO - 2013-04-22 16:26:40.661; org.apache.solr.schema.ZkIndexSchemaReader$1; A schema change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/configs/conf1/managed-schema, has occurred - updating schema from ZooKeeper
…
-----

The first replica retrieved the new schema within a few milliseconds, but the last one took a lot longer, and by the time all of the schema REST API request retries were exhausted, at least one replica had not yet retrieved and refreshed its schema.
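For illustration, the retry loop I'm describing looks roughly like the sketch below. This is not the actual test code: `retryUntilTrue`, the poll counter, and the simulated replica are hypothetical stand-ins for the test's schema REST API polling, and the sleep parameter models the small wait between retries.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class SchemaRetry {
    // Hypothetical helper mirroring the test's retry loop: poll until the
    // condition holds, sleeping a little between attempts so laggy replicas
    // (e.g. on Jenkins) have time to pull the updated schema from ZooKeeper.
    static boolean retryUntilTrue(BooleanSupplier check, int maxRetries, long waitMs)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (check.getAsBoolean()) {
                return true;
            }
            Thread.sleep(waitMs); // small wait between retries
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a replica that only sees the new field on its 3rd poll,
        // standing in for GET /schema/fields/newfield1 returning 404, 404, 200.
        AtomicInteger polls = new AtomicInteger();
        boolean found = retryUntilTrue(() -> polls.incrementAndGet() >= 3, 20, 10);
        System.out.println("found=" + found + " polls=" + polls.get());
        // prints: found=true polls=3
    }
}
```

With no wait between attempts, 10 retries can all complete before the slowest replica's ZooKeeper watch fires; inserting even a 10ms sleep gives the watch callback a chance to run.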
In this case, as on my Mac, it's the leader that needed to be retried:

-----
[junit4:junit4] 1> INFO - 2013-04-22 16:26:40.664; org.apache.solr.schema.ZkIndexSchemaReader; Retrieved schema from ZooKeeper
[…]
[junit4:junit4] 2> ASYNC NEW_CORE C6 name=collection1 org.apache.solr.core.SolrCore@68f6cbbd url=http://127.0.0.1:18511/collection1 node=127.0.0.1:18511_ C6_STATE=coll:collection1 core:collection1 props:{shard=shard1, state=active, core=collection1, collection=collection1, node_name=127.0.0.1:18511_, base_url=http://127.0.0.1:18511, leader=true}
[junit4:junit4] 2> 37051 T2029 C6 P18511 orel.LogFilter.afterHandle 2013-04-22 19:26:40 140.211.11.196 - 140.211.11.196 18511 GET /schema/fields/newfield1 wt=xml 404 - 0 6 http://127.0.0.1:18511 Java/1.7.0_17 -
[…]
[junit4:junit4] 1> INFO - 2013-04-22 16:26:40.765; org.apache.solr.schema.ZkIndexSchemaReader; Retrieved schema from ZooKeeper
[…]
[junit4:junit4] 1> INFO - 2013-04-22 16:26:40.780; org.apache.solr.schema.ZkIndexSchemaReader; Finished refreshing schema in 17 ms
-----

So for some reason it took a long time (up to 119ms in this case) for the schema change to become available on the slowest replica. I can see a ChaosMonkey shard stop occurring at around the same time, and two different leaders reported, so I'm thinking the delay here is an intentional thwack that we're supposed to be able to survive. So I think the fix is to add a small wait (maybe 10ms) between retries (this retry logic is only in test code), and bump up the number of allowed retries to 20. I'll make that change on trunk.

Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
