[ https://issues.apache.org/jira/browse/SOLR-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631384#comment-13631384 ]
Shalin Shekhar Mangar commented on SOLR-3755:
---------------------------------------------

bq. Anshum suggested over chat that we should think about combining ShardSplitTest and ChaosMonkeyShardSplit tests into one to avoid code duplication. I'll try to see if we can do that.

I've changed ChaosMonkeyShardSplitTest to extend ShardSplitTest so that we can share most of the code. The ChaosMonkey test is not completely correct and I intend to improve it.

bq. The original change around this made preRegister start taking a core rather than a core descriptor. I'd like to work that out so it doesn't need to be the case.

I'll revert the change to the preRegister method signature and find another way.

I've found two kinds of test failures of (ChaosMonkey)ShardSplitTest. The first is because of the following sequence of events:
# A doc addition fails (because of the kill leader jetty command), client throws an exception and therefore the docCount variable is not incremented inside the index thread.
# However, the doc addition is recorded in the update logs (of the proxy node?) and replayed on the new leader so in reality, the doc does get added to the shard.
# Split happens and we assert on docCounts being equal in the server which fails because the server has the document that we have not counted.

This happens mostly with Lucene-Solr-Tests-4.x-Java6 builds. The bug is in the tests and not in the split code.

Following is the stack trace:
{code}
[junit4:junit4] 1> ERROR - 2013-04-14 14:24:27.697; org.apache.solr.cloud.ChaosMonkeyShardSplitTest$1; Exception while adding doc
[junit4:junit4] 1> org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://127.0.0.1:34203/h/y/collection1, http://127.0.0.1:34304/h/y/collection1, http://127.0.0.1:34311/h/y/collection1, http://127.0.0.1:34270/h/y/collection1]
[junit4:junit4] 1>     at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333)
[junit4:junit4] 1>     at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:306)
[junit4:junit4] 1>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
[junit4:junit4] 1>     at org.apache.solr.cloud.AbstractFullDistribZkTestBase.indexDoc(AbstractFullDistribZkTestBase.java:561)
[junit4:junit4] 1>     at org.apache.solr.cloud.ChaosMonkeyShardSplitTest.indexr(ChaosMonkeyShardSplitTest.java:434)
[junit4:junit4] 1>     at org.apache.solr.cloud.ChaosMonkeyShardSplitTest$1.run(ChaosMonkeyShardSplitTest.java:158)
[junit4:junit4] 1> Caused by: org.apache.solr.common.SolrException: Server at http://127.0.0.1:34311/h/y/collection1 returned non ok status:503, message:Service Unavailable
[junit4:junit4] 1>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
[junit4:junit4] 1>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
[junit4:junit4] 1>     at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)
[junit4:junit4] 1>     ... 5 more
{code}

Perhaps we should check the exception message and continue to count such a document?

The second kind of test failure is where a document add fails due to a version conflict. This exception is always seen just after "updateshardstate" is called to switch the shard states.

Following is the relevant log:
{code}
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state invoked for collection: collection1
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1 to inactive
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1_0 to active
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1_1 to active
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.873; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp= path=/update params={wt=javabin&version=2} {add=[169 (1432319507166134272)]} 0 2
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.884; org.apache.solr.update.processor.LogUpdateProcessor; [collection1_shard1_1_replica1] webapp= path=/update params={distrib.from=http://127.0.0.1:41028/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2} {} 0 1
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.885; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp= path=/update params={distrib.from=http://127.0.0.1:41028/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2} {add=[169 (1432319507173474304)]} 0 2
[junit4:junit4] 1> ERROR - 2013-04-14 19:05:26.885; org.apache.solr.common.SolrException; shard update error StdNode: http://127.0.0.1:41028/collection1_shard1_1_replica1/:org.apache.solr.common.SolrException: version conflict for 169 expected=1432319507173474304 actual=-1
[junit4:junit4] 1>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:404)
[junit4:junit4] 1>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
[junit4:junit4] 1>     at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
[junit4:junit4] 1>     at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
[junit4:junit4] 1>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
[junit4:junit4] 1>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
[junit4:junit4] 1>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
[junit4:junit4] 1>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
[junit4:junit4] 1>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
[junit4:junit4] 1>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
[junit4:junit4] 1>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit4:junit4] 1>     at java.lang.Thread.run(Thread.java:679)
[junit4:junit4] 1>
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.886; org.apache.solr.update.processor.DistributedUpdateProcessor; try and ask http://127.0.0.1:41028 to recover
{code}

I'm not sure yet why a version conflict will happen and why it follows an "updateshardstate" command.

> shard splitting
> ---------------
>
>                 Key: SOLR-3755
>                 URL: https://issues.apache.org/jira/browse/SOLR-3755
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 4.3, 5.0
>
>         Attachments: SOLR-3755-combined.patch,
> SOLR-3755-combinedWithReplication.patch, SOLR-3755-CoreAdmin.patch,
> SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch,
> SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch,
> SOLR-3755.patch, SOLR-3755.patch, SOLR-3755-testSplitter.patch,
> SOLR-3755-testSplitter.patch
>
>
> We can currently easily add replicas to handle increases in query volume, but
> we should also add a way to add additional shards dynamically by splitting
> existing shards.
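
The first failure mode in the comment above (an add that fails on the client but is later replayed on the new leader) means a thrown exception does not prove the document is absent from the shard. Below is a minimal sketch of one way a test could account for that, assuming every add uses a unique id. It keeps confirmed and uncertain adds apart and accepts any server-side count inside the resulting range. `DocCountTracker` and its methods are hypothetical names used only for illustration; they are not part of Solr or of the attached patches, and this is not necessarily how the test was eventually fixed.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical test helper (not a Solr class): tracks what the index thread knows about each add. */
class DocCountTracker {
    // Ids whose add request returned normally.
    private final Set<String> confirmed = new HashSet<>();
    // Ids whose add request threw; the server may or may not have applied them.
    private final Set<String> uncertain = new HashSet<>();

    /** Call after an add returns without an exception. */
    synchronized void recordSuccess(String id) {
        uncertain.remove(id);
        confirmed.add(id);
    }

    /** Call from the catch block when an add throws. */
    synchronized void recordFailure(String id) {
        // A doc already confirmed stays confirmed even if a later retry fails.
        if (!confirmed.contains(id)) {
            uncertain.add(id);
        }
    }

    /** True if the server count is explainable by confirmed adds plus any subset of uncertain ones. */
    synchronized boolean isConsistentWith(long serverCount) {
        long min = confirmed.size();
        long max = min + uncertain.size();
        return serverCount >= min && serverCount <= max;
    }
}
```

In a test like the one discussed, the index thread would call `recordSuccess` after the add returns and `recordFailure` when it throws, and the final assertion would use `isConsistentWith` on the count reported by the server instead of requiring exact equality. The range check is looser than an exact count, so it trades some strictness for tolerance of replayed updates.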