[ 
https://issues.apache.org/jira/browse/SOLR-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631384#comment-13631384
 ] 

Shalin Shekhar Mangar commented on SOLR-3755:
---------------------------------------------

bq. Anshum suggested over chat that we should think about combining 
ShardSplitTest and ChaosMonkeyShardSplit tests into one to avoid code 
duplication. I'll try to see if we can do that.

I've changed ChaosMonkeyShardSplitTest to extend ShardSplitTest so that we can 
share most of the code. The ChaosMonkey test is not completely correct and I 
intend to improve it.

bq. The original change around this made preRegister start taking a core rather 
than a core descriptor. I'd like to work that out so it doesn't need to be the 
case.

I'll revert the change to the preRegister method signature and find another way.

I've found two kinds of test failures in (ChaosMonkey)ShardSplitTest.

The first is because of the following sequence of events:

# A doc addition fails (because of the kill leader jetty command). The client 
throws an exception, so the docCount variable is not incremented inside the 
index thread.
# However, the doc addition is recorded in the update logs (of the proxy node?) 
and replayed on the new leader, so in reality the doc does get added to the 
shard.
# The split happens and we assert that docCount equals the number of docs on 
the server, which fails because the server has the document that we have not 
counted.

This happens mostly with Lucene-Solr-Tests-4.x-Java6 builds. The bug is in the 
tests and not in the split code. The stack trace follows:

{code}
[junit4:junit4]   1> ERROR - 2013-04-14 14:24:27.697; 
org.apache.solr.cloud.ChaosMonkeyShardSplitTest$1; Exception while adding doc
[junit4:junit4]   1> org.apache.solr.client.solrj.SolrServerException: No live 
SolrServers available to handle this 
request:[http://127.0.0.1:34203/h/y/collection1, 
http://127.0.0.1:34304/h/y/collection1, http://127.0.0.1:34311/h/y/collection1, 
http://127.0.0.1:34270/h/y/collection1]
[junit4:junit4]   1>    at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333)
[junit4:junit4]   1>    at 
org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:306)
[junit4:junit4]   1>    at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
[junit4:junit4]   1>    at 
org.apache.solr.cloud.AbstractFullDistribZkTestBase.indexDoc(AbstractFullDistribZkTestBase.java:561)
[junit4:junit4]   1>    at 
org.apache.solr.cloud.ChaosMonkeyShardSplitTest.indexr(ChaosMonkeyShardSplitTest.java:434)
[junit4:junit4]   1>    at 
org.apache.solr.cloud.ChaosMonkeyShardSplitTest$1.run(ChaosMonkeyShardSplitTest.java:158)
[junit4:junit4]   1> Caused by: org.apache.solr.common.SolrException: Server at 
http://127.0.0.1:34311/h/y/collection1 returned non ok status:503, 
message:Service Unavailable
[junit4:junit4]   1>    at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
[junit4:junit4]   1>    at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
[junit4:junit4]   1>    at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)
[junit4:junit4]   1>    ... 5 more
{code}

Perhaps we should check the exception message and continue to count such a 
document?
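
A minimal sketch of that idea, in plain Java with no Solr dependencies. The 
names TolerantDocCounter, DocAdder and mayHaveBeenApplied are hypothetical and 
not part of the test code; the two message fragments are copied from the stack 
trace above. It still counts a document when the failure looks like the 
killed-leader case, on the assumption that the update was logged and will be 
replayed.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: none of these types exist in the Solr test framework. */
class TolerantDocCounter {

    /** Hypothetical stand-in for the test's indexing call. */
    interface DocAdder {
        void add(int id) throws Exception;
    }

    private final AtomicInteger docCount = new AtomicInteger();

    /** True if any exception in the cause chain matches the messages seen in the trace. */
    static boolean mayHaveBeenApplied(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (msg != null && (msg.contains("No live SolrServers available")
                    || msg.contains("Service Unavailable"))) {
                return true;
            }
        }
        return false;
    }

    /** Adds one doc and counts it on success or on a "possibly applied" failure. */
    void index(DocAdder adder, int id) {
        try {
            adder.add(id);
            docCount.incrementAndGet();
        } catch (Exception e) {
            // Assumption: the update may have reached an update log and be replayed.
            if (mayHaveBeenApplied(e)) {
                docCount.incrementAndGet();
            }
        }
    }

    int count() {
        return docCount.get();
    }

    public static void main(String[] args) {
        TolerantDocCounter counter = new TolerantDocCounter();
        counter.index(id -> { }, 1);
        counter.index(id -> {
            throw new RuntimeException("No live SolrServers available to handle this request");
        }, 2);
        System.out.println("counted docs: " + counter.count());
    }
}
```

This trades one inaccuracy for another: if such an update was never actually 
logged, the count would be too high, so comparing sets of document ids might 
prove more robust than comparing totals.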

The second kind of test failure is where a document add fails due to a version 
conflict. This exception is always seen just after "updateshardstate" is 
called to switch the shard states. The relevant log follows:

{code}
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; 
org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state invoked 
for collection: collection1
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; 
org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1 
to inactive
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; 
org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1_0 
to active
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; 
org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1_1 
to active
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.873; 
org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp= 
path=/update params={wt=javabin&version=2} {add=[169 (1432319507166134272)]} 0 2
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; 
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: 
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, 
has occurred - updating... (live nodes size: 5)
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; 
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: 
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, 
has occurred - updating... (live nodes size: 5)
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; 
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: 
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, 
has occurred - updating... (live nodes size: 5)
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; 
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: 
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, 
has occurred - updating... (live nodes size: 5)
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; 
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: 
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, 
has occurred - updating... (live nodes size: 5)
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; 
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: 
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, 
has occurred - updating... (live nodes size: 5)
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.884; 
org.apache.solr.update.processor.LogUpdateProcessor; 
[collection1_shard1_1_replica1] webapp= path=/update 
params={distrib.from=http://127.0.0.1:41028/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2}
 {} 0 1
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.885; 
org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp= 
path=/update 
params={distrib.from=http://127.0.0.1:41028/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2}
 {add=[169 (1432319507173474304)]} 0 2
[junit4:junit4]   1> ERROR - 2013-04-14 19:05:26.885; 
org.apache.solr.common.SolrException; shard update error StdNode: 
http://127.0.0.1:41028/collection1_shard1_1_replica1/:org.apache.solr.common.SolrException:
 version conflict for 169 expected=1432319507173474304 actual=-1
[junit4:junit4]   1>    at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:404)
[junit4:junit4]   1>    at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
[junit4:junit4]   1>    at 
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
[junit4:junit4]   1>    at 
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
[junit4:junit4]   1>    at 
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
[junit4:junit4]   1>    at 
java.util.concurrent.FutureTask.run(FutureTask.java:166)
[junit4:junit4]   1>    at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
[junit4:junit4]   1>    at 
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
[junit4:junit4]   1>    at 
java.util.concurrent.FutureTask.run(FutureTask.java:166)
[junit4:junit4]   1>    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
[junit4:junit4]   1>    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit4:junit4]   1>    at java.lang.Thread.run(Thread.java:679)
[junit4:junit4]   1> 
[junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.886; 
org.apache.solr.update.processor.DistributedUpdateProcessor; try and ask 
http://127.0.0.1:41028 to recover
{code}

I'm not sure yet why a version conflict happens or why it follows an 
"updateshardstate" command.
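
To help read the error line in the log above, here is a simplified model of an 
optimistic version check, loosely following the rules Solr documents for the 
_version_ field. It is an illustration only: VersionCheckSketch is a 
hypothetical class, not Solr's implementation, and reading actual=-1 as "no 
version found for this document" is an assumption rather than something the 
log states.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of an optimistic version check; not Solr's implementation. */
class VersionCheckSketch {

    // Last known version per document id.
    private final Map<String, Long> versions = new HashMap<>();

    /**
     * Applies an add if the expected version is compatible with the stored one.
     * expected == 0: no check; expected < 0: doc must not exist;
     * expected == 1: doc must exist; otherwise the versions must match exactly.
     */
    void add(String id, long expected, long newVersion) {
        Long current = versions.get(id);
        // Assumption: -1 stands for "no version found for this id".
        long actual = (current == null) ? -1L : current;
        boolean ok = expected == 0
                || expected == actual
                || (expected < 0 && actual < 0)
                || (expected == 1 && actual > 0);
        if (!ok) {
            throw new IllegalStateException("version conflict for " + id
                    + " expected=" + expected + " actual=" + actual);
        }
        versions.put(id, newVersion);
    }

    public static void main(String[] args) {
        VersionCheckSketch core = new VersionCheckSketch();
        try {
            // Same id and version as in the log; the core has never seen doc 169.
            core.add("169", 1432319507173474304L, 1432319507173474304L);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under that reading, the message would mean that the receiving core was asked 
to match an exact version for doc 169 but had no version of it at all. The 
sketch does not explain why that would follow an "updateshardstate" command.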
                
> shard splitting
> ---------------
>
>                 Key: SOLR-3755
>                 URL: https://issues.apache.org/jira/browse/SOLR-3755
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 4.3, 5.0
>
>         Attachments: SOLR-3755-combined.patch, 
> SOLR-3755-combinedWithReplication.patch, SOLR-3755-CoreAdmin.patch, 
> SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch, 
> SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch, 
> SOLR-3755.patch, SOLR-3755.patch, SOLR-3755-testSplitter.patch, 
> SOLR-3755-testSplitter.patch
>
>
> We can currently easily add replicas to handle increases in query volume, but 
> we should also add a way to add additional shards dynamically by splitting 
> existing shards.

