ZooKeeper related SolrCloud problems
------------------------------------
Key: SOLR-3274
URL: https://issues.apache.org/jira/browse/SOLR-3274
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.0
Environment: Any
Reporter: Per Steffensen
Same setup as in SOLR-3273. To be fully precise, we have 7 Solr servers
running 28 slices of the same collection (collA). Each slice has one replica
(two shards in total: leader + replica), which gives 56 cores overall
(8 shards on each Solr instance).
Besides the problem reported in SOLR-3273, the system seems to run fine under
high load for several hours, but eventually errors like the ones shown below
start to occur. I might be wrong, but they all seem to indicate some kind of
instability in the collaboration between Solr and ZooKeeper. I have not been
able to check ZooKeeper at the exact moment the exceptions occur, but I do not
believe they occur because ZooKeeper itself is unstable: whenever I check
ZooKeeper through other channels (e.g. my Eclipse ZK plugin), it accepts my
connection and generally seems to be doing fine.
Exception 1) Often the first error we see in solr.log is something like this:
{code}
Mar 22, 2012 5:06:43 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
    at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:678)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:250)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:80)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:407)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
{code}
I believe this error basically occurs because SolrZkClient.isConnected reports
false, which means that its internal "keeper.getState" does not return
ZooKeeper.States.CONNECTED. I'm pretty sure that it has been CONNECTED for a
long time, since this error only starts occurring after several hours of
processing without any problem. But why is it suddenly not connected anymore?!
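One way to start answering that question would be to record every client state
change with a timestamp, so that the first "Cannot talk to ZooKeeper" error can
be matched against the time the state left CONNECTED. The following is only a
sketch under assumptions: ZkState and ZkStateHistory are hypothetical names,
ZkState is a reduced stand-in for ZooKeeper.States, and the 15 second session
timeout is an assumed value that is not taken from our setup. In a real system
observe() would be fed from the client's state, for example from a ZooKeeper
Watcher.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, reduced stand-in for ZooKeeper.States.
enum ZkState { CONNECTING, CONNECTED, CLOSED }

// Hypothetical helper: remembers when the client state changed.
final class ZkStateHistory {
    private final List<Long> times = new ArrayList<>();
    private final List<ZkState> states = new ArrayList<>();

    // Record a sample; consecutive samples with the same state are collapsed.
    synchronized void observe(long nowMillis, ZkState state) {
        if (states.isEmpty() || states.get(states.size() - 1) != state) {
            times.add(nowMillis);
            states.add(state);
        }
    }

    // Same rule as described above: only CONNECTED counts as connected.
    synchronized boolean isConnected() {
        return !states.isEmpty()
            && states.get(states.size() - 1) == ZkState.CONNECTED;
    }

    // Longest continuous time spent outside CONNECTED, up to nowMillis.
    synchronized long longestOutageMillis(long nowMillis) {
        long longest = 0;
        long outageStart = -1; // -1 means no outage is in progress
        for (int i = 0; i < states.size(); i++) {
            boolean connected = states.get(i) == ZkState.CONNECTED;
            if (!connected && outageStart < 0) {
                outageStart = times.get(i);
            } else if (connected && outageStart >= 0) {
                longest = Math.max(longest, times.get(i) - outageStart);
                outageStart = -1;
            }
        }
        if (outageStart >= 0) {
            longest = Math.max(longest, nowMillis - outageStart);
        }
        return longest;
    }
}

final class ZkStateHistoryDemo {
    public static void main(String[] args) {
        long sessionTimeoutMillis = 15_000; // assumed value
        ZkStateHistory history = new ZkStateHistory();
        history.observe(0, ZkState.CONNECTING);
        history.observe(200, ZkState.CONNECTED);
        history.observe(18_000_000, ZkState.CONNECTING); // about 5 hours in
        history.observe(18_040_000, ZkState.CONNECTED);
        long outage = history.longestOutageMillis(18_050_000);
        System.out.println("connected now: " + history.isConnected());
        System.out.println("longest outage (ms): " + outage);
        System.out.println("longer than session timeout: "
            + (outage > sessionTimeoutMillis));
    }
}
```

If the longest period outside CONNECTED turns out to be longer than the
configured session timeout, that would fit with a session expiry on the
ZooKeeper side rather than with ZooKeeper being down.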
Exception 2) We also see errors like the following, and if I'm not mistaken,
they start occurring shortly after "Exception 1)" (above) shows up for the
first time:
{code}
Mar 22, 2012 5:07:26 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149)
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:123)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
{code}
Please note that the exception says "no servers hosting shard: <blank>".
Looking at the code, a "shard" string was supposed to be written at <blank>.
This means that HttpShardHandler.submit was called with an empty "shard" string
parameter. But who does this? CoreAdminHandler.handleDistribUrlAction,
SearchHandler.handleRequestBody, SyncStrategy, PeerSync, or something else? I
don't know, and maybe it is not that relevant, because I guess they all get the
"shard" string from ZooKeeper. Again, this points in the direction of unstable
collaboration between Solr and ZooKeeper.
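To find out who passes the empty string, a guard in front of the submit call
could name the caller in the error message. The following is only an
illustration and not Solr code: ShardUrls, parse, requireNonEmpty and the
caller argument are hypothetical, and the "|" separator between replica URLs is
an assumption about the shard string format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr.
final class ShardUrls {
    // Split a shard string into replica URLs, skipping blank entries.
    // Assumes replicas of one shard are separated by "|".
    static List<String> parse(String shard) {
        List<String> urls = new ArrayList<>();
        if (shard != null) {
            for (String part : shard.split("\\|")) {
                String url = part.trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        }
        return urls;
    }

    // Fail with a message that shows the raw input and who asked for it.
    static List<String> requireNonEmpty(String shard, String caller) {
        List<String> urls = parse(shard);
        if (urls.isEmpty()) {
            throw new IllegalStateException("no servers hosting shard: '"
                + shard + "' (requested by " + caller + ")");
        }
        return urls;
    }
}

final class ShardUrlsDemo {
    public static void main(String[] args) {
        // Host names are made up for the example.
        System.out.println(ShardUrls.parse("host1:8983/solr|host2:8983/solr"));
        try {
            ShardUrls.requireNonEmpty("", "SearchHandler");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the quotes around the shard string and the caller name in the message, an
empty or whitespace-only value would be visible directly in solr.log.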
Exception 3) We also see exceptions like this
{code}
Mar 25, 2012 3:05:38 PM org.apache.solr.common.cloud.ZkStateReader$3 process
WARNING: ZooKeeper watch triggered, but Solr cannot talk to ZK
Mar 25, 2012 3:05:38 PM org.apache.solr.cloud.LeaderElector$1 process
WARNING:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/collA/leader_elect/slice26/election
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
    at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:266)
    at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:263)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
    at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:263)
    at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:92)
    at org.apache.solr.cloud.LeaderElector.access$000(LeaderElector.java:57)
    at org.apache.solr.cloud.LeaderElector$1.process(LeaderElector.java:121)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
{code}
Maybe this will be usable for some bug fixing or for making the code more
stable. I know 4.0 is not stable/released yet, and that we should therefore
expect this kind of error at the moment. So this is not negative criticism,
just a report of issues observed when using SolrCloud features under high load
for several days. Any feedback is more than welcome.
Regards, Per Steffensen