[ 
https://issues.apache.org/jira/browse/SOLR-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123736#comment-15123736
 ] 

Stephan Lagraulet edited comment on SOLR-3274 at 1/29/16 4:51 PM:
------------------------------------------------------------------

I think that this message could be misleading according to what is happening to 
our servers.

First we receive a disconnect event.
{code}
2016-01-29 14:52:38.868 INFO  
(zkCallback-3-thread-73-processing-n:solrnode027:8983_solr-EventThread) [   ] 
o.a.s.c.c.ConnectionManager Watcher 
org.apache.solr.common.cloud.ConnectionManager@478ca953 
name:ZooKeeperConnection 
Watcher:solrnode013:2181,solrnode014:2181,solrnode015:2181 got event 
WatchedEvent state:Disconnected type:None path:null path:null type:None
2016-01-29 14:52:38.868 INFO  
(zkCallback-3-thread-73-processing-n:solrnode027:8983_solr-EventThread) [   ] 
o.a.s.c.c.ConnectionManager zkClient has disconnected
{code}

Then we receive an "Expired" event:
{code}
2016-01-29 14:52:40.028 INFO  
(zkCallback-3-thread-73-processing-n:solrnode027:8983_solr-EventThread) [   ] 
o.a.s.c.c.ConnectionManager Watcher 
org.apache.solr.common.cloud.ConnectionManager@478ca953 
name:ZooKeeperConnection 
Watcher:solrnode013:2181,solrnode014:2181,solrnode015:2181 got event 
WatchedEvent state:Expired type:None path:null path:null type:None
2016-01-29 14:52:40.028 INFO  
(zkCallback-3-thread-73-processing-n:solrnode027:8983_solr-EventThread) [   ] 
o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was expired. 
Attempting to reconnect to recover relationship with ZooKeeper...
2016-01-29 14:52:40.029 INFO  
(zkCallback-3-thread-73-processing-n:solrnode027:8983_solr-EventThread) [   ] 
o.a.s.c.c.DefaultConnectionStrategy Connection expired - starting a new one...
{code}


Then we have the message "Cannot talk to ZooKeeper"
{code}
2016-01-29 14:52:40.034 ERROR (qtp1057941451-80095) [c:offers_full s:shard3 
r:core_node7 x:offers_full_shard3_replica2] o.a.s.h.RequestHandlerBase 
org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are 
disabled.
        at 
org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1459)
        at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:660)
        at 
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
        at 
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:98)
        at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:179)
        at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:135)
        at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:260)
        at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121)
        at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:225)
        at 
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:145)
        at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:186)
{code}

After this the connexion is established:
{code}
2016-01-29 14:52:40.393 INFO  
(zkCallback-3-thread-73-processing-n:a01solrf027.cdweb.biz:8983_solr-EventThread)
 [   ] o.a.s.c.c.ConnectionManager Watcher 
org.apache.solr.common.cloud.ConnectionManager@478ca953 
name:ZooKeeperConnection 
Watcher:a01solrf013.cdweb.biz:2181,a01solrf014.cdweb.biz:2181,a01solrf015.cdweb.biz:2181
 got event WatchedEvent state:SyncConnected type:None path:null path:null 
type:None
2016-01-29 14:52:40.393 INFO  
(zkCallback-3-thread-73-processing-n:a01solrf027.cdweb.biz:8983_solr-EventThread)
 [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2016-01-29 14:52:40.393 INFO  
(zkCallback-3-thread-73-processing-n:a01solrf027.cdweb.biz:8983_solr-EventThread)
 [   ] o.a.s.c.c.ConnectionManager Connection with ZooKeeper reestablished.
2016-01-29 14:52:40.393 INFO  
(zkCallback-3-thread-73-processing-n:a01solrf027.cdweb.biz:8983_solr-EventThread)
 [   ] o.a.s.c.ZkController ZooKeeper session re-connected ... refreshing core 
states after session expiration.
{code}

It is not that we cannot talk to ZK but instead we are currently establishing a 
new connection. We ought to wait until this connection is established before 
discarding these updates.

The other question is why do we get these events in the first place?
This server was the leader and this event happens whenever a replica starts a 
full recovery against this leader.


was (Author: stephlag):
I think that this message could be misleading according to what is happening to 
our servers:

First we receive a disconnect event.
{code}
2016-01-29 14:52:38.868 INFO  
(zkCallback-3-thread-73-processing-n:solrnode027:8983_solr-EventThread) [   ] 
o.a.s.c.c.ConnectionManager Watcher 
org.apache.solr.common.cloud.ConnectionManager@478ca953 
name:ZooKeeperConnection 
Watcher:solrnode013:2181,solrnode014:2181,solrnode015:2181 got event 
WatchedEvent state:Disconnected type:None path:null path:null type:None
2016-01-29 14:52:38.868 INFO  
(zkCallback-3-thread-73-processing-n:solrnode027:8983_solr-EventThread) [   ] 
o.a.s.c.c.ConnectionManager zkClient has disconnected
{code}

Then we receive an "Expired" event:
{code}
2016-01-29 14:52:40.028 INFO  
(zkCallback-3-thread-73-processing-n:solrnode027:8983_solr-EventThread) [   ] 
o.a.s.c.c.ConnectionManager Watcher 
org.apache.solr.common.cloud.ConnectionManager@478ca953 
name:ZooKeeperConnection 
Watcher:solrnode013:2181,solrnode014:2181,solrnode015:2181 got event 
WatchedEvent state:Expired type:None path:null path:null type:None
2016-01-29 14:52:40.028 INFO  
(zkCallback-3-thread-73-processing-n:solrnode027:8983_solr-EventThread) [   ] 
o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was expired. 
Attempting to reconnect to recover relationship with ZooKeeper...
2016-01-29 14:52:40.029 INFO  
(zkCallback-3-thread-73-processing-n:solrnode027:8983_solr-EventThread) [   ] 
o.a.s.c.c.DefaultConnectionStrategy Connection expired - starting a new one...
{code}

Then the connexion is established:
{code}
2016-01-29 14:52:40.393 INFO  
(zkCallback-3-thread-73-processing-n:a01solrf027.cdweb.biz:8983_solr-EventThread)
 [   ] o.a.s.c.c.ConnectionManager Watcher 
org.apache.solr.common.cloud.ConnectionManager@478ca953 
name:ZooKeeperConnection 
Watcher:a01solrf013.cdweb.biz:2181,a01solrf014.cdweb.biz:2181,a01solrf015.cdweb.biz:2181
 got event WatchedEvent state:SyncConnected type:None path:null path:null 
type:None
2016-01-29 14:52:40.393 INFO  
(zkCallback-3-thread-73-processing-n:a01solrf027.cdweb.biz:8983_solr-EventThread)
 [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2016-01-29 14:52:40.393 INFO  
(zkCallback-3-thread-73-processing-n:a01solrf027.cdweb.biz:8983_solr-EventThread)
 [   ] o.a.s.c.c.ConnectionManager Connection with ZooKeeper reestablished.
2016-01-29 14:52:40.393 INFO  
(zkCallback-3-thread-73-processing-n:a01solrf027.cdweb.biz:8983_solr-EventThread)
 [   ] o.a.s.c.ZkController ZooKeeper session re-connected ... refreshing core 
states after session expiration.
{code}

Then we have the message "Cannot talk to ZooKeeper"
{code}
2016-01-29 14:52:40.034 ERROR (qtp1057941451-80095) [c:offers_full s:shard3 
r:core_node7 x:offers_full_shard3_replica2] o.a.s.h.RequestHandlerBase 
org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are 
disabled.
        at 
org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1459)
        at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:660)
        at 
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
        at 
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:98)
        at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:179)
        at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:135)
        at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:260)
        at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121)
        at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:225)
        at 
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:145)
        at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:186)
{code}



> ZooKeeper related SolrCloud problems
> ------------------------------------
>
>                 Key: SOLR-3274
>                 URL: https://issues.apache.org/jira/browse/SOLR-3274
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0-ALPHA
>         Environment: Any
>            Reporter: Per Steffensen
>
> Same setup as in SOLR-3273. Well if I have to tell the entire truth we have 7 
> Solr servers, running 28 slices of the same collection (collA) - all slices 
> have one replica (two shards all in all - leader + replica) - 56 cores all in 
> all (8 shards on each solr instance). But anyways...
> Besides the problem reported in SOLR-3273, the system seems to run fine under 
> high load for several hours, but eventually errors like the ones shown below 
> start to occur. I might be wrong, but they all seem to indicate some kind of 
> unstability in the collaboration between Solr and ZooKeeper. I have to say 
> that I havnt been there to check ZooKeeper "at the moment where those 
> exception occur", but basically I dont believe the exceptions occur because 
> ZooKeeper is not running stable - at least when I go and check ZooKeeper 
> through other "channels" (e.g. my eclipse ZK plugin) it is always accepting 
> my connection and generally seems to be doing fine.
> Exception 1) Often the first error we see in solr.log is something like this
> {code}
> Mar 22, 2012 5:06:43 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - 
> Updates are disabled.
>         at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:678)
>         at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:250)
>         at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
>         at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:80)
>         at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
>         at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:407)
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
>         at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>         at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>         at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>         at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>         at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>         at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>         at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>         at 
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>         at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>         at org.mortbay.jetty.Server.handle(Server.java:326)
>         at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>         at 
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
>         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
>         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>         at 
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
>         at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> {code}
> I believe this error basically occurs because SolrZkClient.isConnected 
> reports false, which means that its internal "keeper.getState" does not 
> return ZooKeeper.States.CONNECTED. Im pretty sure that it has been CONNECTED 
> for a long time, since this error starts occuring after several hours of 
> processing without this problem showing. But why is it suddenly not connected 
> anymore?!
> Exception 2) We also see errors like the following, and if Im not mistaken, 
> they start occuring shortly after "Exception 1)" (above) shows for the fist 
> time
> {code}
> Mar 22, 2012 5:07:26 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: no servers hosting shard: 
>         at 
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149)
>         at 
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:123)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> Please note that the exception says "no servers hosting shard: <blank>". 
> Looking at the code a "shard"-string was actually supposed to be written at 
> <blank>.  Basically this means that HttpShardHandler.submit was called with 
> an empty "shard"-string parameter. But who does this? 
> CoreAdminHandler.handleDistribUrlAction or SearchHandler.handleRequestBody or 
> SyncStrategy or PeerSync or... I dont know, and maybe it is not that 
> relevant, because I guess they all get the "shard"-string from ZooKeeper. 
> Again something pointing in the direction of unstable collaboration between 
> Solr and ZooKeeper.
> Exception 3) We also see exceptions like this
> {code}
> Mar 25, 2012 3:05:38 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> WARNING: ZooKeeper watch triggered, but Solr cannot talk to ZK
> Mar 25, 2012 3:05:38 PM org.apache.solr.cloud.LeaderElector$1 process
> WARNING: 
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /collections/collA/leader_elect/slice26/election
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:266)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:263)
>         at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>         at 
> org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:263)
>         at 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:92)
>         at 
> org.apache.solr.cloud.LeaderElector.access$000(LeaderElector.java:57)
>         at 
> org.apache.solr.cloud.LeaderElector$1.process(LeaderElector.java:121)
>         at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>         at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> {code}
> Maybe this will we usable for some bug-fixing or for making the code more 
> stable. I know 4.0 is not stable/released yet, and that we therefore should 
> expect this kind of errors at the moment. So this is not negative criticism - 
> just reporting of issues observed when using SolrCloud features under high 
> load for several days. Any feedback is more than welcome.
> Regards, Per Steffensen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to