[ https://issues.apache.org/jira/browse/ZOOKEEPER-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285361#comment-14285361 ]

Chun Chen commented on ZOOKEEPER-706:
-------------------------------------

I think it is not only a large number of watches that can cause this error; too 
many ops in a single ZooKeeper#multi call can trigger it as well. I'm seeing the 
following error in the RM log:
{code}
2015-01-20 09:45:45,464 WARN org.apache.zookeeper.ClientCnxn: Session 0x24aeb334e8e000d for server c112/10.149.27.112:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2015-01-20 09:45:45,564 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:892)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:906)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:915)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeRMDTMasterKeyState(ZKRMStateStore.java:785)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeRMDTMasterKey(RMStateStore.java:678)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.storeNewMasterKey(RMDelegationTokenSecretManager.java:86)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.updateCurrentKey(AbstractDelegationTokenSecretManager.java:233)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.startThreads(AbstractDelegationTokenSecretManager.java:116)
        at org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.serviceStart(RMSecretManagerService.java:83)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:514)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:847)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$2.run(ResourceManager.java:888)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$2.run(ResourceManager.java:884)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:884)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:275)
        at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)


2015-01-20 09:59:49,665 WARN org.apache.zookeeper.ClientCnxn: Session 0x34aeb34ba360012 for server c114/10.149.27.114:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2015-01-20 09:59:49,765 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:892)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:678)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:184)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:170)
        at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:769)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:842)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:837)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:191)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:124)
        at java.lang.Thread.run(Thread.java:745)
{code}

And the corresponding ZK server log:
{code}
2015-01-20 09:45:46,377 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.149.27.114:33517
2015-01-20 09:45:46,381 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24aeb334e8e000d at /10.149.27.114:33517
2015-01-20 09:45:46,381 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x24aeb334e8e000d with negotiated timeout 10000 for client /10.149.27.114:33517
2015-01-20 09:45:46,381 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /10.149.27.114:33517
2015-01-20 09:45:46,381 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /10.149.27.114:33517
2015-01-20 09:45:46,481 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x24aeb334e8e000d due to java.io.IOException: Len error 9271869
2015-01-20 09:45:46,481 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.149.27.114:33517 which had sessionid 0x24aeb334e8e000d
{code}
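
The "Len error 9271869" in the server log above means the serialized request was 
about 9 MB, which the server rejects because it exceeds jute.maxbuffer (roughly 
1 MB by default on the server side, as far as I know). So a ZooKeeper#multi whose 
ops add up to more than that limit can never get through as one call, no matter 
how many times it is retried. One client-side mitigation would be to submit the 
ops in smaller batches. A minimal sketch, purely as an illustration (the helper 
name, the batch size, and using op count as a stand-in for request size are all 
my own assumptions, and splitting a multi gives up its atomicity):
{code}
// Hypothetical helper, not part of ZKRMStateStore: submit a large op list as
// several smaller ZooKeeper#multi calls so that no single request packet
// exceeds the server's jute.maxbuffer limit. Note that this trades away the
// atomicity of a single multi -- each batch commits (or fails) on its own.
import java.util.ArrayList;
import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.OpResult;
import org.apache.zookeeper.ZooKeeper;

public final class BatchedMulti {

    private BatchedMulti() {
    }

    // batchSize is a crude proxy for request size; it should be chosen so the
    // largest expected batch stays well under jute.maxbuffer.
    public static List<OpResult> multiInBatches(ZooKeeper zk, List<Op> ops, int batchSize)
            throws KeeperException, InterruptedException {
        List<OpResult> results = new ArrayList<OpResult>(ops.size());
        for (int from = 0; from < ops.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ops.size());
            // Each sub-list is sent to the server as its own multi request.
            results.addAll(zk.multi(ops.subList(from, to)));
        }
        return results;
    }
}
{code}
A fancier version could estimate the serialized size of each op and cut batches 
by bytes rather than by count, but that needs knowledge of the payload sizes.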

> large numbers of watches can cause session re-establishment to fail
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-706
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-706
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: c client, java client
>    Affects Versions: 3.1.2, 3.2.2, 3.3.0
>            Reporter: Patrick Hunt
>            Priority: Critical
>             Fix For: 3.5.1
>
>
> If a client sets a large number of watches, the "set watches" operation during 
> session re-establishment can fail.
> For example:
>  WARN  [NIOServerCxn.Factory:22801:NIOServerCnxn@417] - Exception causing 
> close of session 0xe727001201a4ee7c due to java.io.IOException: Len error 
> 4348380
> In this case the client was a web monitoring app and had set both data and 
> child watches on > 32k znodes.
> There are two issues I see here that we need to fix:
> 1) handle this case properly (split up the set watches into multiple calls, I 
> guess...)
> 2) the session should have expired after the "timeout". However, we seem to 
> consider any message from the client as re-setting the expiration on the 
> server side. We should probably only count messages from the client that are 
> sent during an established session; otherwise we can end up in a situation 
> where the session is never established but never expires either. Perhaps we 
> should create another JIRA for this particular issue.
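
Until the set-watches request is split up as suggested in (1) above, the only 
stop-gap I'm aware of (not something this ticket prescribes) is to raise 
jute.maxbuffer so the oversized packet is accepted; the ZooKeeper admin docs 
warn it has to be changed consistently on every server and client. The sketch 
below shows only the client side via the system property, with a purely 
illustrative 16 MB value and a hypothetical connect string:
{code}
// Assumption: this is a stop-gap workaround, not the fix proposed in this
// ticket. jute.maxbuffer must also be raised on every server JVM (e.g. via
// SERVER_JVMFLAGS), and the client reads the property when its classes load,
// so it must be set before the ZooKeeper handle is created.
import java.io.IOException;

import org.apache.zookeeper.ZooKeeper;

public final class ClientBufferWorkaround {

    private ClientBufferWorkaround() {
    }

    public static ZooKeeper connect(String connectString, int sessionTimeoutMs) throws IOException {
        // 16 MB is only an example value, large enough for the ~9 MB packet above.
        System.setProperty("jute.maxbuffer", String.valueOf(16 * 1024 * 1024));
        return new ZooKeeper(connectString, sessionTimeoutMs, event -> { /* no-op watcher */ });
    }
}
{code}
The real fix is still to keep individual requests under the limit, whether that 
means fewer watches per session or batching the ops as sketched earlier.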


