[
https://issues.apache.org/jira/browse/ZOOKEEPER-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285361#comment-14285361
]
Chun Chen commented on ZOOKEEPER-706:
-------------------------------------
I think not only large number of watches might cause the error, but also too
many ops in a single call of ZooKeeper#multi. I'm seeing the following error
RM log:
{code}
2015-01-20 09:45:45,464 WARN org.apache.zookeeper.ClientCnxn: Session
0x24aeb334e8e000d for server c112/10.149.27.112:2181, unexpected error, closing
socket connection and attempting reconn
ect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2015-01-20 09:45:45,564 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:892)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:906)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:915)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeRMDTMasterKeyState(ZKRMStateStore.java:785)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeRMDTMasterKey(RMStateStore.java:678)
at
org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.storeNewMasterKey(RMDelegationTokenSecretManager.java:86)
at
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.updateCurrentKey(AbstractDelegationTokenSecretManager.java:233)
at
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.startThreads(AbstractDelegationTokenSecretManager.java:116)
at
org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.serviceStart(RMSecretManagerService.java:83)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:514)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:847)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$2.run(ResourceManager.java:888)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$2.run(ResourceManager.java:884)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:884)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:275)
at
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
2015-01-20 09:59:49,665 WARN org.apache.zookeeper.ClientCnxn: Session
0x34aeb34ba360012 for server c114/10.149.27.114:2181, unexpected error, closing
socket connection and attempting reconnect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2015-01-20 09:59:49,765 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:892)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:678)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:184)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:170)
at
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:769)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:842)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:837)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:191)
at
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:124)
at java.lang.Thread.run(Thread.java:745)
{code}
zk log
{code}
2015-01-20 09:45:46,377 [myid:2] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted
socket connection from /10.149.27.114:33517
2015-01-20 09:45:46,381 [myid:2] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client
attempting to renew session 0x24aeb334e8e000d at /10.149.27.114:33517
2015-01-20 09:45:46,381 [myid:2] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established
session 0x24aeb334e8e000d with negotiated timeout 10000 for client
/10.149.27.114:33517
2015-01-20 09:45:46,381 [myid:2] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth
packet /10.149.27.114:33517
2015-01-20 09:45:46,381 [myid:2] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success
/10.149.27.114:33517
2015-01-20 09:45:46,481 [myid:2] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception
causing close of session 0x24aeb334e8e000d due to java.io.IOException: Len
error 9271869
2015-01-20 09:45:46,481 [myid:2] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket
connection for client /10.149.27.114:33517 which had sessionid 0x24aeb334e8e000d
{code}
> large numbers of watches can cause session re-establishment to fail
> -------------------------------------------------------------------
>
> Key: ZOOKEEPER-706
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-706
> Project: ZooKeeper
> Issue Type: Bug
> Components: c client, java client
> Affects Versions: 3.1.2, 3.2.2, 3.3.0
> Reporter: Patrick Hunt
> Priority: Critical
> Fix For: 3.5.1
>
>
> If a client sets a large number of watches the "set watches" operation during
> session re-establishment can fail.
> for example:
> WARN [NIOServerCxn.Factory:22801:NIOServerCnxn@417] - Exception causing
> close of session 0xe727001201a4ee7c due to java.io.IOException: Len error
> 4348380
> in this case the client was a web monitoring app and had set both data and
> child watches on > 32k znodes.
> there are two issues I see here we need to fix:
> 1) handle this case properly (split up the set watches into multiple calls I
> guess...)
> 2) the session should have expired after the "timeout". however we seem to
> consider any message from the client as re-setting the expiration on the
> server side. Probably we should only consider messages from the client that
> are sent during an established session, otherwise we can see this situation
> where the session is not established however the session is not expired
> either. Perhaps we should create another JIRA for this particular issue.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)