From the zk fellas, try running your zk cluster at INFO level rather than
at
DEBUG. One supposition is that the shear amount of logging is causing
distress. They are still looking into your logs...
Thanks for your patience Zhenyu.
St.Ack
On Thu, Nov 19, 2009 at 8:57 AM, Zhenyu Zhong <[email protected]
wrote:
Yes, I believe so. And I don't know why the leader gave up.
For this case, the regionserver got TIMEOUT warning message at
2009-11-18
12:42:01,482
Best,
zhenyu
On Thu, Nov 19, 2009 at 11:43 AM, stack <[email protected]> wrote:
So, the time at which we see
2009-11-18 12:41:56,782 ERROR
org.apache.zookeeper.server.quorum.FollowerHandler: Unexpected
exception
causing shutdown while sock still open
... corresponds to the time at which the regionserver loses is session?
St.Ack
On Thu, Nov 19, 2009 at 8:38 AM, Zhenyu Zhong <[email protected]
wrote:
Here is the zookeeper leader log. It had an unexpected exception. I
don't
know why.
http://pastebin.com/m6594ccc
During that time, I don't think the system load is high. Memory usage
should
be normal.
Best,
zhenyu
On Thu, Nov 19, 2009 at 11:09 AM, stack <[email protected]> wrote:
What is in the other zk logs? Why did the leader go away? Memory
issue
or
something? Why'd it shutdown? Thanks for pastebin'ing this stuff.
St.Ack
On Thu, Nov 19, 2009 at 7:53 AM, Zhenyu Zhong <
[email protected]
wrote:
After digging more on the zookeeper log, I found that there
seemed
to
be
a
socketconnection timeout. Also some warnings indicate that
zookeeper
server
is not running. Probably I need to understand the mechanism of
zookeeper
first.
Please find the zookeeper log around that time.
http://pastebin.com/m37bb4ad1
Many thanks
zhenyu
On Thu, Nov 19, 2009 at 10:31 AM, Zhenyu Zhong <
[email protected]
wrote:
Stack,
I am very appreciated. Let me dig into the zookeeper log more.
FYI, I saw the sleeper complained about the delay for the first
time.
It
looks like there are different reasons for the RS
disconnections.
2009-11-19 08:08:32,708 DEBUG org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x42508716bfc0001 after 0ms
2009-11-19 08:08:37,595 DEBUG org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x42508716bfc0082 after 0ms
2009-11-19 08:08:46,073 DEBUG org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x42508716bfc0001 after 0ms
2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn:
Exception
closing session 0x42508716bfc0001 to
sun.nio.ch.selectionkeyi...@4f3a608f
java.io.IOException: TIMED OUT
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn:
Exception
closing session 0x42508716bfc0082 to
sun.nio.ch.selectionkeyi...@14c8875
java.io.IOException: TIMED OUT
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2009-11-19 08:09:39,880 WARN
org.apache.hadoop.hbase.util.Sleeper:
We
slept
53556ms, ten times longer than scheduled: 3000
2009-11-19 08:09:39,980 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at
java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
at java.lang.Thread.run(Thread.java:619)
2009-11-19 08:09:39,980 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got
ZooKeeper
event,
state: Disconnected, type: None, path: null
2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn:
Attempting
connection to server superpyxis0005/192.168.100.119:2181
2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn:
Priming
connection to java.nio.channels.SocketChannel[connected local=/
192.168.100.122:40659 remote=superpyxis0005/
192.168.100.119:2181
]
Best,
zhenyu
On Wed, Nov 18, 2009 at 6:52 PM, stack <[email protected]>
wrote:
Patrick Hunt, one of the zk lads, answered with the following:
According to the following snippet (I cut to just session
0x425035c48210002)
the client is getting pings for this session id regularly
every
6
seconds
or
so, then after 41,016 there's a 20 second gap (corresponds to
a
30
second
timeout being set during session creation) at which point we
disconnect
from
the server.
2009-11-18 12:41:34,788 DEBUG
org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x425035c48210002 after 0ms
2009-11-18 12:42:01,482 WARN
org.apache.zookeeper.ClientCnxn:
Exception
closing session 0x425035c48210002 to
sun.nio.ch.selectionkeyi...@421690ab
java.io.IOException: TIMED OUT
the client notifies hbase of the disco, and attempts to
reconnect
to
a
server
org.apache.hadoop.hbase.regionserver.HRegionServer: Got
ZooKeeper
event,
state: Disconnected, type: None, path: null
2009-11-18 12:42:01,782 INFO
org.apache.zookeeper.ClientCnxn:
Attempting
connection to server superpyxis0003/192.168.100.117:2181
but the server won't allow the session be be established for
some
reason
(is
that the right server:port that's it's trying to connect to?)
2009-11-18 12:42:02,182 INFO
org.apache.zookeeper.ClientCnxn:
Server
connection successful
2009-11-18 12:42:02,182 WARN
org.apache.zookeeper.ClientCnxn:
Exception
closing session 0x425035c48210002 to
sun.nio.ch.selectionkeyi...@5c07076b
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0
lim=4 cap=4]
The "read error rc = -1" means that the client read from the
socket
and
got
EOS. But it's not a session expiration afaict (and should not
be
at
this
point in time).
You need to look at the ZK server logs for this period in time
(for
this
server superpyxis0003:2181, and superpyxis0001:2181, the
client
tries
to
connect to both of these). There should be some indication
there
why
the
server is closing down the connection and not allowing the
session
to
be
re-established.
One thought - by default we limit clients to a max of 10
connections
from
a
particular host to the server. If you exceed this limit the zk
server
(it's
per server though, not per cluster, so perhaps this is not it)
will
not
allow any additional connections from the client host (you
can't
exceed
10
on a server). Could this be it? i.e. the user is running a
number
of
ZK
clients (say hbase processes) from the same host? Regardless,
get
the
server
log for this time period and it should be more clear.
Patrick
On Wed, Nov 18, 2009 at 11:18 AM, stack <[email protected]>
wrote:
And running stat on each of your zookeeper nodes tells you
what
about
averages for connect-times?
Its odd that there is a ping just 20 seconds beforehand and
you
have
set
the zk session timeout at 60 seconds (or ten minutes?). Let
me
ask
the
zk
lads about it.
If you look in the GC log for around this time, do you see
anything?
St.Ack
On Wed, Nov 18, 2009 at 7:32 AM, Zhenyu Zhong <
[email protected]
wrote:
After running more experiment, it seems that the
disconnection
is
not
related to the heavy load job, because the load average is
low,
the
diskio
is normal, memory Heap was not reached, also the virtual
memory
stats
shows
that no swappings. However, the disconnection still
happens.
It looks like this time it pauses for 20 seconds. No idea
why
regionserver
disconnected.
Any other suggestions please?
2009-11-18 12:41:00,956 DEBUG
org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x125035c495d02bb after 0ms
2009-11-18 12:41:08,074 DEBUG
org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x425035c48210002 after 0ms
2009-11-18 12:41:14,302 DEBUG
org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x125035c495d02bb after 0ms
2009-11-18 12:41:21,433 DEBUG
org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x425035c48210002 after 0ms
2009-11-18 12:41:27,656 DEBUG
org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x125035c495d02bb after 0ms
2009-11-18 12:41:34,788 DEBUG
org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x425035c48210002 after 0ms
2009-11-18 12:41:41,016 DEBUG
org.apache.zookeeper.ClientCnxn:
Got
ping
response for sessionid:0x125035c495d02bb after 1ms
2009-11-18 12:42:01,482 WARN
org.apache.zookeeper.ClientCnxn:
Exception
closing session 0x425035c48210002 to
sun.nio.ch.selectionkeyi...@421690ab
java.io.IOException: TIMED OUT
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2009-11-18 12:42:01,582 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got
ZooKeeper
event,
state: Disconnected, type: None, path: null
2009-11-18 12:42:01,782 INFO
org.apache.zookeeper.ClientCnxn:
Attempting
connection to server superpyxis0003/192.168.100.117:2181
2009-11-18 12:42:01,782 INFO
org.apache.zookeeper.ClientCnxn:
Priming
connection to java.nio.channels.SocketChannel[connected
local=/
192.168.100.132:46610 remote=superpyxis0003/
192.168.100.117:2181
]
2009-11-18 12:42:01,782 INFO
org.apache.zookeeper.ClientCnxn:
Server
connection successful
2009-11-18 12:42:01,782 WARN
org.apache.zookeeper.ClientCnxn:
Exception
closing session 0x425035c48210002 to
sun.nio.ch.selectionkeyi...@56dc6fac
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0
lim=4 cap=4]
at
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2009-11-18 12:42:01,782 WARN
org.apache.zookeeper.ClientCnxn:
Ignoring
exception during shutdown input
java.net.SocketException: Transport endpoint is not
connected
at sun.nio.ch.SocketChannelImpl.shutdown(Native
Method)
at
sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
at
sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
at
org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2009-11-18 12:42:01,782 WARN
org.apache.zookeeper.ClientCnxn:
Ignoring
exception during shutdown output
java.net.SocketException: Transport endpoint is not
connected
at sun.nio.ch.SocketChannelImpl.shutdown(Native
Method)
at
sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
at
sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
at
org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
2009-11-18 12:42:02,182 INFO
org.apache.zookeeper.ClientCnxn:
Attempting
connection to server superpyxis0001/192.168.100.115:2181
2009-11-18 12:42:02,182 INFO
org.apache.zookeeper.ClientCnxn:
Priming
connection to java.nio.channels.SocketChannel[connected
local=/
192.168.100.132:36197 remote=superpyxis0001/
192.168.100.115:2181
]
2009-11-18 12:42:02,182 INFO
org.apache.zookeeper.ClientCnxn:
Server
connection successful
2009-11-18 12:42:02,182 WARN
org.apache.zookeeper.ClientCnxn:
Exception
closing session 0x425035c48210002 to
sun.nio.ch.selectionkeyi...@5c07076b
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0
lim=4 cap=4]
at
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2009-11-18 12:42:02,182 WARN
org.apache.zookeeper.ClientCnxn:
Ignoring
exception during shutdown input
java.net.SocketException: Transport endpoint is not
connected
at sun.nio.ch.SocketChannelImpl.shutdown(Native