That would make sense.

The master log shows us that the times fit: the region server had some
kind of pause for 1 minute and lost its session. It still looks like
the 10-minute timeout config didn't get to the region servers.
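One thing worth checking on the timeout front: zookeeper.session.timeout is only what the client *requests*, and if I remember right the quorum negotiates it down to at most 20 * tickTime (40 seconds with the default tickTime of 2000 ms), which would match a 1-minute pause killing the session even with 10 minutes configured. A sketch of both knobs in hbase-site.xml (values here are illustrative, not a recommendation):

```xml
<!-- hbase-site.xml: the session timeout the region server asks for -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>600000</value> <!-- 10 minutes, in ms -->
</property>
<!-- the server side caps sessions at 20 * tickTime, so raise tickTime too
     (hbase.zookeeper.property.* is passed through to zoo.cfg) -->
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>30000</value>
</property>
```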

J-D

On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <[email protected]> wrote:
> Stack,
>
> I very much appreciate your great help and insightful comments.
>
> In my cluster, half of the machines have 2 disks (400 GB each) and the
> other half have 6 disks each.  Maybe we should run zookeeper
> on the machines with 2 disks and RS on machines with 6 disks?
>
> BTW, the 10 minutes zookeeper.session.timeout has been set during the
> experiment.
>
> thanks
> zhenyu
>
> On Thu, Nov 12, 2009 at 2:08 PM, stack <[email protected]> wrote:
>
>> On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <[email protected]
>> >wrote:
>>
>>
>> > Though I experienced 2 regionserver disconnections this morning, it looks
>> > better from the regionserver log. (Please see the following log.)
>> >
>> >
>> > http://pastebin.com/m496dbfae
>> >
>> >
>> > I checked diskIO for one of the regionserver(192.168.100.116) during the
>> > time it disconnected.
>> >
>> > Time: 03:04:58 AM
>> > Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>> > sda             105.31      5458.83     19503.64 9043873239 32312473676
>> > sda1              2.90         6.64        99.25   10993934  164433464
>> > sda2              1.72        23.76        30.25   39365817   50124008
>> > sda3              0.30         0.38         3.58     624930    5923000
>> > sda4            100.39      5428.06     19370.56 8992888270 32091993204
>> >
>> >
>> Is this high for you?  20k blocks/second would seem to be high, but it's
>> one disk only and it's not being shared by zk anymore, so it shouldn't
>> matter?
>>
>>
>> > I also checked the zookeeper quorum server that the regionserver tried to
>> > connect to, according to the log. However, I don't see 192.168.100.116 in
>> > that quorum server's client list.
>> > Would that be a problem?
>> >
>> >
>> Is that because the ephemeral node for the regionserver had evaporated?
>> Lost its lease w/ zk by the time you went to look?
>>
>>
>>
>>
>> > Thu Nov 12 15:04:52 UTC 2009
>> > Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
>> > Clients:
>> >  /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
>> >  /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
>> >  /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
>> >  /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
>> >  /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
>> >  /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
>> >  /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
>> >  /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
>> >
>> > Latency min/avg/max: 0/17/6333
>> > Received: 47111
>> > Sent: 0
>> > Outstanding: 0
>> > Zxid: 0x77000083f4
>> > Mode: leader
>> > Node count: 23
>> >
>>
>>
>> That 6-second maximum latency is pretty bad, but should be well within the
>> zk session timeout.
>>
>> So, problem is likely on the zk client side of the session; i.e. in the RS.
>> You could enable GC logging as J-D suggested to see if you have any big
>> pauses, pauses > zk session timeout.
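For reference, a minimal sketch of the GC logging being suggested here, for conf/hbase-env.sh (a JDK 6-era flag set; the log path is an assumption, adjust to taste):

```shell
# Log every collection with timestamps so any pause longer than the
# zk session timeout shows up plainly in the gc log.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc-hbase.log"
```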
>>
>> When the RS went down, it didn't look too heavily loaded:
>>
>>
>>   1. 2009-11-12 15:04:52,830 INFO
>>   org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>>   request=1.5166667, regions=322, stores=657, storefiles=631,
>>   storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, maxHeap=4079,
>>   blockCacheSize=658110960, blockCacheFree=197395984, blockCacheCount=9903,
>>   blockCacheHitRatio=99
>>
>> It's serving a few reads?  The number of store files seems fine.  Not too
>> much memory used.
>>
>> Looking at the logs, I see the Lease Still Held exception.  This happens
>> when the RS does its regular report to the master but the master thinks the
>> RS has since restarted.  It'll think this probably because it noticed that
>> the RS's znode in zk had gone away and it considered the RS dead.
>>
>> Looking too at your logs, I see this gap in the zk pinging:
>>
>>
>>   1. 2009-11-12 15:03:39,325 DEBUG org.apache.zookeeper.ClientCnxn: Got
>>   ping response for sessionid:0x224e55436ad0004 after 0ms
>>   2. 2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn: Got
>>   ping response for sessionid:0x24e55436a0007d after 0ms
>>
>> Whereas in the lines above it, it's reporting about every ten seconds; here
>> there is a big gap.
>>
>> Do you have ganglia or something that will let you look more into what was
>> happening on this machine around this time?  Is the machine OK?  It looks
>> lightly loaded and you have your cluster nicely laid out.  Something odd is
>> going on.  What about things like the write speed to disk?  In the past,
>> strange issues have been explained by an incorrectly set BIOS that made
>> disks run at 1/100th of their proper speed.
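A quick way to sanity-check raw sequential write speed on the suspect machine is a synchronous dd run (a sketch; the target path is an assumption, point it at the data-disk mount rather than /tmp):

```shell
# Write 64 MB and fdatasync at the end so the reported throughput reflects
# the disk, not the page cache; GNU dd prints the stats to stderr.
out=$(dd if=/dev/zero of=/tmp/hbase_disk_test bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1)
echo "$out"
rm -f /tmp/hbase_disk_test
```

A healthy SATA disk of that era should show tens of MB/s sequential; numbers down in the hundreds of KB/s would point at the mis-set BIOS problem mentioned above.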
>>
>> St.Ack
>>
>>
>>
>>
>>
>>
>>
>> >
>> > Best,
>> > zhenyu
>> >
>> >
>> > On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <[email protected]
>> > >wrote:
>> >
>> > > Stack
>> > >
>> > > I very much appreciate your comments.
>> > > I will use the zookeeper monitoring script on my cluster and let it run
>> > > overnight to see the result.
>> > > I will follow up on the thread when I get anything.
>> > >
>> > > thanks
>> > > zhenyu
>> > >
>> > >
>> > > On Wed, Nov 11, 2009 at 3:52 PM, stack <[email protected]> wrote:
>> > >
>> > >> I see these in your log too:
>> > >>
>> > >>
>> > >>   1. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn:
>> Got
>> > >>   ping response for sessionid:0x424dfd908c50009 after 4544ms
>> > >>   2. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn:
>> Got
>> > >>   ping response for sessionid:0x24dfd90c810002 after 4548ms
>> > >>   3. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn:
>> Got
>> > >>   ping response for sessionid:0x424dfd908c50009 after 9030ms
>> > >>   4. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn:
>> Got
>> > >>   ping response for sessionid:0x24dfd90c810002 after 9030ms
>> > >>   5. 2009-11-11 04:27:48,830 DEBUG org.apache.zookeeper.ClientCnxn:
>> Got
>> > >>   ping response for sessionid:0x424dfd908c50009 after 0ms
>> > >>   6. 2009-11-11 04:27:48,830 DEBUG org.apache.zookeeper.ClientCnxn:
>> Got
>> > >>   ping response for sessionid:0x24dfd90c810002 after 0ms
>> > >>
>> > >> Which would indicate lag talking to zk too.
>> > >>
>> > >> Appending to the HLog also seems to get steadily worse.
>> > >>
>> > >> I also see these:
>> > >>
>> > >>
>> > >>   1. 2009-11-11 04:27:01,233 WARN org.apache.zookeeper.ClientCnxn:
>> > >>    Exception closing session 0x24dfd90c810002 to
>> > >>    sun.nio.ch.selectionkeyi...@5e16a7e3
>> > >>   2. java.io.IOException: Read error rc = -1
>> > >>    java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
>> > >>    3.         at
>> > >>
>>  org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>> > >>   4.         at
>> > >>    org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>> > >>
>> > >>
>> > >> Were all 5 quorum members participating?  Check by doing "echo stat|nc
>> > >> QUORUM_HOST port" for each member of quorum.... the below might help:
>> > >>
>> > >> members="1 2 3 4 5"
>> > >> while true; do date; for i in $members; do echo $i; echo stat | nc $i 2181; done; sleep 10; done
>> > >>
>> > >> We need to improve our zk_dump.  It just reports quorum members, not
>> > >> whether they are actually participating.  The above will tell you if
>> > >> each one is dead or alive.
>> > >>
>> > >> Please report back after checking the above and following Jean-Daniel's
>> > >> prescription.
>> > >>
>> > >> Thanks,
>> > >> St.Ack
>> > >>
>> > >>
>> > >>
>> > >> On Wed, Nov 11, 2009 at 12:29 PM, Zhenyu Zhong <
>> [email protected]
>> > >> >wrote:
>> > >>
>> > >> > Thanks a lot for the help.
>> > >> >
>> > >> > I understand that the change of disk configurations should be done
>> > >> > progressively such that HDFS can balance and recover from a disk
>> going
>> > >> > away.
>> > >> >
>> > >> > The data we are trying to import into HBase is about 200 million
>> > >> > rows, each row about hundreds of bytes. Basically, we read in a text
>> > >> > file from HDFS; then, for each row in the text file, we fetch data
>> > >> > from the HBase table, compare the column value, and update the latest
>> > >> > value back into HBase. It was suggested that I use checkAndPut.
>> > >> > Anyway, this is the common use case on my side.
>> > >> >
>> > >> >
>> > >> > Many thanks
>> > >> > zhenyu
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Wed, Nov 11, 2009 at 3:03 PM, Jean-Daniel Cryans <
>> > >> [email protected]
>> > >> > >wrote:
>> > >> >
>> > >> > > 6 disks, 19 machines and you see those kinds of issues? I'm quite
>> > >> > > surprised, and I want to learn more about what you are trying to do.
>> > >> > >
>> > >> > > But first, if hbase is managing ZK for you then you can change the
>> > >> > > value of hbase.zookeeper.property.dataDir so that it points to its
>> > >> > > own mount. Make sure you also change dfs.data.dir in
>> > >> > > conf/hdfs-site.xml (hadoop stuff) so that it stops pointing there
>> > >> > > (you should have 6 values there at the moment if you configured it
>> > >> > > correctly). Doing this could result in data loss if you do it all at
>> > >> > > once, so make the change progressively. This also means that you may
>> > >> > > have to keep a separate HDFS configuration for those particular
>> > >> > > nodes. Please ask more questions if it's not very clear.
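As a sketch of the change described above (mount points are assumptions, not real paths from this cluster): give ZK its own spindle in hbase-site.xml and drop that spindle from dfs.data.dir in conf/hdfs-site.xml on the same nodes:

```xml
<!-- hbase-site.xml on the quorum nodes: a dedicated disk for ZK -->
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/mnt/disk6/zookeeper</value>
</property>

<!-- hdfs-site.xml on those same nodes: /mnt/disk6 removed from the list -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/dfs,/mnt/disk2/dfs,/mnt/disk3/dfs,/mnt/disk4/dfs,/mnt/disk5/dfs</value>
</property>
```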
>> > >> > >
>> > >> > > WRT dedicating nodes to Zookeeper, it makes sense only if you have
>> > >> > > a lot of nodes, but in your case, unless there's also CPU
>> > >> > > contention, you're good with only a dedicated disk.
>> > >> > >
>> > >> > > Now, for what you are trying to do, is this a massive import that
>> > >> > > goes over many hours? How big is each row? I'd love to get a lot of
>> > >> > > details to help you optimize your cluster for your particular
>> > >> > > problem.
>> > >> > >
>> > >> > > Thx,
>> > >> > >
>> > >> > > J-D
>> > >> > >
>> > >> > > On Wed, Nov 11, 2009 at 11:51 AM, Zhenyu Zhong <
>> > >> [email protected]>
>> > >> > > wrote:
>> > >> > > > J-D,
>> > >> > > >
>> > >> > > > Thanks a lot for the insightful comments.
>> > >> > > > We are running 5 zookeeper quorum members now. On these 5 quorum
>> > >> > > > servers, we also run a datanode, tasktracker and regionserver.
>> > >> > > > Now we have 6 disks per machine.
>> > >> > > > If it is disk IO saturation, is there any way to configure
>> > >> > > > zookeeper to write to a separate disk which is not the same one
>> > >> > > > used by HDFS? Or shall we run the zookeeper quorum on nodes other
>> > >> > > > than HDFS?
>> > >> > > >
>> > >> > > > Are HLogs generated by Regionservers? Would it alleviate the issue
>> > >> > > > if we use the first 14 nodes for regionservers and the remaining 5
>> > >> > > > for zookeeper? (In that sense, the nodes running the zookeeper
>> > >> > > > quorum won't run regionservers.)
>> > >> > > >
>> > >> > > > I very much appreciate your comments.
>> > >> > > >
>> > >> > > > Best,
>> > >> > > > zhenyu
>> > >> > > >
>> > >> > > >
>> > >> > > >
>> > >> > > > On Wed, Nov 11, 2009 at 2:30 PM, Jean-Daniel Cryans <
>> > >> > [email protected]
>> > >> > > >wrote:
>> > >> > > >
>> > >> > > >> The problem seems to be that the IO is saturated. When you see
>> > >> > > >> this:
>> > >> > > >>
>> > >> > > >> #
>> > >> > > >> 2009-11-11 04:28:10,308 WARN
>> > >> > > >> org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 9
>> > on
>> > >> > > >> 60021 took 6819ms appending an edit to hlog; editcount=370
>> > >> > > >> #
>> > >> > > >> 2009-11-11 04:28:10,308 WARN
>> > >> > > >> org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 2
>> > on
>> > >> > > >> 60021 took 4412ms appending an edit to hlog; editcount=371
>> > >> > > >>
>> > >> > > >> It's a sign that HDFS is slow. I guess that the zookeeper quorum
>> > >> > > >> members are on those same nodes, using the same disks? If so,
>> > >> > > >> there's the problem, because Zookeeper needs to be able to commit
>> > >> > > >> transactions very fast, and if your disks are already hogged then
>> > >> > > >> they have to wait and you can get disconnections. Do you still
>> > >> > > >> have 19 quorum members? How many disks do you have per machine?
>> > >> > > >>
>> > >> > > >> Thx
>> > >> > > >>
>> > >> > > >> J-D
>> > >> > > >>
>> > >> > > >> On Wed, Nov 11, 2009 at 11:22 AM, Zhenyu Zhong <
>> > >> > [email protected]
>> > >> > > >
>> > >> > > >> wrote:
>> > >> > > >> > Please take a look at the following regionserver log.
>> > >> > > >> > The error started to come out from 4:24am UTC to 4:29am UTC.
>> > >> > > >> >
>> > >> > > >> > http://pastebin.com/m3717d064
>> > >> > > >> >
>> > >> > > >> > Thanks in advance.
>> > >> > > >> >
>> > >> > > >> > Best,
>> > >> > > >> > zhenyu
>> > >> > > >> >
>> > >> > > >> >
>> > >> > > >> > On Wed, Nov 11, 2009 at 2:01 PM, Jean-Daniel Cryans <
>> > >> > > [email protected]
>> > >> > > >> >wrote:
>> > >> > > >> >
>> > >> > > >> >> Could you paste a bigger log trace (interested to see at
>> > >> > > >> >> least all 300 lines before that) in a pastebin?
>> > >> > > >> >>
>> > >> > > >> >> Thx,
>> > >> > > >> >>
>> > >> > > >> >> J-D
>> > >> > > >> >>
>> > >> > > >> >> On Wed, Nov 11, 2009 at 10:56 AM, Zhenyu Zhong <
>> > >> > > [email protected]
>> > >> > > >> >
>> > >> > > >> >> wrote:
>> > >> > > >> >> > J-D,
>> > >> > > >> >> >
>> > >> > > >> >> > Thanks for the reply.
>> > >> > > >> >> >
>> > >> > > >> >> > The problem I have is that one or more of the regionservers
>> > >> > > >> >> > disconnect from the cluster.
>> > >> > > >> >> >
>> > >> > > >> >> > I checked the resource usage, including memory heap usage
>> > >> > > >> >> > for the regionserver, and found nothing suspicious. The
>> > >> > > >> >> > regionserver heap usage is way below the maxHeap defined in
>> > >> > > >> >> > the jvm.
>> > >> > > >> >> >
>> > >> > > >> >> > So I checked the zookeeper log, and I found the following
>> > >> > > >> >> > BufferUnderflowException; eventually it leads to
>> > >> > > >> >> > SocketException: Transport endpoint is not connected.
>> > >> > > >> >> >
>> > >> > > >> >> >
>> > >> > > >> >> > I also checked the regionserver log. It looks like the
>> > >> > > >> >> > regionserver tried to connect to another zookeeper quorum
>> > >> > > >> >> > server. It connects successfully, but then it complained
>> > >> > > >> >> > about SocketException: transport endpoint is not connected.
>> > >> > > >> >> >
>> > >> > > >> >> > So both the zookeeper log and the regionserver log told me
>> > >> > > >> >> > that the transport endpoint is not connected. I don't know
>> > >> > > >> >> > why.
>> > >> > > >> >> >
>> > >> > > >> >> > I would very much appreciate it if you could point me in the
>> > >> > > >> >> > right direction.
>> > >> > > >> >> >
>> > >> > > >> >> > Best,
>> > >> > > >> >> >
>> > >> > > >> >> >
>> > >> > > >> >> > Regionserver log:
>> > >> > > >> >> > 2009-11-11 04:24:53,718 INFO
>> > org.apache.zookeeper.ClientCnxn:
>> > >> > > >> Attempting
>> > >> > > >> >> > connection to server superpyxis0018/192.168.100.132:2181
>> > >> > > >> >> > 2009-11-11 04:24:53,718 INFO
>> > org.apache.zookeeper.ClientCnxn:
>> > >> > > Priming
>> > >> > > >> >> > connection to java.nio.channels.SocketChannel[connected
>> > >> local=/
>> > >> > > >> >> > 192.168.100.131:43065 remote=superpyxis0018/
>> > >> 192.168.100.132:2181
>> > >> > ]
>> > >> > > >> >> > 2009-11-11 04:24:53,718 INFO
>> > org.apache.zookeeper.ClientCnxn:
>> > >> > > Server
>> > >> > > >> >> > connection successful
>> > >> > > >> >> > 2009-11-11 04:24:53,768 INFO
>> > >> > > >> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got
>> > >> ZooKeeper
>> > >> > > >> event,
>> > >> > > >> >> > state: Disconnected, type: None, path: null
>> > >> > > >> >> > 2009-11-11 04:24:54,118 INFO
>> > org.apache.zookeeper.ClientCnxn:
>> > >> > > >> Attempting
>> > >> > > >> >> > connection to server superpyxis0017/192.168.100.131:2181
>> > >> > > >> >> > 2009-11-11 04:24:54,118 INFO
>> > org.apache.zookeeper.ClientCnxn:
>> > >> > > Priming
>> > >> > > >> >> > connection to java.nio.channels.SocketChannel[connected
>> > >> local=/
>> > >> > > >> >> > 192.168.100.131:46903 remote=superpyxis0017/
>> > >> 192.168.100.131:2181
>> > >> > ]
>> > >> > > >> >> > 2009-11-11 04:24:54,118 INFO
>> > org.apache.zookeeper.ClientCnxn:
>> > >> > > Server
>> > >> > > >> >> > connection successful
>> > >> > > >> >> > 2009-11-11 04:24:54,118 WARN
>> > org.apache.zookeeper.ClientCnxn:
>> > >> > > >> Exception
>> > >> > > >> >> > closing session 0x24dfd90c810002 to
>> > >> > > >> sun.nio.ch.selectionkeyi...@2dd78ccf
>> > >> > > >> >> > java.io.IOException: Read error rc = -1
>> > >> > > >> java.nio.DirectByteBuffer[pos=0
>> > >> > > >> >> > lim=4 cap=4]
>> > >> > > >> >> >        at
>> > >> > > >> >> >
>> > >> > >
>> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>> > >> > > >> >> >        at
>> > >> > > >> >> >
>> > >> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>> > >> > > >> >> > 2009-11-11 04:24:54,118 WARN
>> > org.apache.zookeeper.ClientCnxn:
>> > >> > > Ignoring
>> > >> > > >> >> > exception during shutdown input
>> > >> > > >> >> > java.net.SocketException: Transport endpoint is not
>> > connected
>> > >> > > >> >> >        at sun.nio.ch.SocketChannelImpl.shutdown(Native
>> > Method)
>> > >> > > >> >> >        at
>> > >> > > >> >> >
>> > >> > >
>> > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>> > >> > > >> >> >        at
>> > >> > > >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>> > >> > > >> >> >        at
>> > >> > > >> >> >
>> > >> > > >>
>> > >> >
>> > org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
>> > >> > > >> >> >        at
>> > >> > > >> >> >
>> > >> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
>> > >> > > >> >> >
>> > >> > > >> >> >
>> > >> > > >> >> >
>> > >> > > >> >> >
>> > >> > > >> >> >
>> > >> > > >> >> > On Wed, Nov 11, 2009 at 1:40 PM, Jean-Daniel Cryans <
>> > >> > > >> [email protected]
>> > >> > > >> >> >wrote:
>> > >> > > >> >> >
>> > >> > > >> >> >> I'm really not sure exactly what the situation is with the
>> > >> > > >> >> >> java.nio.BufferUnderflowException, but what do you mean by
>> > >> > > >> >> >> "regionserver disconnection"? Could you give more details?
>> > >> > > >> >> >>
>> > >> > > >> >> >> Thx,
>> > >> > > >> >> >>
>> > >> > > >> >> >> J-D
>> > >> > > >> >> >>
>> > >> > > >> >> >> On Wed, Nov 11, 2009 at 10:34 AM, Zhenyu Zhong <
>> > >> > > >> [email protected]
>> > >> > > >> >> >
>> > >> > > >> >> >> wrote:
>> > >> > > >> >> >> > I searched around and found HBASE-1645, "HBase could be a
>> > >> > > >> >> >> > little more dainty when zk is pulled out from under it".
>> > >> > > >> >> >> >
>> > >> > > >> >> >> > I am wondering whether my problem is related to
>> > >> > > >> >> >> > HBASE-1645, or probably has nothing to do with it?
>> > >> > > >> >> >> >
>> > >> > > >> >> >> > thanks
>> > >> > > >> >> >> >
>> > >> > > >> >> >> >
>> > >> > > >> >> >> > On Wed, Nov 11, 2009 at 12:02 PM, Zhenyu Zhong <
>> > >> > > >> >> [email protected]
>> > >> > > >> >> >> >wrote:
>> > >> > > >> >> >> >
>> > >> > > >> >> >> >> Hi,
>> > >> > > >> >> >> >>
>> > >> > > >> >> >> >> I have been monitoring the system resources such as
>> > >> > > >> >> >> >> memory heap usage for the Regionserver, load average,
>> > >> > > >> >> >> >> network bandwidth, etc.
>> > >> > > >> >> >> >>
>> > >> > > >> >> >> >> I experienced a disconnection last night. When I checked
>> > >> > > >> >> >> >> the memory heap for the Regionserver, it was 2336, which
>> > >> > > >> >> >> >> is way below the max heap of 4096, so I don't think it is
>> > >> > > >> >> >> >> a Regionserver problem. The requests that the
>> > >> > > >> >> >> >> regionserver got were about 126, which doesn't seem high.
>> > >> > > >> >> >> >> The load average is kind of high, >100 (but I saw
>> > >> > > >> >> >> >> disconnections while the load average was around 30). The
>> > >> > > >> >> >> >> bandwidth is not a problem.
>> > >> > > >> >> >> >>
>> > >> > > >> >> >> >> However, I found error messages from the zookeeper.
>> > >> > > >> >> >> >> Please see the following error messages. There are some
>> > >> > > >> >> >> >> BufferUnderflowException errors, IOException errors, etc.
>> > >> > > >> >> >> >> May I ask what might be the causes that generate these
>> > >> > > >> >> >> >> error messages?
>> > >> > > >> >> >> >>
>> > >> > > >> >> >> >> Many thanks
>> > >> > > >> >> >> >> zhenyu
>> > >> > > >> >> >> >>
>> > >> > > >> >> >> >>
>> > >> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN
>> > >> > > org.apache.zookeeper.server.Request:
>> > >> > > >> >> >> Ignoring
>> > >> > > >> >> >> >> exception during toString
>> > >> > > >> >> >> >> java.nio.BufferUnderflowException
>> > >> > > >> >> >> >>         at
>> > >> > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127)
>> > >> > > >> >> >> >>         at
>> java.nio.ByteBuffer.get(ByteBuffer.java:675)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >>
>> > >> org.apache.zookeeper.server.Request.toString(Request.java:199)
>> > >> > > >> >> >> >>         at java.lang.String.valueOf(String.java:2827)
>> > >> > > >> >> >> >>         at
>> > >> > > java.lang.StringBuilder.append(StringBuilder.java:115)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.zookeeper.server.quorum.CommitProcessor.processRequest(CommitProcessor.java:167)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.zookeeper.server.quorum.FollowerRequestProcessor.run(FollowerRequestProcessor.java:68)
>> > >> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN
>> > >> > > >> >> org.apache.zookeeper.server.NIOServerCnxn:
>> > >> > > >> >> >> >> Exception causing close of session 0x424dfd908c50009
>> > >> > > >> >> >> >> due to java.io.IOException: Read error
>> > >> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG
>> > >> > > >> >> org.apache.zookeeper.server.NIOServerCnxn:
>> > >> > > >> >> >> >> IOException stack trace
>> > >> > > >> >> >> >> java.io.IOException: Read error
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >>
>> > >> > >
>> > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:372)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:221)
>> > >> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG
>> > >> > > >> >> >> >> org.apache.zookeeper.server.quorum.CommitProcessor:
>> > >> > Processing
>> > >> > > >> >> request::
>> > >> > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches
>> > >> > > >> cxid:0xfffffffffffffff8
>> > >> > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a
>> > >> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN
>> > >> > > org.apache.zookeeper.server.Request:
>> > >> > > >> >> >> Ignoring
>> > >> > > >> >> >> >> exception during toString
>> > >> > > >> >> >> >> java.nio.BufferUnderflowException
>> > >> > > >> >> >> >>         at
>> > >> > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127)
>> > >> > > >> >> >> >>         at
>> java.nio.ByteBuffer.get(ByteBuffer.java:675)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >>
>> > >> org.apache.zookeeper.server.Request.toString(Request.java:199)
>> > >> > > >> >> >> >>         at java.lang.String.valueOf(String.java:2827)
>> > >> > > >> >> >> >>         at
>> > >> > > java.lang.StringBuilder.append(StringBuilder.java:115)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>> > >> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG
>> > >> > > >> >> >> >> org.apache.zookeeper.server.FinalRequestProcessor:
>> > >> Processing
>> > >> > > >> >> request::
>> > >> > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches
>> > >> > > >> cxid:0xfffffffffffffff8
>> > >> > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a
>> > >> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN
>> > >> > > org.apache.zookeeper.server.Request:
>> > >> > > >> >> >> Ignoring
>> > >> > > >> >> >> >> exception during toString
>> > >> > > >> >> >> >> java.nio.BufferUnderflowException
>> > >> > > >> >> >> >>         at
>> > >> > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127)
>> > >> > > >> >> >> >>         at
>> java.nio.ByteBuffer.get(ByteBuffer.java:675)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >>
>> > >> org.apache.zookeeper.server.Request.toString(Request.java:199)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >>
>> > >> org.apache.log4j.or.DefaultRenderer.doRender(DefaultRenderer.java:36)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > org.apache.log4j.or.RendererMap.findAndRender(RendererMap.java:80)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.log4j.spi.LoggingEvent.getRenderedMessage(LoggingEvent.java:362)
>> > >> > > >> >> >> >> at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.log4j.helpers.PatternParser$BasicPatternConverter.convert(PatternParser.java:403)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >>
>> org.apache.log4j.PatternLayout.format(PatternLayout.java:502)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:302)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:359)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >>
>> > >> org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >>
>> > >> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>> > >> > > >> >> >> >>         at
>> > >> > > >> org.apache.log4j.Category.callAppenders(Category.java:206)
>> > >> > > >> >> >> >>         at
>> > >> > > org.apache.log4j.Category.forcedLog(Category.java:391)
>> > >> > > >> >> >> >>         at
>> > >> org.apache.log4j.Category.debug(Category.java:260)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:147)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>> > >> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN
>> > >> > > org.apache.zookeeper.server.Request:
>> > >> > > >> >> >> Ignoring
>> > >> > > >> >> >> >> exception during toString
>> > >> > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches
>> > >> > > >> cxid:0xfffffffffffffff8
>> > >> > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a
>> > >> > > >> >> >> >> 2009-11-11 04:29:00,810 INFO
>> > >> > > >> >> org.apache.zookeeper.server.NIOServerCnxn:
>> > >> > > >> >> >> >> closing session:0x424dfd908c50009 NIOServerCnxn:
>> > >> > > >> >> >> >> java.nio.channels.SocketChannel[connected local=/
>> > >> > > >> >> 192.168.100.131:2181
>> > >> > > >> >> >> remote=/
>> > >> > > >> >> >> >> 192.168.100.131:35500]
>> > >> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG
>> > >> > > >> >> org.apache.zookeeper.server.NIOServerCnxn:
>> > >> > > >> >> >> >> ignoring exception during input shutdown
>> > >> > > >> >> >> >> java.net.SocketException: Transport endpoint is not
>> > >> connected
>> > >> > > >> >> >> >>         at
>> sun.nio.ch.SocketChannelImpl.shutdown(Native
>> > >> > Method)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >>
>> > >> > >
>> > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >>
>> > >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >>
>> > >> > >
>> > >>
>> org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:848)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >>
>> > >> > >
>> > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:498)
>> > >> > > >> >> >> >>         at
>> > >> > > >> >> >> >>
>> > >> > > >> >> >>
>> > >> > > >> >>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> >
>> org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:221)
>> > >> > > >> >> >> >>
>> > >> > > >> >> >> >
>> > >> > > >> >> >>
>> > >> > > >> >> >
>> > >> > > >> >>
>> > >> > > >> >
>> > >> > > >>
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
