AFAIK there's no requirement on the kernel version; 2.6.27 is the only
one that had that restriction, and it was gone in the next version.
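
(If a box really is stuck on 2.6.27, the per-user epoll limits that kernel
introduced can usually be raised via sysctl instead of changing kernels; a
rough sketch, assuming the fs.epoll knobs are present on that build:

  # check the current per-user limit on epoll instances
  cat /proc/sys/fs/epoll/max_user_instances
  # raise it on the running system
  sysctl -w fs.epoll.max_user_instances=4096
  # persist it across reboots
  echo "fs.epoll.max_user_instances = 4096" >> /etc/sysctl.conf

4096 is only an example value; size it to how many selectors your
DataNode/RegionServer JVMs actually open.)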

We do have a requirement for java6 per
http://hadoop.apache.org/hbase/docs/r0.20.1/api/overview-summary.html#overview_description

J-D

On Fri, Nov 13, 2009 at 1:58 PM, Zhenyu Zhong <[email protected]> wrote:
> Do HBase/Hadoop have to run on kernel 2.6.27 or JDK 1.6? It seems that only
> that kernel provides the epoll resource configuration.
>
> This is the first time I have seen this error; it showed up when I started
> using machines with fewer resources for zookeeper.
>
> Probably, I should change it back.
>
>
> zhenyu
>
>
> On Fri, Nov 13, 2009 at 4:37 PM, Zhenyu Zhong <[email protected]> wrote:
>
>> The file-descriptor limits are set high: fs.file-max = 1578334, and in
>> limits.conf the nofile value is set to 32768.
>> So these are way higher than the number of descriptors the running
>> processes have open.
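>>
>> (To double-check from the process side, the commands I'd use -- <pid> below
>> is just a placeholder for whichever daemon is complaining:
>>
>>   ulimit -n                      # nofile limit the daemon's user inherits
>>   ls /proc/<pid>/fd | wc -l      # descriptors that process has open now
>>   lsof -p <pid> | wc -l          # same count, via lsof
>>
>> The per-process nofile limit is usually what trips "Too many open files",
>> not fs.file-max.)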
>>
>> thanks
>> zhenyu
>>
>>
>>
>> On Fri, Nov 13, 2009 at 4:33 PM, Stack <[email protected]> wrote:
>>
>>> You upped the ulimit file descriptors as per the getting started doc?
>>>
>>>
>>>
>>> On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <[email protected]>
>>> wrote:
>>>
>>>> Thanks a lot.
>>>>
>>>>
>>>> The bad news is that my kernel is still 2.6.26,
>>>> but it was not a problem before.
>>>>
>>>> Very strange.
>>>>
>>>> zhenyu
>>>>
>>>> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>
>>>>> Looks like
>>>>>
>>>>> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
>>>>>
>>>>> J-D
>>>>>
>>>>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> After I re-organized the cluster, the experiment ran into problems faster
>>>>>> than before.
>>>>>>
>>>>>> Basically, the change was to use the machines with fewer resources as
>>>>>> zookeeper quorum members and the machines with more resources as
>>>>>> regionservers.
>>>>>>
>>>>>> From the log, I still see the pause around 1 minute.
>>>>>> I enabled GC logging; please see
>>>>>>
>>>>>> http://pastebin.com/m1d4ce0f1
>>>>>>
>>>>>> for details.
>>>>>> In general I don't see many pauses in the GC.
>>>>>>
>>>>>> What is more interesting, 1 hour after the 1st regionserver disconnected,
>>>>>> the master log complained about too many open files. This didn't happen
>>>>>> before.
>>>>>> I checked the OS setup as well as limits.conf, and I also checked the
>>>>>> running processes. The total number of open files doesn't reach the
>>>>>> limit, so I am a bit confused.
>>>>>>
>>>>>>
>>>>>> Please see the following master log.
>>>>>>
>>>>>> 2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 4658 row(s) of meta region {server: 192.168.100.128:60021, regionname: .META.,,1, startKey: <>} complete
>>>>>> 2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>>>>> 2009-11-13 20:06:07,677 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 3ms
>>>>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.100.123:50010
>>>>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2808245019291145247_5478039
>>>>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
>>>>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1074853606841896259_5478048
>>>>>> 2009-11-13 20:06:10,334 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 1ms
>>>>>> 2009-11-13 20:06:21,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 0ms
>>>>>> 2009-11-13 20:06:23,674 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 0ms
>>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.100.123:50010
>>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6642544517082142289_5478063
>>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2057511041109796090_5478063
>>>>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_8219260302213892894_5478064
>>>>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1669205542853067709_5478235
>>>>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_9128897691346270351_5478237
>>>>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1116845144864123018_5478240
>>>>>> 2009-11-13 20:06:34,372 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 0ms
>>>>>> 2009-11-13 20:06:37,034 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 0ms
>>>>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Too many open files
>>>>>>       at sun.nio.ch.IOUtil.initPipe(Native Method)
>>>>>>       at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
>>>>>>       at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
>>>>>>       at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
>>>>>>       at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
>>>>>>       at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>>>>>       at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>>>>>       at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>>>>>       at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>>>>>       at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>>>>       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
>>>>>>
>>>>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8148813491785406356_5478475 bad datanode[0] 192.168.100.123:50010
>>>>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8148813491785406356_5478475 in pipeline 192.168.100.123:50010, 192.168.100.134:50010, 192.168.100.122:50010: bad datanode 192.168.100.123:50010
>>>>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2119727700857186236_5478498
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>
>>>>>>> Will do.
>>>>>>>
>>>>>>> thanks
>>>>>>> zhenyu
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <[email protected]> wrote:
>>>>>>>
>>>>>>>> Enable GC logging too on this next run (see hbase-env.sh). Let's try
>>>>>>>> and get to the bottom of what's going on.
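>>>>>>>>
>>>>>>>> (Roughly, that means uncommenting/adding something like this in
>>>>>>>> conf/hbase-env.sh -- the exact log path is only an example:
>>>>>>>>
>>>>>>>>   export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
>>>>>>>>       -XX:+PrintGCTimeStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"
>>>>>>>>
>>>>>>>> and restarting the region servers so the flags take effect.)
>>>>>>>>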
>>>>>>>> Thanks,
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>
>>>>>>>>> I can switch the boxes that run zookeeper with the ones that run
>>>>>>>>> regionservers.
>>>>>>>>> I will see the results later.
>>>>>>>>>
>>>>>>>>> FYI, the node does have the 10-minute zookeeper.session.timeout value
>>>>>>>>> in place.
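>>>>>>>>>
>>>>>>>>> (That is, roughly this in hbase-site.xml, value in milliseconds --
>>>>>>>>> assuming that's the property name in this HBase version:
>>>>>>>>>
>>>>>>>>>   <property>
>>>>>>>>>     <name>zookeeper.session.timeout</name>
>>>>>>>>>     <value>600000</value>
>>>>>>>>>   </property>
>>>>>>>>>
>>>>>>>>> and the same file is pushed to every node.)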
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>> zhenyu
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> In my cluster, half of the machines have 2 disks of 400GB each, and
>>>>>>>>>>> the other half have 6 disks per machine. Maybe we should run
>>>>>>>>>>> zookeeper on the machines with 2 disks and the RSs on the machines
>>>>>>>>>>> with 6 disks?
>>>>>>>>>>>
>>>>>>>>>> That would make the most sense. Only, in the below it looks like the
>>>>>>>>>> RS that had the issue had 4 disks?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> BTW, the 10-minute zookeeper.session.timeout has been set during the
>>>>>>>>>>> experiment.
>>>>>>>>>>>
>>>>>>>>>> And for sure this node had it in place?
>>>>>>>>>> St.Ack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> thanks
>>>>>>>>>>> zhenyu
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Though I experienced 2 regionserver disconnections this morning, it
>>>>>>>>>>>>> looks better from the regionserver log. (Please see the following
>>>>>>>>>>>>> log.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://pastebin.com/m496dbfae
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I checked disk I/O for one of the regionservers (192.168.100.116)
>>>>>>>>>>>>> during the time it disconnected.
>>>>>>>>>>>>>
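>>>>>>>>>>>>> (These are plain iostat numbers, e.g. from something like
>>>>>>>>>>>>> "iostat 60" left running on the node over that window, assuming
>>>>>>>>>>>>> sysstat is installed.)
>>>>>>>>>>>>>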
>>>>>>>>>>>>> Time: 03:04:58 AM
>>>>>>>>>>>>> Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read     Blk_wrtn
>>>>>>>>>>>>> sda             105.31      5458.83     19503.64  9043873239  32312473676
>>>>>>>>>>>>> sda1              2.90         6.64        99.25    10993934    164433464
>>>>>>>>>>>>> sda2              1.72        23.76        30.25    39365817     50124008
>>>>>>>>>>>>> sda3              0.30         0.38         3.58      624930      5923000
>>>>>>>>>>>>> sda4            100.39      5428.06     19370.56  8992888270  32091993204
>>>>>>>>>>>>>
>>>>>>>>>>>> Is this high for you? 20k blocks/second would seem to be high, but
>>>>>>>>>>>> it's one disk only, and it's not being shared by zk anymore, so it
>>>>>>>>>>>> shouldn't matter?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I also checked the zookeeper quorum server that the regionserver
>>>>>>>>>>>>> tried to connect to, according to the log. However, I don't see
>>>>>>>>>>>>> 192.168.100.116 in the client list of that zookeeper quorum server.
>>>>>>>>>>>>> Would that be a problem?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> Is that because the ephemeral node for the regionserver had
>>>>>>>>>>>> evaporated? Lost its lease w/ zk by the time you went to look?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Thu Nov 12 15:04:52 UTC 2009
>>>>>>>>>>>>> Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
>>>>>>>>>>>>> Clients:
>>>>>>>>>>>>> /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
>>>>>>>>>>>>> /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
>>>>>>>>>>>>> /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
>>>>>>>>>>>>> /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
>>>>>>>>>>>>> /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
>>>>>>>>>>>>> /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
>>>>>>>>>>>>> /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
>>>>>>>>>>>>> /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Latency min/avg/max: 0/17/6333
>>>>>>>>>>>>> Received: 47111
>>>>>>>>>>>>> Sent: 0
>>>>>>>>>>>>> Outstanding: 0
>>>>>>>>>>>>> Zxid: 0x77000083f4
>>>>>>>>>>>>> Mode: leader
>>>>>>>>>>>>> Node count: 23
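>>>>>>>>>>>>>
>>>>>>>>>>>>> (That dump is ZooKeeper's "stat" four-letter command, e.g. from
>>>>>>>>>>>>> something like: echo stat | nc <quorum-host> 2181, with host and
>>>>>>>>>>>>> port being whatever the quorum member listens on.)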
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> That 6 second maximum latency is pretty bad but should be well
>>>>>>>>>>>> within the zk session timeout.
>>>>>>>>>>>>
>>>>>>>>>>>> So, the problem is likely on the zk client side of the session, i.e.
>>>>>>>>>>>> in the RS.
>>>>>>>>>>>> You could enable GC logging as J-D suggested to see if you have any
>>>>>>>>>>>> big pauses, i.e. pauses > the zk session timeout.
>>>>>>>>>>>>
>>>>>>>>>>>> When the RS went down, it didn't look too heavily loaded:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>   1. 2009-11-12 15:04:52,830 INFO
>>>>>>>>>>>>      org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>>>>>>>>>>>>      request=1.5166667, regions=322, stores=657, storefiles=631,
>>>>>>>>>>>>      storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, maxHeap=4079,
>>>>>>>>>>>>      blockCacheSize=658110960, blockCacheFree=197395984, blockCacheCount=9903,
>>>>>>>>>>>>      blockCacheHitRatio=99
>>>>>>>>>>>>
>>>>>>>>>>>> It's serving a few reads? The number of store files seems fine. Not
>>>>>>>>>>>> too much memory used.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking at the logs, I see the Lease Still Held exception. This
>>>>>>>>>>>> happens when the RS does its regular report to the master but the
>>>>>>>>>>>> master thinks the RS has since restarted. It'll think this probably
>>>>>>>>>>>> because it noticed that the RS's znode in zk had gone away and it
>>>>>>>>>>>> considered the RS dead.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking too at your logs, I see this gap in the zk pinging:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>   1. 2009-11-12 15:03:39,325 DEBUG org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>      Got ping response for sessionid:0x224e55436ad0004 after 0ms
>>>>>>>>>>>>   2. 2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>      Got ping response for sessionid:0x24e55436a0007d after 0ms
>>>>>>>>>>>>
>>>>>>>>>>>> Whereas in the lines above it, it's reporting about every ten
>>>>>>>>>>>> seconds, here there is a big gap.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have ganglia or something that will let you look more into
>>>>>>>>>>>> what was happening on this machine around this time? Is the machine
>>>>>>>>>>>> OK? It looks lightly loaded and you have your cluster nicely laid
>>>>>>>>>>>> out. Something odd is going on. What about things like the write
>>>>>>>>>>>> speed to disk? In the past, strange issues have been explained by an
>>>>>>>>>>>> incorrectly set BIOS which made disks run at 1/100th of their proper
>>>>>>>>>>>> speed.
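>>>>>>>>>>>>
>>>>>>>>>>>> (A quick rough check, assuming you can afford a short burst of I/O
>>>>>>>>>>>> on the node and the data disk is mounted somewhere like /data1 --
>>>>>>>>>>>> the path is only a placeholder:
>>>>>>>>>>>>
>>>>>>>>>>>>   hdparm -t /dev/sda
>>>>>>>>>>>>   dd if=/dev/zero of=/data1/ddtest bs=1M count=1024 conv=fdatasync
>>>>>>>>>>>>
>>>>>>>>>>>> and remove the test file afterwards.)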
>>>>>>>>>>>>
>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Stack,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I really appreciate your comments.
>>>>>>>>>>>>>> I will use the zookeeper monitoring script on my cluster and let it
>>>>>>>>>>>>>> run overnight to see the result.
>>>>>>>>>>>>>> I will follow up on the thread when I get anything.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM, stack <[email protected]> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>>>>> I see these in your log too:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   1. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>>>      Got ping response for sessionid:0x424dfd908c50009 after 4544ms
>>>>>>>>>>>>>>>   2. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>>>      Got ping response for sessionid:0x24dfd90c810002 after 4548ms
>>>>>>>>>>>>>>>   3. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>>>      Got ping response for sessionid:0x424dfd908c50009 after 9030ms
>>>>>>>>>>>>>>>   4. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.
>>>>>>>>>>
>>>>>>>>>
>>
>
