Stack, Thanks a lot!
I found out why HBase wasn't picking up the file descriptor limit I set: I had started HBase as root instead of as the normal hadoop user, and my system configures the hadoop user with a higher file descriptor limit while root keeps the default of 1024. Now the system has a clean start.

However, the zookeeper.session.timeout value still doesn't seem to take effect. I still see pauses of around 60 seconds from the disconnected regionserver, and I don't understand why the regionserver times out after 60 seconds instead of the 10 minutes I set for zookeeper.session.timeout. Is there any other timeout value that comes into play before the session actually times out? A few configuration and diagnostic sketches are appended after the quoted thread below.

zhenyu

On Fri, Nov 13, 2009 at 7:08 PM, stack <[email protected]> wrote:

> Ok. Lack of file descriptors manifests in all kinds of weird ways.
> Hopefully thats it. If not, lets keep digging.
> St.Ack
>
> On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong <[email protected]> wrote:
>
> > Stack,
> >
> > You are right, the master started with ulimit -n 1024. It doesn't take the
> > system value.
> >
> > Regarding the too many open files, it looks like the same as the one J-D
> > put up. But I will get the Master started with a higher value first and see if
> > this kind of error goes away.
> >
> > thanks a lot!
> > zhenyu
> >
> > On Fri, Nov 13, 2009 at 6:02 PM, stack <[email protected]> wrote:
> >
> > > Does it say
> > >
> > > ulimit -n 32768
> > >
> > > ...as the second line in your log file on start of the master?
> > >
> > > Is the stack trace that complains about too many open files the same as the one
> > > in the blog post J-D put up?
> > >
> > > St.Ack
> > >
> > > On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <[email protected]> wrote:
> > >
> > > > The ulimit file descriptors was set to fs.file-max = 1578334; also in the
> > > > limits.conf the value was set to 32768.
> > > > So these are way higher than the open descriptors for the running
> > > > processes.
> > > >
> > > > thanks
> > > > zhenyu
> > > >
> > > > On Fri, Nov 13, 2009 at 4:33 PM, Stack <[email protected]> wrote:
> > > >
> > > > > You upped the ulimit file descriptors as per the getting started doc?
> > > > >
> > > > > On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <[email protected]> wrote:
> > > > >
> > > > >> Thanks a lot.
> > > > >>
> > > > >> Bad news is my kernel is still 2.6.26.
> > > > >> But it was not a problem before.
> > > > >>
> > > > >> Very strange.
> > > > >>
> > > > >> zhenyu
> > > > >>
> > > > >> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > > > >>
> > > > >>> Looks like
> > > > >>> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
> > > > >>>
> > > > >>> J-D
> > > > >>>
> > > > >>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <[email protected]> wrote:
> > > > >>>
> > > > >>>> Hi,
> > > > >>>>
> > > > >>>> After I re-organized the cluster, the experiment ran into problems faster
> > > > >>>> than before.
> > > > >>>> Basically the changes were to use machines with fewer resources as
> > > > >>>> zookeeper quorum members and machines with more resources as regionservers.
> > > > >>>>
> > > > >>>> From the log, I still see the pause around 1 minute.
> > > > >>>> I enabled the GC logging, please see > > > > >>>> > > > > >>>> http://pastebin.com/m1d4ce0f1 > > > > >>>> > > > > >>>> for details. > > > > >>>> In general I don't see many pauses in the GC. > > > > >>>> > > > > >>>> What is more interesting, 1 hour after the 1st regionserver > > > > >>>> disconnected, > > > > >>>> the master log complained about too many open files. This didn't > > > > happen > > > > >>>> before. > > > > >>>> I checked the system OS setup as well as the limits.conf. I also > > > > checked > > > > >>>> > > > > >>> the > > > > >>> > > > > >>>> running processes. The total open files don't reach the limit. I > > am > > > > >>>> > > > > >>> confused > > > > >>> > > > > >>>> a bit. > > > > >>>> > > > > >>>> > > > > >>>> Please see the following master log. > > > > >>>> > > > > >>>> 2009-11-13 20:06:02,114 INFO > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > >>>> RegionManager.metaScanner scan of 4658 row(s) of meta region > > > {server: > > > > >>>> 192.168.100.128:60021, regionname: .META.,,1, startKey: <>} > > > complete > > > > >>>> 2009-11-13 20:06:02,114 INFO > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > >>>> > > > > >>> All > > > > >>> > > > > >>>> 1 .META. region(s) scanned > > > > >>>> 2009-11-13 20:06:07,677 DEBUG org.apache.zookeeper.ClientCnxn: > Got > > > > ping > > > > >>>> response for sessionid:0x424eebf1c10004c after 3ms > > > > >>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: > > > > Exception > > > > >>>> > > > > >>> in > > > > >>> > > > > >>>> createBlockOutputStream java.io.IOException: Bad connect ack > with > > > > >>>> firstBadLink 192.168.100.123:50010 > > > > >>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: > > > > >>>> Abandoning > > > > >>>> block blk_-2808245019291145247_5478039 > > > > >>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: > > > > Exception > > > > >>>> > > > > >>> in > > > > >>> > > > > >>>> createBlockOutputStream java.io.EOFException > > > > >>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: > > > > >>>> Abandoning > > > > >>>> block blk_1074853606841896259_5478048 > > > > >>>> 2009-11-13 20:06:10,334 DEBUG org.apache.zookeeper.ClientCnxn: > Got > > > > ping > > > > >>>> response for sessionid:0x24eebf1043003c after 1ms > > > > >>>> 2009-11-13 20:06:21,018 DEBUG org.apache.zookeeper.ClientCnxn: > Got > > > > ping > > > > >>>> response for sessionid:0x424eebf1c10004c after 0ms > > > > >>>> 2009-11-13 20:06:23,674 DEBUG org.apache.zookeeper.ClientCnxn: > Got > > > > ping > > > > >>>> response for sessionid:0x24eebf1043003c after 0ms > > > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: > > > > Exception > > > > >>>> > > > > >>> in > > > > >>> > > > > >>>> createBlockOutputStream java.io.IOException: Bad connect ack > with > > > > >>>> firstBadLink 192.168.100.123:50010 > > > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: > > > > >>>> Abandoning > > > > >>>> block blk_-6642544517082142289_5478063 > > > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: > > > > Exception > > > > >>>> > > > > >>> in > > > > >>> > > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > > files > > > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: > > > > >>>> Abandoning > > > > >>>> block blk_2057511041109796090_5478063 > > > > >>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: > > > > Exception > > > > >>>> > > > > >>> in 
> > > > >>> > > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > > files > > > > >>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: > > > > >>>> Abandoning > > > > >>>> block blk_8219260302213892894_5478064 > > > > >>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: > > > > Exception > > > > >>>> > > > > >>> in > > > > >>> > > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > > files > > > > >>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: > > > > >>>> Abandoning > > > > >>>> block blk_1669205542853067709_5478235 > > > > >>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: > > > > Exception > > > > >>>> > > > > >>> in > > > > >>> > > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > > files > > > > >>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: > > > > >>>> Abandoning > > > > >>>> block blk_9128897691346270351_5478237 > > > > >>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: > > > > Exception > > > > >>>> > > > > >>> in > > > > >>> > > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > > files > > > > >>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: > > > > >>>> Abandoning > > > > >>>> block blk_1116845144864123018_5478240 > > > > >>>> 2009-11-13 20:06:34,372 DEBUG org.apache.zookeeper.ClientCnxn: > Got > > > > ping > > > > >>>> response for sessionid:0x424eebf1c10004c after 0ms > > > > >>>> 2009-11-13 20:06:37,034 DEBUG org.apache.zookeeper.ClientCnxn: > Got > > > > ping > > > > >>>> response for sessionid:0x24eebf1043003c after 0ms > > > > >>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: > > > > >>>> > > > > >>> DataStreamer > > > > >>> > > > > >>>> Exception: java.io.IOException: Too many open files > > > > >>>> at sun.nio.ch.IOUtil.initPipe(Native Method) > > > > >>>> at > > > sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49) > > > > >>>> at > > > > >>>> > > > > >>>> > > > > > > > > > > sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18) > > > > >>> > > > > >>>> at > > > > >>>> > > > > >>>> > > > > > > > > > > org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407) > > > > >>> > > > > >>>> at > > > > >>>> > > > > >>>> > > > > > > > > > > org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322) > > > > >>> > > > > >>>> at > > > > >>>> > > > > >>>> > > > > > > > > > > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) > > > > >>> > > > > >>>> at > > > > >>>> > > > > >>>> > > > > > > > > > > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146) > > > > >>> > > > > >>>> at > > > > >>>> > > > > >>>> > > > > > > > > > > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107) > > > > >>> > > > > >>>> at > > > > >>>> > > > > >>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) > > > > >>> > > > > >>>> at java.io.DataOutputStream.write(DataOutputStream.java:90) > > > > >>>> at > > > > >>>> > > > > >>>> > > > > > > > > > > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290) > > > > >>> > > > > >>>> > > > > >>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: > > Error > > > > >>>> Recovery for block blk_8148813491785406356_5478475 bad > datanode[0] > > > > >>>> 192.168.100.123:50010 > > > > >>>> 2009-11-13 
20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: > > Error > > > > >>>> Recovery for block blk_8148813491785406356_5478475 in pipeline > > > > >>>> 192.168.100.123:50010, 192.168.100.134:50010, > > 192.168.100.122:50010 > > > : > > > > >>>> bad > > > > >>>> datanode 192.168.100.123:50010 > > > > >>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: > > > > Exception > > > > >>>> > > > > >>> in > > > > >>> > > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > > files > > > > >>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: > > > > >>>> Abandoning > > > > >>>> block blk_2119727700857186236_5478498 > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong < > > > > [email protected] > > > > >>>> wrote: > > > > >>>> > > > > >>>> Will do. > > > > >>>>> > > > > >>>>> thanks > > > > >>>>> zhenyu > > > > >>>>> > > > > >>>>> > > > > >>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <[email protected]> > wrote: > > > > >>>>> > > > > >>>>> Enable GC logging too on this next run (see hbase-env.sh). > Lets > > > try > > > > >>>>>> > > > > >>>>> and > > > > >>> > > > > >>>> get > > > > >>>>>> to the bottom of whats going on. > > > > >>>>>> Thanks, > > > > >>>>>> St.Ack > > > > >>>>>> > > > > >>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong < > > > > >>>>>> > > > > >>>>> [email protected] > > > > >>> > > > > >>>> wrote: > > > > >>>>>>> > > > > >>>>>> > > > > >>>>>> I can switch the boxes that run zookeeper with the ones that > > run > > > > >>>>>>> regionservers. > > > > >>>>>>> I will see the results later. > > > > >>>>>>> > > > > >>>>>>> FYI. The node does have the 10 minutes > > zookeeper.session.timeout > > > > >>>>>>> > > > > >>>>>> value > > > > >>> > > > > >>>> in > > > > >>>>>> > > > > >>>>>>> place. > > > > >>>>>>> > > > > >>>>>>> thanks > > > > >>>>>>> zhenyu > > > > >>>>>>> > > > > >>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <[email protected]> > > wrote: > > > > >>>>>>> > > > > >>>>>>> On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong < > > > > >>>>>>>> > > > > >>>>>>> [email protected] > > > > >>>>>> > > > > >>>>>>> wrote: > > > > >>>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>>> In my cluster, half of the cluster have 2 disks 400GB each > > per > > > > >>>>>>>>> > > > > >>>>>>>> machine, > > > > >>>>>> > > > > >>>>>>> and > > > > >>>>>>>> > > > > >>>>>>>>> half of the cluster have 6 disks per machine. Maybe we > > should > > > > >>>>>>>>> > > > > >>>>>>>> run > > > > >>> > > > > >>>> zookeeper > > > > >>>>>>>>> on the machines with 2 disks and RS on machines with 6 > disks? > > > > >>>>>>>>> > > > > >>>>>>>>> That would make most sense only in the below, it looks like > > the > > > > >>>>>>>>> > > > > >>>>>>>> RS > > > > >>> > > > > >>>> that > > > > >>>>>> > > > > >>>>>>> had > > > > >>>>>>>> issue had 4 disks? > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> BTW, the 10 minutes zookeeper.session.timeout has been set > > > during > > > > >>>>>>>>> > > > > >>>>>>>> the > > > > >>>>>> > > > > >>>>>>> experiment. > > > > >>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>> And for sure this node had it in place? 
> > > > >>>>>>>> St.Ack > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>>> thanks > > > > >>>>>>>>> zhenyu > > > > >>>>>>>>> > > > > >>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <[email protected]> > > > wrote: > > > > >>>>>>>>> > > > > >>>>>>>>> On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong < > > > > >>>>>>>>>> > > > > >>>>>>>>> [email protected] > > > > >>>>>>> > > > > >>>>>>>> wrote: > > > > >>>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> Though I experienced 2 regionserver disconnection this > > > > >>>>>>>>>>> > > > > >>>>>>>>>> morning, > > > > >>> > > > > >>>> it > > > > >>>>>> > > > > >>>>>>> looks > > > > >>>>>>>>> > > > > >>>>>>>>>> better from the regionserver log. (please see the > following > > > > >>>>>>>>>>> > > > > >>>>>>>>>> log) > > > > >>> > > > > >>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> http://pastebin.com/m496dbfae > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> I checked diskIO for one of the > > regionserver(192.168.100.116) > > > > >>>>>>>>>>> > > > > >>>>>>>>>> during > > > > >>>>>>> > > > > >>>>>>>> the > > > > >>>>>>>>> > > > > >>>>>>>>>> time it disconnected. > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> Time: 03:04:58 AM > > > > >>>>>>>>>>> Device: tps Blk_read/s Blk_wrtn/s > Blk_read > > > > >>>>>>>>>>> > > > > >>>>>>>>>> Blk_wrtn > > > > >>>>>>>> > > > > >>>>>>>>> sda 105.31 5458.83 19503.64 9043873239 > > > > >>>>>>>>>>> > > > > >>>>>>>>>> 32312473676 > > > > >>>>>>>> > > > > >>>>>>>>> sda1 2.90 6.64 99.25 10993934 > > > > >>>>>>>>>>> > > > > >>>>>>>>>> 164433464 > > > > >>>>>>>> > > > > >>>>>>>>> sda2 1.72 23.76 30.25 39365817 > > > > >>>>>>>>>>> > > > > >>>>>>>>>> 50124008 > > > > >>>>>>>> > > > > >>>>>>>>> sda3 0.30 0.38 3.58 624930 > > > > >>>>>>>>>>> > > > > >>>>>>>>>> 5923000 > > > > >>>>>>>> > > > > >>>>>>>>> sda4 100.39 5428.06 19370.56 8992888270 > > > > >>>>>>>>>>> > > > > >>>>>>>>>> 32091993204 > > > > >>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> Is this high for you? 20k blocks/second would seem to > be > > > high > > > > >>>>>>>>>> > > > > >>>>>>>>> but > > > > >>> > > > > >>>> its > > > > >>>>>>> > > > > >>>>>>>> one > > > > >>>>>>>>> > > > > >>>>>>>>>> disk only and its not being shared by zk anymore so > > shouldn't > > > > >>>>>>>>>> > > > > >>>>>>>>> matter? > > > > >>>>>> > > > > >>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> I also checked the zookeeper quorum server that the > > > > >>>>>>>>>>> > > > > >>>>>>>>>> regionserver > > > > >>> > > > > >>>> tried > > > > >>>>>>>> > > > > >>>>>>>>> to > > > > >>>>>>>>> > > > > >>>>>>>>>> connect according to the log. However, I don't see > > > > >>>>>>>>>>> > > > > >>>>>>>>>> 192.168.100.116 > > > > >>>>>> > > > > >>>>>>> in > > > > >>>>>>> > > > > >>>>>>>> the > > > > >>>>>>>>> > > > > >>>>>>>>>> client list of the zookeeper quorum server that the > > > > >>>>>>>>>>> > > > > >>>>>>>>>> regionserver > > > > >>> > > > > >>>> tried > > > > >>>>>>>> > > > > >>>>>>>>> to > > > > >>>>>>>>> > > > > >>>>>>>>>> connect. > > > > >>>>>>>>>>> Would that be a problem? > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> Is that because the ephemeral node for the regionserver > > had > > > > >>>>>>>>>> > > > > >>>>>>>>> evaporated? > > > > >>>>>>> > > > > >>>>>>>> Lost > > > > >>>>>>>>>> its lease w/ zk by the time you went to look? 
> > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> Thu Nov 12 15:04:52 UTC 2009 > > > > >>>>>>>>>>> Zookeeper version: 3.2.1-808558, built on 08/27/2009 > 18:48 > > > > >>>>>>>>>>> > > > > >>>>>>>>>> GMT > > > > >>> > > > > >>>> Clients: > > > > >>>>>>>>>>> /192.168.100.127:43045[1](queued=0,recved=26,sent=0) > > > > >>>>>>>>>>> /192.168.100.131:39091[1](queued=0,recved=964,sent=0) > > > > >>>>>>>>>>> /192.168.100.124:35961[1](queued=0,recved=3266,sent=0) > > > > >>>>>>>>>>> /192.168.100.123:47935[1](queued=0,recved=233,sent=0) > > > > >>>>>>>>>>> /192.168.100.125:46931[1](queued=0,recved=2,sent=0) > > > > >>>>>>>>>>> /192.168.100.118:54924[1](queued=0,recved=295,sent=0) > > > > >>>>>>>>>>> /192.168.100.118:41390[1](queued=0,recved=2290,sent=0) > > > > >>>>>>>>>>> /192.168.100.136:42243[1](queued=0,recved=0,sent=0) > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> Latency min/avg/max: 0/17/6333 > > > > >>>>>>>>>>> Received: 47111 > > > > >>>>>>>>>>> Sent: 0 > > > > >>>>>>>>>>> Outstanding: 0 > > > > >>>>>>>>>>> Zxid: 0x77000083f4 > > > > >>>>>>>>>>> Mode: leader > > > > >>>>>>>>>>> Node count: 23 > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> That 6 second maximum latency is pretty bad but should be > > well > > > > >>>>>>>>>> > > > > >>>>>>>>> within > > > > >>>>>> > > > > >>>>>>> the > > > > >>>>>>>> > > > > >>>>>>>>> zk > > > > >>>>>>>>>> session timeout. > > > > >>>>>>>>>> > > > > >>>>>>>>>> So, problem is likely on the zk client side of the > session; > > > > >>>>>>>>>> > > > > >>>>>>>>> i.e. > > > > >>> > > > > >>>> in > > > > >>>>>> > > > > >>>>>>> the > > > > >>>>>>> > > > > >>>>>>>> RS. > > > > >>>>>>>>> > > > > >>>>>>>>>> You could enable GC logging as J-D suggested to see if you > > > have > > > > >>>>>>>>>> > > > > >>>>>>>>> any > > > > >>>>>> > > > > >>>>>>> big > > > > >>>>>>> > > > > >>>>>>>> pauses, pauses > zk session timeout. > > > > >>>>>>>>>> > > > > >>>>>>>>>> When the RS went down, it didn't look too heavily loaded: > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> 1. 2009-11-12 15:04:52,830 INFO > > > > >>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump > of > > > > >>>>>>>>>> > > > > >>>>>>>>> metrics: > > > > >>>>>>> > > > > >>>>>>>> request=1.5166667, regions=322, stores=657, storefiles=631, > > > > >>>>>>>>>> storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, > > > > >>>>>>>>>> > > > > >>>>>>>>> maxHeap=4079, > > > > >>>>>>>> > > > > >>>>>>>>> blockCacheSize=658110960, blockCacheFree=197395984, > > > > >>>>>>>>>> > > > > >>>>>>>>> blockCacheCount=9903, > > > > >>>>>>>>> > > > > >>>>>>>>>> blockCacheHitRatio=99 > > > > >>>>>>>>>> > > > > >>>>>>>>>> Its serving a few reads? The number of store files seems > > > fine. > > > > >>>>>>>>>> > > > > >>>>>>>>> Not > > > > >>>>>> > > > > >>>>>>> too > > > > >>>>>>>> > > > > >>>>>>>>> much memory used. > > > > >>>>>>>>>> > > > > >>>>>>>>>> Looking at the logs, I see the Lease Still Held exception. > > > > >>>>>>>>>> > > > > >>>>>>>>> This > > > > >>> > > > > >>>> happens > > > > >>>>>>>> > > > > >>>>>>>>> when the RS does its regular report to the master but the > > > > >>>>>>>>>> > > > > >>>>>>>>> master > > > > >>> > > > > >>>> thinks > > > > >>>>>>> > > > > >>>>>>>> the > > > > >>>>>>>>> > > > > >>>>>>>>>> RS has since restarted. 
It'll think this probably because > > it > > > > >>>>>>>>>> > > > > >>>>>>>>> noticed > > > > >>>>>> > > > > >>>>>>> that > > > > >>>>>>>>> > > > > >>>>>>>>>> the RS's znode in zk had gone away and it considered the > RS > > > > >>>>>>>>>> > > > > >>>>>>>>> dead. > > > > >>> > > > > >>>> > > > > >>>>>>>>>> Looking too at your logs I see this gap in the zk > pinging: > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> 1. 2009-11-12 15:03:39,325 DEBUG > > > > >>>>>>>>>> > > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn: > > > > >>>>>> > > > > >>>>>>> Got > > > > >>>>>>> > > > > >>>>>>>> ping response for sessionid:0x224e55436ad0004 after 0ms > > > > >>>>>>>>>> 2. 2009-11-12 15:03:43,113 DEBUG > > > > >>>>>>>>>> > > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn: > > > > >>>>>> > > > > >>>>>>> Got > > > > >>>>>>> > > > > >>>>>>>> ping response for sessionid:0x24e55436a0007d after 0ms > > > > >>>>>>>>>> > > > > >>>>>>>>>> Where in the lines above it, its reporting about every ten > > > > >>>>>>>>>> > > > > >>>>>>>>> seconds, > > > > >>>>>> > > > > >>>>>>> here > > > > >>>>>>>> > > > > >>>>>>>>> there is a big gap. > > > > >>>>>>>>>> > > > > >>>>>>>>>> Do you have ganglia or something that will let you look > more > > > > >>>>>>>>>> > > > > >>>>>>>>> into > > > > >>> > > > > >>>> what > > > > >>>>>>> > > > > >>>>>>>> was > > > > >>>>>>>>> > > > > >>>>>>>>>> happening on this machine around this time? Is the > machine > > > OK? > > > > >>>>>>>>>> > > > > >>>>>>>>> It > > > > >>>>>> > > > > >>>>>>> looks > > > > >>>>>>>> > > > > >>>>>>>>> lightly loaded and you have your cluster nicely laid out. > > > > >>>>>>>>>> > > > > >>>>>>>>> Something > > > > >>>>>> > > > > >>>>>>> odd > > > > >>>>>>>> > > > > >>>>>>>>> is > > > > >>>>>>>>> > > > > >>>>>>>>>> going on. What about things like the write speed to disk? > > In > > > > >>>>>>>>>> > > > > >>>>>>>>> the > > > > >>> > > > > >>>> past > > > > >>>>>>> > > > > >>>>>>>> strange issues have been explained by incorrectly set BIOS > > > > >>>>>>>>>> > > > > >>>>>>>>> which > > > > >>> > > > > >>>> made > > > > >>>>>> > > > > >>>>>>> disks > > > > >>>>>>>>> > > > > >>>>>>>>>> run at 1/100th of their proper speed. > > > > >>>>>>>>>> > > > > >>>>>>>>>> St.Ack > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>>> Best, > > > > >>>>>>>>>>> zhenyu > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong < > > > > >>>>>>>>>>> > > > > >>>>>>>>>> [email protected] > > > > >>>>>>>> > > > > >>>>>>>>> wrote: > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> Stack > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> I am very appreciated for your comments. > > > > >>>>>>>>>>>> I will use the zookeeper monitoring script on my cluster > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>> and > > > > >>> > > > > >>>> let > > > > >>>>>> > > > > >>>>>>> it > > > > >>>>>>> > > > > >>>>>>>> run > > > > >>>>>>>>> > > > > >>>>>>>>>> overnight to see the result. > > > > >>>>>>>>>>>> I will follow up the thread when I get anything. 
> > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> thanks > > > > >>>>>>>>>>>> zhenyu > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM, stack < > [email protected]> > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>> wrote: > > > > >>>>>> > > > > >>>>>>> > > > > >>>>>>>>>>>> I see these in your log too: > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> 1. 2009-11-11 04:27:19,018 DEBUG > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn: > > > > >>>>>>>> > > > > >>>>>>>>> Got > > > > >>>>>>>>>> > > > > >>>>>>>>>>> ping response for sessionid:0x424dfd908c50009 after > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> 4544ms > > > > >>> > > > > >>>> 2. 2009-11-11 04:27:19,018 DEBUG > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn: > > > > >>>>>>>> > > > > >>>>>>>>> Got > > > > >>>>>>>>>> > > > > >>>>>>>>>>> ping response for sessionid:0x24dfd90c810002 after > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> 4548ms > > > > >>> > > > > >>>> 3. 2009-11-11 04:27:43,960 DEBUG > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn: > > > > >>>>>>>> > > > > >>>>>>>>> Got > > > > >>>>>>>>>> > > > > >>>>>>>>>>> ping response for sessionid:0x424dfd908c50009 after > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> 9030ms > > > > >>> > > > > >>>> 4. 2009-11-11 04:27:43,960 DEBUG > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> org.apache.zookeeper. > > > > >>>>>>>> > > > > >>>>>>> > > > > > > > > > >
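A note on the file-descriptor issue discussed above: the limit that matters is the one the HBase daemons inherited when they were started, which is why launching as root picked up root's 1024 default rather than the hadoop user's higher value from limits.conf. The following is a minimal sketch for double-checking what the running daemons actually got; it assumes the daemons run as the hadoop user and that jps is on the PATH, so adjust names as needed.

#!/usr/bin/env bash
# Sketch: confirm the file-descriptor limit the running HBase daemons actually have.

# Limit a fresh shell of the launching user would get (root vs hadoop matters here).
su - hadoop -c 'ulimit -n'

# Limit each already-running daemon inherited at start time, plus current usage.
for pid in $(jps | awk '/HMaster|HRegionServer/ {print $1}'); do
  echo "== pid $pid =="
  grep 'Max open files' /proc/$pid/limits
  # Rough count of descriptors currently open in this process.
  ls /proc/$pid/fd | wc -l
done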
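On the zookeeper.session.timeout question at the top of the thread, one thing worth ruling out (a guess, not a diagnosis): a ZooKeeper server will not grant a session timeout longer than its own maximum, which defaults to 20 x tickTime (newer releases expose it as maxSessionTimeout), so a 10-minute request can be silently negotiated down. With a tickTime of 3000 ms that cap works out to exactly 60 seconds. A rough way to compare both sides is sketched below; the paths are illustrative and will differ on your install, and if HBase manages the quorum for you the tick time comes from HBase's own configuration rather than a zoo.cfg.

#!/usr/bin/env bash
# Sketch: compare the session timeout HBase requests with what the quorum can grant.
# HBASE_CONF and ZOO_CFG are example paths; point them at your own configs.
HBASE_CONF=/etc/hbase/conf/hbase-site.xml
ZOO_CFG=/etc/zookeeper/zoo.cfg

# Client side: the timeout HBase asks for (600000 ms would be 10 minutes).
grep -A 1 'zookeeper.session.timeout' "$HBASE_CONF"

# Server side: the cap is 20 * tickTime unless maxSessionTimeout is set (where supported).
grep -E 'tickTime|SessionTimeout' "$ZOO_CFG"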
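Since the thread suggests enabling GC logging via hbase-env.sh, here is a sketch of the sort of flags that apply on a Sun/Oracle 1.6-era JVM; the log path is only an example and the commented-out collector choice is optional.

# hbase-env.sh (sketch): turn on GC logging so JVM pauses can be lined up with
# the "Got ping response ... after Nms" gaps seen in the region server logs.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"

# Optional: the low-pause collector commonly recommended for HBase at the time.
# export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC"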
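The "Clients:" listing quoted above comes from ZooKeeper's four-letter "stat" command. A small polling loop like the one below (the host names are placeholders for your quorum members) makes it easier to see whether a region server's connection disappears from its quorum peer before or after a pause, and how the reported latency moves over time.

#!/usr/bin/env bash
# Sketch: poll each quorum member with the four-letter "stat" command every 10 s.
# Replace the host list with your own quorum members; 2181 is the default client port.
QUORUM="zk1.example.com zk2.example.com zk3.example.com"

while true; do
  date -u
  for host in $QUORUM; do
    echo "---- $host ----"
    echo stat | nc "$host" 2181
  done
  sleep 10
done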
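Finally, on the question of raw write speed to disk (the mis-set BIOS case mentioned above), a quick non-destructive check on a region server node might look like the following; /dev/sda matches the device in the iostat output quoted earlier, but substitute whichever disks HDFS actually writes to, and note hdparm needs root.

#!/usr/bin/env bash
# Sketch: rough disk sanity check on a region server node.

# Buffered sequential read throughput as seen by the kernel (non-destructive).
hdparm -t /dev/sda

# Extended per-device stats: utilisation, queue size and await; 6 samples at 5 s intervals.
iostat -x 5 6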
