Ok. Lack of file descriptors manifests in all kinds of weird ways. Hopefully that's it. If not, let's keep digging. St.Ack
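Descriptor exhaustion like this can be confirmed directly on the box. A minimal sketch, assuming Linux `/proc`; `$$` (this shell) is only a stand-in for the real HMaster or regionserver pid, e.g. taken from `jps`:

```shell
# Count the descriptors a process actually holds open right now.
# "$$" is a stand-in pid; substitute the HMaster/regionserver pid.
pid=$$
ls "/proc/${pid}/fd" | wc -l
```

Compare the count against the soft limit the process runs under; note that on 2.6.27+ kernels each NIO selector also consumes an epoll instance, which has its own per-user limit.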
On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong <[email protected]> wrote:
> Stack,
>
> You are right, the master started with ulimit -n 1024. It doesn't pick up
> the system value.
>
> Regarding the too many open files, it looks the same as the one J-D put
> up. But I will start the master with a higher value first and see if this
> kind of error goes away.
>
> thanks a lot!
> zhenyu
>
> On Fri, Nov 13, 2009 at 6:02 PM, stack <[email protected]> wrote:
>> Does it say
>>
>> ulimit -n 32768
>>
>> ...as the second line in your log file on start of the master?
>>
>> Is the stack trace that complains about too many open files the same as
>> the one in the blog post J-D put up?
>>
>> St.Ack
>>
>> On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <[email protected]> wrote:
>>> The file descriptor limit was set with fs.file-max = 1578334, and in
>>> limits.conf the value was set to 32768.
>>> So these are way higher than the open descriptors for the running
>>> processes.
>>>
>>> thanks
>>> zhenyu
>>>
>>> On Fri, Nov 13, 2009 at 4:33 PM, Stack <[email protected]> wrote:
>>>> You upped the ulimit file descriptors as per the getting started doc?
>>>>
>>>> On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>> Thanks a lot.
>>>>>
>>>>> Bad news is my kernel is still 2.6.26.
>>>>> But it was not a problem before.
>>>>>
>>>>> Very strange.
>>>>>
>>>>> zhenyu
>>>>>
>>>>> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>> Looks like
>>>>>>
>>>>>> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> After I re-organized the cluster, the experiment ran into problems
>>>>>>> faster than before.
>>>>>>> Basically the changes were to use machines with fewer resources as
>>>>>>> zookeeper quorum members and machines with more resources as
>>>>>>> regionservers.
>>>>>>>
>>>>>>> From the log, I still see the pause of around 1 minute.
>>>>>>> I enabled GC logging; please see
>>>>>>>
>>>>>>> http://pastebin.com/m1d4ce0f1
>>>>>>>
>>>>>>> for details.
>>>>>>> In general I don't see many pauses in the GC.
>>>>>>>
>>>>>>> What is more interesting, 1 hour after the 1st regionserver
>>>>>>> disconnected, the master log complained about too many open files.
>>>>>>> This didn't happen before.
>>>>>>> I checked the OS setup as well as limits.conf. I also checked the
>>>>>>> running processes. The total open files don't reach the limit, so I
>>>>>>> am a bit confused.
>>>>>>>
>>>>>>> Please see the following master log.
> > > >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO > > > org.apache.hadoop.hbase.master.BaseScanner: > > > >>>> RegionManager.metaScanner scan of 4658 row(s) of meta region > > {server: > > > >>>> 192.168.100.128:60021, regionname: .META.,,1, startKey: <>} > > complete > > > >>>> 2009-11-13 20:06:02,114 INFO > > > org.apache.hadoop.hbase.master.BaseScanner: > > > >>>> > > > >>> All > > > >>> > > > >>>> 1 .META. region(s) scanned > > > >>>> 2009-11-13 20:06:07,677 DEBUG org.apache.zookeeper.ClientCnxn: Got > > > ping > > > >>>> response for sessionid:0x424eebf1c10004c after 3ms > > > >>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: > > > Exception > > > >>>> > > > >>> in > > > >>> > > > >>>> createBlockOutputStream java.io.IOException: Bad connect ack with > > > >>>> firstBadLink 192.168.100.123:50010 > > > >>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: > > > >>>> Abandoning > > > >>>> block blk_-2808245019291145247_5478039 > > > >>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: > > > Exception > > > >>>> > > > >>> in > > > >>> > > > >>>> createBlockOutputStream java.io.EOFException > > > >>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: > > > >>>> Abandoning > > > >>>> block blk_1074853606841896259_5478048 > > > >>>> 2009-11-13 20:06:10,334 DEBUG org.apache.zookeeper.ClientCnxn: Got > > > ping > > > >>>> response for sessionid:0x24eebf1043003c after 1ms > > > >>>> 2009-11-13 20:06:21,018 DEBUG org.apache.zookeeper.ClientCnxn: Got > > > ping > > > >>>> response for sessionid:0x424eebf1c10004c after 0ms > > > >>>> 2009-11-13 20:06:23,674 DEBUG org.apache.zookeeper.ClientCnxn: Got > > > ping > > > >>>> response for sessionid:0x24eebf1043003c after 0ms > > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: > > > Exception > > > >>>> > > > >>> in > > > >>> > > > >>>> createBlockOutputStream java.io.IOException: Bad connect ack with > > > >>>> firstBadLink 192.168.100.123:50010 
> > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: > > > >>>> Abandoning > > > >>>> block blk_-6642544517082142289_5478063 > > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: > > > Exception > > > >>>> > > > >>> in > > > >>> > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > files > > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: > > > >>>> Abandoning > > > >>>> block blk_2057511041109796090_5478063 > > > >>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: > > > Exception > > > >>>> > > > >>> in > > > >>> > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > files > > > >>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: > > > >>>> Abandoning > > > >>>> block blk_8219260302213892894_5478064 > > > >>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: > > > Exception > > > >>>> > > > >>> in > > > >>> > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > files > > > >>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: > > > >>>> Abandoning > > > >>>> block blk_1669205542853067709_5478235 > > > >>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: > > > Exception > > > >>>> > > > >>> in > > > >>> > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > files > > > >>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: > > > >>>> Abandoning > > > >>>> block blk_9128897691346270351_5478237 > > > >>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: > > > Exception > > > >>>> > > > >>> in > > > >>> > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > files > > > >>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: > > > >>>> Abandoning > > > >>>> block blk_1116845144864123018_5478240 > > > >>>> 2009-11-13 20:06:34,372 DEBUG org.apache.zookeeper.ClientCnxn: Got > > > 
ping > > > >>>> response for sessionid:0x424eebf1c10004c after 0ms > > > >>>> 2009-11-13 20:06:37,034 DEBUG org.apache.zookeeper.ClientCnxn: Got > > > ping > > > >>>> response for sessionid:0x24eebf1043003c after 0ms > > > >>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: > > > >>>> > > > >>> DataStreamer > > > >>> > > > >>>> Exception: java.io.IOException: Too many open files > > > >>>> at sun.nio.ch.IOUtil.initPipe(Native Method) > > > >>>> at > > sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49) > > > >>>> at > > > >>>> > > > >>>> > > > > > > sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18) > > > >>> > > > >>>> at > > > >>>> > > > >>>> > > > > > > org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407) > > > >>> > > > >>>> at > > > >>>> > > > >>>> > > > > > > org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322) > > > >>> > > > >>>> at > > > >>>> > > > >>>> > > > > > > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) > > > >>> > > > >>>> at > > > >>>> > > > >>>> > > > > > > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146) > > > >>> > > > >>>> at > > > >>>> > > > >>>> > > > > > > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107) > > > >>> > > > >>>> at > > > >>>> > > > >>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) > > > >>> > > > >>>> at java.io.DataOutputStream.write(DataOutputStream.java:90) > > > >>>> at > > > >>>> > > > >>>> > > > > > > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290) > > > >>> > > > >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: > Error > > > >>>> Recovery for block blk_8148813491785406356_5478475 bad datanode[0] > > > >>>> 192.168.100.123:50010 > > > >>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: > Error > > > >>>> 
Recovery for block blk_8148813491785406356_5478475 in pipeline > > > >>>> 192.168.100.123:50010, 192.168.100.134:50010, > 192.168.100.122:50010 > > : > > > >>>> bad > > > >>>> datanode 192.168.100.123:50010 > > > >>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: > > > Exception > > > >>>> > > > >>> in > > > >>> > > > >>>> createBlockOutputStream java.net.SocketException: Too many open > > files > > > >>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: > > > >>>> Abandoning > > > >>>> block blk_2119727700857186236_5478498 > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong < > > > [email protected] > > > >>>> wrote: > > > >>>> > > > >>>> Will do. > > > >>>>> > > > >>>>> thanks > > > >>>>> zhenyu > > > >>>>> > > > >>>>> > > > >>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <[email protected]> wrote: > > > >>>>> > > > >>>>> Enable GC logging too on this next run (see hbase-env.sh). Lets > > try > > > >>>>>> > > > >>>>> and > > > >>> > > > >>>> get > > > >>>>>> to the bottom of whats going on. > > > >>>>>> Thanks, > > > >>>>>> St.Ack > > > >>>>>> > > > >>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong < > > > >>>>>> > > > >>>>> [email protected] > > > >>> > > > >>>> wrote: > > > >>>>>>> > > > >>>>>> > > > >>>>>> I can switch the boxes that run zookeeper with the ones that > run > > > >>>>>>> regionservers. > > > >>>>>>> I will see the results later. > > > >>>>>>> > > > >>>>>>> FYI. The node does have the 10 minutes > zookeeper.session.timeout > > > >>>>>>> > > > >>>>>> value > > > >>> > > > >>>> in > > > >>>>>> > > > >>>>>>> place. 
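Stack's "see hbase-env.sh" suggestion amounts to one line. A sketch using standard HotSpot flags of that era; the log path here is an example, not the shipped default:

```shell
# hbase-env.sh sketch: enable verbose GC logging (log path is an example).
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"
```

With this in place, any stop-the-world pause approaching the zk session timeout shows up directly in the GC log.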
> > > >>>>>>> > > > >>>>>>> thanks > > > >>>>>>> zhenyu > > > >>>>>>> > > > >>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <[email protected]> > wrote: > > > >>>>>>> > > > >>>>>>> On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong < > > > >>>>>>>> > > > >>>>>>> [email protected] > > > >>>>>> > > > >>>>>>> wrote: > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>>> In my cluster, half of the cluster have 2 disks 400GB each > per > > > >>>>>>>>> > > > >>>>>>>> machine, > > > >>>>>> > > > >>>>>>> and > > > >>>>>>>> > > > >>>>>>>>> half of the cluster have 6 disks per machine. Maybe we > should > > > >>>>>>>>> > > > >>>>>>>> run > > > >>> > > > >>>> zookeeper > > > >>>>>>>>> on the machines with 2 disks and RS on machines with 6 disks? > > > >>>>>>>>> > > > >>>>>>>>> That would make most sense only in the below, it looks like > the > > > >>>>>>>>> > > > >>>>>>>> RS > > > >>> > > > >>>> that > > > >>>>>> > > > >>>>>>> had > > > >>>>>>>> issue had 4 disks? > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> BTW, the 10 minutes zookeeper.session.timeout has been set > > during > > > >>>>>>>>> > > > >>>>>>>> the > > > >>>>>> > > > >>>>>>> experiment. > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> And for sure this node had it in place? > > > >>>>>>>> St.Ack > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>>> thanks > > > >>>>>>>>> zhenyu > > > >>>>>>>>> > > > >>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <[email protected]> > > wrote: > > > >>>>>>>>> > > > >>>>>>>>> On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong < > > > >>>>>>>>>> > > > >>>>>>>>> [email protected] > > > >>>>>>> > > > >>>>>>>> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> Though I experienced 2 regionserver disconnection this > > > >>>>>>>>>>> > > > >>>>>>>>>> morning, > > > >>> > > > >>>> it > > > >>>>>> > > > >>>>>>> looks > > > >>>>>>>>> > > > >>>>>>>>>> better from the regionserver log. 
(please see the following > > > >>>>>>>>>>> > > > >>>>>>>>>> log) > > > >>> > > > >>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> http://pastebin.com/m496dbfae > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> I checked diskIO for one of the > regionserver(192.168.100.116) > > > >>>>>>>>>>> > > > >>>>>>>>>> during > > > >>>>>>> > > > >>>>>>>> the > > > >>>>>>>>> > > > >>>>>>>>>> time it disconnected. > > > >>>>>>>>>>> > > > >>>>>>>>>>> Time: 03:04:58 AM > > > >>>>>>>>>>> Device: tps Blk_read/s Blk_wrtn/s Blk_read > > > >>>>>>>>>>> > > > >>>>>>>>>> Blk_wrtn > > > >>>>>>>> > > > >>>>>>>>> sda 105.31 5458.83 19503.64 9043873239 > > > >>>>>>>>>>> > > > >>>>>>>>>> 32312473676 > > > >>>>>>>> > > > >>>>>>>>> sda1 2.90 6.64 99.25 10993934 > > > >>>>>>>>>>> > > > >>>>>>>>>> 164433464 > > > >>>>>>>> > > > >>>>>>>>> sda2 1.72 23.76 30.25 39365817 > > > >>>>>>>>>>> > > > >>>>>>>>>> 50124008 > > > >>>>>>>> > > > >>>>>>>>> sda3 0.30 0.38 3.58 624930 > > > >>>>>>>>>>> > > > >>>>>>>>>> 5923000 > > > >>>>>>>> > > > >>>>>>>>> sda4 100.39 5428.06 19370.56 8992888270 > > > >>>>>>>>>>> > > > >>>>>>>>>> 32091993204 > > > >>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> Is this high for you? 20k blocks/second would seem to be > > high > > > >>>>>>>>>> > > > >>>>>>>>> but > > > >>> > > > >>>> its > > > >>>>>>> > > > >>>>>>>> one > > > >>>>>>>>> > > > >>>>>>>>>> disk only and its not being shared by zk anymore so > shouldn't > > > >>>>>>>>>> > > > >>>>>>>>> matter? > > > >>>>>> > > > >>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> I also checked the zookeeper quorum server that the > > > >>>>>>>>>>> > > > >>>>>>>>>> regionserver > > > >>> > > > >>>> tried > > > >>>>>>>> > > > >>>>>>>>> to > > > >>>>>>>>> > > > >>>>>>>>>> connect according to the log. 
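For scale, iostat's Blk_wrtn/s is reported in 512-byte sectors, so sda's ~19.5k figure above works out to roughly 9.5 MB/s of writes:

```shell
# Convert the iostat figure (512-byte sectors per second) to MB/s.
awk 'BEGIN { printf "%.1f MB/s\n", 19503.64 * 512 / (1024 * 1024) }'
# prints: 9.5 MB/s
```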
>>>>>>>>>>>>>> However, I don't see 192.168.100.116 in the client list of
>>>>>>>>>>>>>> the zookeeper quorum server that the regionserver tried to
>>>>>>>>>>>>>> connect to.
>>>>>>>>>>>>>> Would that be a problem?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is that because the ephemeral node for the regionserver had
>>>>>>>>>>>>> evaporated? Lost its lease w/ zk by the time you went to
>>>>>>>>>>>>> look?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thu Nov 12 15:04:52 UTC 2009
>>>>>>>>>>>>>> Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
>>>>>>>>>>>>>> Clients:
>>>>>>>>>>>>>>  /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
>>>>>>>>>>>>>>  /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
>>>>>>>>>>>>>>  /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
>>>>>>>>>>>>>>  /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
>>>>>>>>>>>>>>  /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
>>>>>>>>>>>>>>  /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
>>>>>>>>>>>>>>  /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
>>>>>>>>>>>>>>  /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Latency min/avg/max: 0/17/6333
>>>>>>>>>>>>>> Received: 47111
>>>>>>>>>>>>>> Sent: 0
>>>>>>>>>>>>>> Outstanding: 0
>>>>>>>>>>>>>> Zxid: 0x77000083f4
>>>>>>>>>>>>>> Mode: leader
>>>>>>>>>>>>>> Node count: 23
>>>>>>>>>>>>>
>>>>>>>>>>>>> That 6 second maximum latency is pretty bad but should be
>>>>>>>>>>>>> well within the zk session timeout.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So the problem is likely on the zk client side of the
>>>>>>>>>>>>> session, i.e. in the RS. You could enable GC logging as J-D
>>>>>>>>>>>>> suggested to see if you have any big pauses, pauses > zk
>>>>>>>>>>>>> session timeout.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When the RS went down, it didn't look too heavily loaded:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. 2009-11-12 15:04:52,830 INFO
>>>>>>>>>>>>>    org.apache.hadoop.hbase.regionserver.HRegionServer: Dump
>>>>>>>>>>>>>    of metrics: request=1.5166667, regions=322, stores=657,
>>>>>>>>>>>>>    storefiles=631, storefileIndexSize=61, memstoreSize=1472,
>>>>>>>>>>>>>    usedHeap=2819, maxHeap=4079, blockCacheSize=658110960,
>>>>>>>>>>>>>    blockCacheFree=197395984, blockCacheCount=9903,
>>>>>>>>>>>>>    blockCacheHitRatio=99
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's serving a few reads? The number of store files seems
>>>>>>>>>>>>> fine. Not too much memory used.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking at the logs, I see the Lease Still Held exception.
>>>>>>>>>>>>> This happens when the RS does its regular report to the
>>>>>>>>>>>>> master but the master thinks the RS has since restarted.
>>>>>>>>>>>>> It'll think this probably because it noticed that the RS's
>>>>>>>>>>>>> znode in zk had gone away and it considered the RS dead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking too at your logs I see this gap in the zk pinging:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. 2009-11-12 15:03:39,325 DEBUG
>>>>>>>>>>>>>    org.apache.zookeeper.ClientCnxn: Got ping response for
>>>>>>>>>>>>>    sessionid:0x224e55436ad0004 after 0ms
>>>>>>>>>>>>> 2. 2009-11-12 15:03:43,113 DEBUG
>>>>>>>>>>>>>    org.apache.zookeeper.ClientCnxn: Got ping response for
>>>>>>>>>>>>>    sessionid:0x24e55436a0007d after 0ms
>>>>>>>>>>>>>
>>>>>>>>>>>>> Where in the lines above it, it's reporting about every ten
>>>>>>>>>>>>> seconds; here there is a big gap.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have ganglia or something that will let you look more
>>>>>>>>>>>>> into what was happening on this machine around this time? Is
>>>>>>>>>>>>> the machine OK? It looks lightly loaded and you have your
>>>>>>>>>>>>> cluster nicely laid out. Something odd is going on. What
>>>>>>>>>>>>> about things like the write speed to disk?
>>>>>>>>>>>>> In the past strange issues have been explained by an
>>>>>>>>>>>>> incorrectly set BIOS which made disks run at 1/100th of their
>>>>>>>>>>>>> proper speed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>>>>> Stack,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I really appreciate your comments.
>>>>>>>>>>>>>>> I will run the zookeeper monitoring script on my cluster
>>>>>>>>>>>>>>> overnight to see the result.
>>>>>>>>>>>>>>> I will follow up on the thread when I get anything.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM, stack <[email protected]> wrote:
>>>>>>>>>>>>>>>> I see these in your log too:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. 2009-11-11 04:27:19,018 DEBUG
>>>>>>>>>>>>>>>>    org.apache.zookeeper.ClientCnxn: Got ping response for
>>>>>>>>>>>>>>>>    sessionid:0x424dfd908c50009 after 4544ms
>>>>>>>>>>>>>>>> 2. 2009-11-11 04:27:19,018 DEBUG
>>>>>>>>>>>>>>>>    org.apache.zookeeper.ClientCnxn: Got ping response for
>>>>>>>>>>>>>>>>    sessionid:0x24dfd90c810002 after 4548ms
>>>>>>>>>>>>>>>> 3. 2009-11-11 04:27:43,960 DEBUG
>>>>>>>>>>>>>>>>    org.apache.zookeeper.ClientCnxn: Got ping response for
>>>>>>>>>>>>>>>>    sessionid:0x424dfd908c50009 after 9030ms
>>>>>>>>>>>>>>>> 4. 2009-11-11 04:27:43,960 DEBUG
>>>>>>>>>>>>>>>>    org.apache.zookeeper.
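Symptoms like the ones in these logs (long "after NNNNms" ping latencies and gaps between successive ping lines) can be scanned for mechanically. A rough awk sketch; it assumes the standard log4j timestamp format and same-day entries, and the two sample lines are hypothetical, substituting a real regionserver log:

```shell
# Flag pauses longer than a threshold (seconds) between successive
# "Got ping response" lines. The sample lines below are made up for
# illustration; feed a real regionserver log instead.
awk -v max=30 '/Got ping response/ {
    split($2, t, /[:,]/)                  # HH:MM:SS,mmm -> fields
    now = t[1] * 3600 + t[2] * 60 + t[3]
    if (prev != "" && now - prev > max)
        printf "gap of %ds before %s %s\n", now - prev, $1, $2
    prev = now
}' <<'EOF'
2009-11-12 15:03:00,120 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x224e55436ad0004 after 0ms
2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24e55436a0007d after 0ms
EOF
# prints: gap of 43s before 2009-11-12 15:03:43,113
```

Any gap it reports that approaches the zk session timeout is a candidate for a GC pause or a stalled machine.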
