Stack,

You are right, the master started with ulimit -n 1024. It doesn't take the system value.

Regarding the "too many open files" errors, the stack trace looks like the same one J-D put up. But I will get the master to start with a higher value first and see if this kind of error goes away.

thanks a lot!
zhenyu
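
For reference, a quick way to confirm what limit a running master actually inherited, as opposed to what limits.conf says (the jps-based pid lookup and the "hbase" user name are assumptions about a typical install, not taken from this cluster):

    # Find the HMaster pid (assumes jps is on the PATH)
    MASTER_PID=$(jps | awk '/HMaster/ {print $1}')
    # The limit the running JVM actually has, regardless of limits.conf
    grep 'Max open files' /proc/$MASTER_PID/limits
    # The limit a fresh shell for the daemon user would get
    su - hbase -c 'ulimit -n'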

On Fri, Nov 13, 2009 at 6:02 PM, stack <[email protected]> wrote:

Does it say

  ulimit -n 32768

...as the second line in your log file on start of the master?

Is the stack trace that complains about too many open files the same as the one in the blog post J-D put up?

St.Ack


On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <[email protected]> wrote:

The ulimit file descriptors were set with fs.file-max = 1578334; also, in limits.conf the value was set to 32768. So these are way higher than the open descriptors for the running processes.

thanks
zhenyu


On Fri, Nov 13, 2009 at 4:33 PM, Stack <[email protected]> wrote:

You upped the ulimit file descriptors as per the getting started doc?


On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <[email protected]> wrote:

Thanks a lot.

Bad news is my kernel is still 2.6.26. But it was not a problem before. Very strange.

zhenyu


On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <[email protected]> wrote:

Looks like
http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/

J-D
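
A rough sketch of checking and raising the per-user epoll limit that post discusses, on kernels where it applies (the sysctl name and the value are assumptions about the knob the post refers to; it does not exist on older kernels such as the 2.6.26 mentioned above):

    # Show the per-user epoll instance limit, if the kernel has it
    sysctl fs.epoll.max_user_instances
    # Raise it for the running system (value is illustrative)
    sysctl -w fs.epoll.max_user_instances=4096
    # Persist it across reboots
    echo 'fs.epoll.max_user_instances = 4096' >> /etc/sysctl.conf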

On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <[email protected]> wrote:

Hi,

After I re-organized the cluster, the experiment ran into the problem faster than before. Basically the changes are to use machines with fewer resources as zookeeper quorum members and machines with more resources as regionservers.

From the log, I still see the pause of around 1 minute. I enabled GC logging, please see

http://pastebin.com/m1d4ce0f1

for details. In general I don't see many pauses in the GC.

What is more interesting, 1 hour after the 1st regionserver disconnected, the master log complained about too many open files. This didn't happen before. I checked the system OS setup as well as limits.conf. I also checked the running processes. The total open files don't reach the limit. I am a bit confused.

Please see the following master log.

2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 4658 row(s) of meta region {server: 192.168.100.128:60021, regionname: .META.,,1, startKey: <>} complete
2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2009-11-13 20:06:07,677 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 3ms
2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.100.123:50010
2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2808245019291145247_5478039
2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1074853606841896259_5478048
2009-11-13 20:06:10,334 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 1ms
2009-11-13 20:06:21,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 0ms
2009-11-13 20:06:23,674 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 0ms
2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.100.123:50010
2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6642544517082142289_5478063
2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2057511041109796090_5478063
2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_8219260302213892894_5478064
2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1669205542853067709_5478235
2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_9128897691346270351_5478237
2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1116845144864123018_5478240
2009-11-13 20:06:34,372 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 0ms
2009-11-13 20:06:37,034 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 0ms
2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Too many open files
        at sun.nio.ch.IOUtil.initPipe(Native Method)
        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)

2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8148813491785406356_5478475 bad datanode[0] 192.168.100.123:50010
2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8148813491785406356_5478475 in pipeline 192.168.100.123:50010, 192.168.100.134:50010, 192.168.100.122:50010: bad datanode 192.168.100.123:50010
2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2119727700857186236_5478498
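
A rough way to put a number on "the total open files don't reach the limit" for the master process itself, rather than system-wide (the pid placeholder is hypothetical):

    # Descriptors actually held by the master right now
    ls /proc/<master-pid>/fd | wc -l
    # lsof also lists mmap'd files and other entries, so it reads somewhat higher
    lsof -p <master-pid> | wc -l
    # System-wide allocated/free/max counts, to compare against fs.file-max
    cat /proc/sys/fs/file-nr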

On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <[email protected]> wrote:

Will do.

thanks
zhenyu


On Thu, Nov 12, 2009 at 3:33 PM, stack <[email protected]> wrote:

Enable GC logging too on this next run (see hbase-env.sh). Let's try and get to the bottom of what's going on.

Thanks,
St.Ack
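
A minimal sketch of what turning that on in hbase-env.sh might look like (the exact flags and log path are assumptions for a Sun JDK of that era, not taken from this cluster's config):

    # hbase-env.sh: append GC logging to the existing JVM options
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-gc.log"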

On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <[email protected]> wrote:

I can switch the boxes that run zookeeper with the ones that run regionservers. I will see the results later.

FYI, the node does have the 10-minute zookeeper.session.timeout value in place.

thanks
zhenyu
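
One way to double-check that the 10-minute timeout is in the config the regionserver actually reads (the conf path is a placeholder for wherever hbase-site.xml lives on this install; 600000 is 10 minutes in milliseconds):

    grep -A 2 'zookeeper.session.timeout' /path/to/hbase/conf/hbase-site.xml
    # Expect something like <value>600000</value> if the 10-minute timeout is set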

On Thu, Nov 12, 2009 at 3:21 PM, stack <[email protected]> wrote:

On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <[email protected]> wrote:
> In my cluster, half of the machines have 2 disks of 400GB each, and half have 6 disks per machine. Maybe we should run zookeeper on the machines with 2 disks and RS on the machines with 6 disks?

That would make the most sense, only in the below it looks like the RS that had the issue had 4 disks?

> BTW, the 10-minute zookeeper.session.timeout has been set during the experiment.

And for sure this node had it in place?

St.Ack

> thanks
> zhenyu


On Thu, Nov 12, 2009 at 2:08 PM, stack <[email protected]> wrote:

On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <[email protected]> wrote:
> Though I experienced 2 regionserver disconnections this morning, it looks better from the regionserver log. (please see the following log)
>
> http://pastebin.com/m496dbfae
>
> I checked disk IO for one of the regionservers (192.168.100.116) during the time it disconnected.
>
> Time: 03:04:58 AM
> Device:   tps      Blk_read/s  Blk_wrtn/s   Blk_read     Blk_wrtn
> sda       105.31   5458.83     19503.64     9043873239   32312473676
> sda1      2.90     6.64        99.25        10993934     164433464
> sda2      1.72     23.76       30.25        39365817     50124008
> sda3      0.30     0.38        3.58         624930       5923000
> sda4      100.39   5428.06     19370.56     8992888270   32091993204

Is this high for you? 20k blocks/second would seem to be high, but it's one disk only and it's not being shared by zk anymore, so it shouldn't matter?

> I also checked the zookeeper quorum server that the regionserver tried to connect to according to the log. However, I don't see 192.168.100.116 in the client list of that zookeeper quorum server. Would that be a problem?

Is that because the ephemeral node for the regionserver had evaporated? Lost its lease w/ zk by the time you went to look?

> Thu Nov 12 15:04:52 UTC 2009
> Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
> Clients:
>  /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
>  /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
>  /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
>  /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
>  /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
>  /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
>  /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
>  /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
>
> Latency min/avg/max: 0/17/6333
> Received: 47111
> Sent: 0
> Outstanding: 0
> Zxid: 0x77000083f4
> Mode: leader
> Node count: 23

That 6 second maximum latency is pretty bad but should be well within the zk session timeout.

So the problem is likely on the zk client side of the session, i.e. in the RS. You could enable GC logging as J-D suggested to see if you have any big pauses, pauses > zk session timeout.

When the RS went down, it didn't look too heavily loaded:

2009-11-12 15:04:52,830 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=1.5166667, regions=322, stores=657, storefiles=631, storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, maxHeap=4079, blockCacheSize=658110960, blockCacheFree=197395984, blockCacheCount=9903, blockCacheHitRatio=99

It's serving a few reads? The number of store files seems fine. Not too much memory used.

Looking at the logs, I see the Lease Still Held exception. This happens when the RS does its regular report to the master but the master thinks the RS has since restarted. It'll think this probably because it noticed that the RS's znode in zk had gone away and it considered the RS dead.

Looking too at your logs, I see this gap in the zk pinging:

2009-11-12 15:03:39,325 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x224e55436ad0004 after 0ms
2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24e55436a0007d after 0ms

Where in the lines above it, it's reporting about every ten seconds; here there is a big gap.

Do you have ganglia or something that will let you look more into what was happening on this machine around this time? Is the machine OK? It looks lightly loaded and you have your cluster nicely laid out. Something odd is going on. What about things like the write speed to disk? In the past, strange issues have been explained by an incorrectly set BIOS which made disks run at 1/100th of their proper speed.

St.Ack

> Best,
> zhenyu
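
For reference, the kind of quorum-side snapshot quoted above can be pulled with ZooKeeper's four-letter-word commands; a rough sketch (the host is a placeholder, and 2181 assumes the default client port):

    # Liveness check: should answer "imok"
    echo ruok | nc <quorum-host> 2181
    # Connected clients, latency min/avg/max, mode, node count
    echo stat | nc <quorum-host> 2181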

On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <[email protected]> wrote:

Stack,

I very much appreciate your comments. I will use the zookeeper monitoring script on my cluster and let it run overnight to see the result. I will follow up on the thread when I get anything.

thanks
zhenyu

On Wed, Nov 11, 2009 at 3:52 PM, stack <[email protected]> wrote:

I see these in your log too:

2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424dfd908c50009 after 4544ms
2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24dfd90c810002 after 4548ms
2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424dfd908c50009 after 9030ms
2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.