AFAIK there's no requirement on the kernel version; 2.6.27 is the only one that had that restriction, and it was gone in the next release. We do have a requirement for Java 6, per
http://hadoop.apache.org/hbase/docs/r0.20.1/api/overview-summary.html#overview_description

J-D
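For anyone running into the 2.6.27 epoll limit mentioned above: on kernels that expose it, the per-user cap can be inspected and raised via sysctl. A minimal sketch, assuming the fs.epoll.max_user_instances knob described in the blog post linked further down the thread:

    # Present only on kernels that enforce the per-user epoll limit (2.6.27-era)
    cat /proc/sys/fs/epoll/max_user_instances 2>/dev/null

    # Raise it for the running system (as root), then persist it across reboots
    sysctl -w fs.epoll.max_user_instances=4096
    echo "fs.epoll.max_user_instances = 4096" >> /etc/sysctl.conf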
On Fri, Nov 13, 2009 at 1:58 PM, Zhenyu Zhong <[email protected]> wrote:

Does HBase/Hadoop have to run on kernel 2.6.27 or JDK 1.6? It seems only that kernel provides epoll resource configuration.

This is the first time I have seen this error, and it showed up after I started using the machines with less resources for ZooKeeper. Probably I should change it back.

zhenyu

On Fri, Nov 13, 2009 at 4:37 PM, Zhenyu Zhong <[email protected]> wrote:

The file-descriptor limits are in place: fs.file-max is set to 1578334, and limits.conf sets the per-user value to 32768. Both are way higher than the number of descriptors actually open by the running processes.

thanks
zhenyu

On Fri, Nov 13, 2009 at 4:33 PM, Stack <[email protected]> wrote:

You upped the ulimit file descriptors as per the getting started doc?
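A quick way to double-check those limits from the shell and count what each daemon really holds open; a sketch only, and the process-name patterns are assumptions:

    # System-wide and per-user limits
    cat /proc/sys/fs/file-max                  # kernel-wide cap (fs.file-max)
    ulimit -n                                  # limit for the current shell/user
    grep nofile /etc/security/limits.conf      # persistent per-user settings

    # Descriptors actually open by the HBase/HDFS JVMs
    for pid in $(pgrep -f 'HMaster|HRegionServer|DataNode'); do
      echo "$pid: $(ls /proc/$pid/fd | wc -l) open fds"
    done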
On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <[email protected]> wrote:

Thanks a lot.

Bad news is my kernel is still 2.6.26, yet it was not a problem before. Very strange.

zhenyu

On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <[email protected]> wrote:

Looks like
http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/

J-D

On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <[email protected]> wrote:

Hi,

After I re-organized the cluster, the experiment ran into problems faster than before. Basically, the change was to use the machines with less resources as ZooKeeper quorum members and the machines with more resources as regionservers.

From the log I still see the pause of around 1 minute. I enabled GC logging; please see http://pastebin.com/m1d4ce0f1 for details. In general I don't see many pauses in the GC.

What is more interesting, 1 hour after the 1st regionserver disconnected, the master log started complaining about too many open files. This didn't happen before. I checked the OS setup as well as limits.conf, and I also checked the running processes; the total number of open files doesn't reach the limit, so I am a bit confused.

Please see the following master log:

2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 4658 row(s) of meta region {server: 192.168.100.128:60021, regionname: .META.,,1, startKey: <>} complete
2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2009-11-13 20:06:07,677 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 3ms
2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.100.123:50010
2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2808245019291145247_5478039
2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1074853606841896259_5478048
2009-11-13 20:06:10,334 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 1ms
2009-11-13 20:06:21,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 0ms
2009-11-13 20:06:23,674 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 0ms
2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.100.123:50010
2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6642544517082142289_5478063
2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2057511041109796090_5478063
2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_8219260302213892894_5478064
2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1669205542853067709_5478235
2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_9128897691346270351_5478237
2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1116845144864123018_5478240
2009-11-13 20:06:34,372 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 0ms
2009-11-13 20:06:37,034 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 0ms
2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Too many open files
        at sun.nio.ch.IOUtil.initPipe(Native Method)
        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8148813491785406356_5478475 bad datanode[0] 192.168.100.123:50010
2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8148813491785406356_5478475 in pipeline 192.168.100.123:50010, 192.168.100.134:50010, 192.168.100.122:50010: bad datanode 192.168.100.123:50010
2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2119727700857186236_5478498

On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <[email protected]> wrote:

Will do.

thanks
zhenyu

On Thu, Nov 12, 2009 at 3:33 PM, stack <[email protected]> wrote:

Enable GC logging too on this next run (see hbase-env.sh). Lets try and get to the bottom of whats going on.

Thanks,
St.Ack
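GC logging is usually switched on by appending the standard HotSpot flags to the JVM options in conf/hbase-env.sh; a minimal sketch, with the log path being an assumption:

    # conf/hbase-env.sh -- add GC logging to the HBase JVMs
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"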
On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <[email protected]> wrote:

I can switch the boxes that run ZooKeeper with the ones that run regionservers; I will see the results later.

FYI, the node does have the 10-minute zookeeper.session.timeout value in place.

thanks
zhenyu

On Thu, Nov 12, 2009 at 3:21 PM, stack <[email protected]> wrote:

On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <[email protected]> wrote:
> In my cluster, half of the machines have 2 disks of 400GB each and half
> have 6 disks per machine. Maybe we should run ZooKeeper on the machines
> with 2 disks and the regionservers on the machines with 6 disks?

That would make the most sense, only in the log below it looks like the RS that had the issue had 4 disks?

> BTW, the 10-minute zookeeper.session.timeout has been set during the
> experiment.

And for sure this node had it in place?
St.Ack

> thanks
> zhenyu

On Thu, Nov 12, 2009 at 2:08 PM, stack <[email protected]> wrote:

On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <[email protected]> wrote:
> Though I experienced 2 regionserver disconnections this morning, it looks
> better from the regionserver log (please see the following log):
>
> http://pastebin.com/m496dbfae
>
> I checked disk I/O for one of the regionservers (192.168.100.116) during
> the time it disconnected.
>
> Time: 03:04:58 AM
> Device:    tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
> sda     105.31      5458.83     19503.64   9043873239  32312473676
> sda1      2.90         6.64        99.25     10993934    164433464
> sda2      1.72        23.76        30.25     39365817     50124008
> sda3      0.30         0.38         3.58       624930      5923000
> sda4    100.39      5428.06     19370.56   8992888270  32091993204

Is this high for you? 20k blocks/second would seem to be high, but its one disk only and its not being shared by ZK anymore, so it shouldn't matter?

> I also checked the ZooKeeper quorum server that the regionserver tried to
> connect to, according to the log. However, I don't see 192.168.100.116 in
> the client list of that quorum server. Would that be a problem?

Is that because the ephemeral node for the regionserver had evaporated? Lost its lease w/ zk by the time you went to look?
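The quorum status quoted below looks like the output of ZooKeeper's four-letter-word commands; one way to pull it from a quorum member, assuming the default client port 2181 and a hypothetical host name:

    # Dump server status (clients, latency, mode, node count) from a quorum member
    echo stat | nc zk-host.example.com 2181

    # Quick liveness probe; a healthy server answers "imok"
    echo ruok | nc zk-host.example.com 2181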
> Thu Nov 12 15:04:52 UTC 2009
> Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
> Clients:
>  /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
>  /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
>  /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
>  /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
>  /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
>  /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
>  /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
>  /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
>
> Latency min/avg/max: 0/17/6333
> Received: 47111
> Sent: 0
> Outstanding: 0
> Zxid: 0x77000083f4
> Mode: leader
> Node count: 23

That 6 second maximum latency is pretty bad but should be well within the zk session timeout.

So, problem is likely on the zk client side of the session; i.e. in the RS. You could enable GC logging as J-D suggested to see if you have any big pauses, pauses > zk session timeout.

When the RS went down, it didn't look too heavily loaded:

2009-11-12 15:04:52,830 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=1.5166667, regions=322, stores=657, storefiles=631, storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, maxHeap=4079, blockCacheSize=658110960, blockCacheFree=197395984, blockCacheCount=9903, blockCacheHitRatio=99

Its serving a few reads? The number of store files seems fine. Not too much memory used.

Looking at the logs, I see the Lease Still Held exception. This happens when the RS does its regular report to the master but the master thinks the RS has since restarted. It'll think this probably because it noticed that the RS's znode in zk had gone away and it considered the RS dead.

Looking too at your logs I see this gap in the zk pinging:

2009-11-12 15:03:39,325 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x224e55436ad0004 after 0ms
2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24e55436a0007d after 0ms
Where in the lines above it, it was reporting about every ten seconds; here there is a big gap.

Do you have Ganglia or something that will let you look more into what was happening on this machine around this time? Is the machine OK? It looks lightly loaded and you have your cluster nicely laid out. Something odd is going on. What about things like the write speed to disk? In the past, strange issues have been explained by an incorrectly set BIOS which made disks run at 1/100th of their proper speed.

St.Ack

> Best,
> zhenyu

On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <[email protected]> wrote:

Stack,

I very much appreciate your comments. I will use the ZooKeeper monitoring script on my cluster, let it run overnight, and see the result. I will follow up on this thread when I get anything.

thanks
zhenyu

On Wed, Nov 11, 2009 at 3:52 PM, stack <[email protected]> wrote:

I see these in your log too:

2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424dfd908c50009 after 4544ms
2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24dfd90c810002 after 4548ms
2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424dfd908c50009 after 9030ms
2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.
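On St.Ack's point above about a mis-set BIOS making disks crawl, a rough way to sanity-check raw disk throughput; a sketch only, and the target path /data1 and device /dev/sda are assumptions:

    # Sequential write of ~1 GB, synced so the page cache doesn't hide slow hardware
    dd if=/dev/zero of=/data1/dd-test.tmp bs=1M count=1024 conv=fdatasync
    rm /data1/dd-test.tmp

    # Cached vs. raw read speed for the underlying device (run as root)
    hdparm -tT /dev/sda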
