On Mon, Nov 16, 2009 at 12:05 PM, Zhenyu Zhong <[email protected]> wrote:
> I just realized that there was a MapReduce job running during the time the
> regionserver disconnected from the zookeeper.
> That MapReduce job was processing 500GB of data and took about 8 minutes to
> finish. It launched over 2000 map tasks.

There was a tasktracker running on the RegionServer that disconnected? Is
the map i/o or cpu heavy? Do you think it could have stolen life from the
datanode/regionserver?

> I doubt that this introduced resource contention between the DataNode and
> the RegionServer node.
> Also, during the time that MapReduce job ran, I saw a few errors such as:
>
> java.io.IOException: Could not obtain block:
> blk_7661247556283580226_4450462 file=/data/xxx.txt.200910141159
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1787)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1615)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1742)
>         at java.io.DataInputStream.read(DataInputStream.java:83)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)

This is interesting. Your map task is running into hdfs-127. Yeah, you
should patch your cluster if you want to get rid of these (hdfs-127 has
been applied to 0.21 hadoop and will be in the next hadoop release on the
0.20.x branch, hadoop 0.20.2).

> Possibly it is related to HDFS-127, but I don't see any datanodes going
> down. Does that affect the Regionserver? Shall we apply the patch?

Read up on hdfs-127 for an explanation of what's going on.

> Now I have started tracking the virtual memory stats to see if the same
> issue happens tomorrow around the same time period.

Excellent. Thanks for doing this.
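If it helps, a crude way to capture that for the window when the job runs (this assumes sysstat is installed; paths and intervals are only examples):

  # run on the RegionServer/DataNode host while the MR job is active
  nohup vmstat 5 > /tmp/rs-vmstat.$(hostname).log 2>&1 &
  nohup iostat -t -x 5 > /tmp/rs-iostat.$(hostname).log 2>&1 &

The -t on iostat gives you timestamps to line up against the regionserver log, so you can see whether the job is starving the RS/DN of cpu or disk at the moment the session drops.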
> And I highly suspect that this particular MapReduce job hurts the HBase
> Regionserver.
>
> PS. I also used gcviewer to parse the GC log; I only see pauses of around
> 30 seconds at maximum.

Did you figure out why your ten-minute session timeout is not applying?
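One thing worth checking there, though this is only a guess on my part: the ZooKeeper server will not grant a session longer than its ceiling, which I believe is hard-wired to 20 x tickTime on 3.2.x, so a ten-minute zookeeper.session.timeout only sticks if the quorum's tickTime allows it. Roughly, and only as an illustration:

  <!-- hbase-site.xml -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>600000</value>
  </property>
  <property>
    <!-- only takes effect if HBase manages the quorum; for a standalone
         quorum, raise tickTime in zoo.cfg instead -->
    <name>hbase.zookeeper.property.tickTime</name>
    <value>30000</value>
  </property>

With a 3000ms tickTime the server caps the negotiated session at 60 seconds no matter what the client asks for, which would match the 60-second expiries you have been seeing.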
St.Ack

> Thanks a lot.
> zhenyu
>
> On Sun, Nov 15, 2009 at 12:52 AM, Zhenyu Zhong <[email protected]> wrote:
>
>> J-D,
>>
>> Thank you very much for your comments.
>> My company blocks the IRC port, so I have trouble connecting to the IRC
>> channel. I have been trying to get IT to open the IRC port for me; it
>> might take a while.
>>
>> Best,
>> zhenyu
>>
>> On Sat, Nov 14, 2009 at 2:21 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>
>>> The error you are getting is a disconnection from a zookeeper server
>>> and is very generic.
>>>
>>> ZK-86 is still open and the last comment refers to ZK-111, saying
>>> that the bug (in unit tests) was probably fixed in release 3.0.0 last
>>> year.
>>>
>>> To figure out the hang you have, you can try to jstack the process pid
>>> and see exactly what's holding the RS from shutting down.
>>>
>>> Would it be possible for you to drop by the IRC channel? This way we
>>> can debug this at a much faster pace.
>>>
>>> Thx!
>>>
>>> J-D
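On the jstack suggestion above, something along these lines is usually enough (the pid lookup via jps is just one way to find it; jstack needs a JDK, not just a JRE):

  RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
  jstack $RS_PID > /tmp/rs-jstack.$(date +%s).txt

Take two or three dumps a few seconds apart while the regionserver is stuck exiting; a thread that is blocked in the same place in every dump is your culprit.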
>>>
>>> On Sat, Nov 14, 2009 at 9:57 AM, Zhenyu Zhong <[email protected]> wrote:
>>>
>>>> I found this:
>>>> http://issues.apache.org/jira/browse/ZOOKEEPER-86
>>>>
>>>> It looks like the same error I had. Is it a zookeeper bug? When will
>>>> HBase take zookeeper version 3.3.0?
>>>>
>>>> thanks
>>>> zhenyu
>>>>
>>>> On Sat, Nov 14, 2009 at 11:31 AM, Zhenyu Zhong <[email protected]> wrote:
>>>>
>>>>> Now I really doubt the zookeeper. From the log I saw errors like
>>>>> IOException Read Error while the zookeeper client (the regionserver)
>>>>> tried to read, and it got a disconnected status from zookeeper.
>>>>>
>>>>> I don't see any load on any of the zookeeper quorum servers. DiskIO is fine.
>>>>>
>>>>> Also, when the regionserver decides to exit because of the disconnected
>>>>> status from zookeeper, sometimes the regionserver hangs while exiting.
>>>>> We can still see the HRegionServer process even though we no longer see
>>>>> it in the Master web interface.
>>>>>
>>>>> I also notice that there are zookeeper.retries and zookeeper.pauses
>>>>> settings in hbase-default.xml; would those matter?
>>>>>
>>>>> thanks
>>>>>
>>>>> You can see the IOException in the following log from when the
>>>>> Regionserver lost its connection to the zookeeper:
>>>>>
>>>>> 2009-11-14 15:58:31,993 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258179106921, entries=1062964, calcsize=255014936, filesize=139566093. New hlog /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258214311993
>>>>> 2009-11-14 15:58:39,295 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
>>>>> 2009-11-14 15:58:52,648 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
>>>>> 2009-11-14 15:59:06,007 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
>>>>> 2009-11-14 15:59:19,365 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
>>>>> 2009-11-14 15:59:32,867 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 186ms
>>>>> 2009-11-14 15:59:46,070 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
>>>>> 2009-11-14 15:59:59,423 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
>>>>> 2009-11-14 16:00:12,775 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
>>>>> 2009-11-14 16:00:28,141 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 2010ms
>>>>> 2009-11-14 16:01:10,378 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x424f0ed01ea00a6 to sun.nio.ch.selectionkeyi...@39fbb2d6
>>>>> java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
>>>>>         at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>>>>>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>>>>> 2009-11-14 16:01:10,478 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
>>>>> 2009-11-14 16:01:11,333 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
>>>>> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>>>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>>>>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>>>         at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>>>>>         at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>>>>>         at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>>>>>         at java.lang.Thread.run(Thread.java:619)
>>>>> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server superpyxis0001/192.168.100.115:2181
>>>>> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.100.122:59575 remote=superpyxis0001/192.168.100.115:2181]
>>>>> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
>>>>> 2009-11-14 16:01:11,433 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Expired, type: None, path: null
>>>>> 2009-11-14 16:01:11,433 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x424f0ed01ea00a6 to sun.nio.ch.selectionkeyi...@6b93d343
>>>>> java.io.IOException: Session Expired
>>>>>         at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>>>>>         at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>>>>>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>>>>> 2009-11-14 16:01:11,433 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
>>>>> 2009-11-14 16:01:13,513 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60021, call put([...@73deb204, [Lorg.apache.hadoop.hbase.client.Put;@2179600a) from 192.168.100.132:40728: error: java.io.IOException: Server not running, aborting
>>>>> java.io.IOException: Server not running, aborting
>>>>>
>>>>> On Sat, Nov 14, 2009 at 1:01 AM, Zhenyu Zhong <[email protected]> wrote:
>>>>>
>>>>>> Stack,
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> I found out why HBase doesn't take the file descriptor value I set
>>>>>> before: I started HBase as root instead of as the normal hadoop user,
>>>>>> and my system configures hadoop with a higher file descriptor limit
>>>>>> but leaves root at the default of 1024.
>>>>>>
>>>>>> Now my system has a clean start. However, it seems that the
>>>>>> zookeeper.session.timeout value doesn't take effect. I still see
>>>>>> pauses of around 60 seconds from the disconnected regionserver. I
>>>>>> really don't know why the regionserver times out after only 60
>>>>>> seconds instead of the 10 minutes I set for zookeeper.session.timeout.
>>>>>>
>>>>>> Is there any other timeout value that comes into play before the
>>>>>> actual session times out?
>>>>>>
>>>>>> zhenyu
>>>>>>
>>>>>> On Fri, Nov 13, 2009 at 7:08 PM, stack <[email protected]> wrote:
>>>>>>
>>>>>>> Ok. Lack of file descriptors manifests in all kinds of weird ways.
>>>>>>> Hopefully that's it. If not, let's keep digging.
>>>>>>> St.Ack
>>>>>>>
>>>>>>> On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>
>>>>>>>> Stack,
>>>>>>>>
>>>>>>>> You are right, the master started with ulimit -n 1024. It doesn't
>>>>>>>> take the system value.
>>>>>>>>
>>>>>>>> Regarding the too many open files, it looks like the same one J-D
>>>>>>>> put up. But I will get the Master started with the higher value
>>>>>>>> first and see if this kind of error goes away.
>>>>>>>>
>>>>>>>> thanks a lot!
>>>>>>>> zhenyu
>>>>>>>>
>>>>>>>> On Fri, Nov 13, 2009 at 6:02 PM, stack <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Does it say
>>>>>>>>>
>>>>>>>>>   ulimit -n 32768
>>>>>>>>>
>>>>>>>>> ...as the second line in your log file on start of the master?
>>>>>>>>>
>>>>>>>>> Is the stack trace that complains about too many open files the
>>>>>>>>> same as the one in the blog post J-D put up?
>>>>>>>>>
>>>>>>>>> St.Ack
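A quick way to see what limit the running daemons actually got, independent of what your shell says (assumes you can read /proc for those pids, i.e. run as the daemon user or root):

  for p in $(jps | egrep 'HMaster|HRegionServer' | awk '{print $1}'); do
    echo "pid $p:"; grep 'Max open files' /proc/$p/limits
  done

If that still shows 1024 for the master or a regionserver, the new limit isn't reaching the process, whatever limits.conf says.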
>>>>>>>>>
>>>>>>>>> On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> The file descriptor limit was set with fs.file-max=1578334, and in
>>>>>>>>>> limits.conf the value was set to 32768. So these are way higher
>>>>>>>>>> than the open descriptors of the running processes.
>>>>>>>>>>
>>>>>>>>>> thanks
>>>>>>>>>> zhenyu
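Worth remembering that limits.conf is applied per user at login, so the entry has to cover whichever account actually launches the daemons (and root, if they are ever started as root). Illustrative lines only:

  # /etc/security/limits.conf
  hadoop  soft  nofile  32768
  hadoop  hard  nofile  32768
  root    soft  nofile  32768
  root    hard  nofile  32768

fs.file-max is the system-wide ceiling; the per-process nofile limit is the one the DFSClient trips over.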
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 13, 2009 at 4:33 PM, Stack <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> You upped the ulimit file descriptors as per the getting started doc?
>>>>>>>>>>>
>>>>>>>>>>> On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot.
>>>>>>>>>>>>
>>>>>>>>>>>> Bad news is my kernel is still 2.6.26. But it was not a problem before.
>>>>>>>>>>>>
>>>>>>>>>>>> Very strange.
>>>>>>>>>>>>
>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Looks like
>>>>>>>>>>>>> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
>>>>>>>>>>>>>
>>>>>>>>>>>>> J-D
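For reference, that post is about the epoll limits added in 2.6.27, so it shouldn't bite a 2.6.26 kernel; on kernels that do have it you can check and raise it with something like this (parameter name as I remember it, so treat it as an assumption):

  cat /proc/sys/fs/epoll/max_user_instances 2>/dev/null || echo "no epoll limit on this kernel"
  # raise it if present, e.g.:
  # sysctl -w fs.epoll.max_user_instances=4096

Either way, on a 2.6.26 kernel the 'Too many open files' from the DFSClient below is more likely plain fd exhaustion than the epoll ceiling.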
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After I re-organized the cluster, the experiment ran into problems
>>>>>>>>>>>>>> faster than before.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Basically the change was to use the machines with fewer resources as
>>>>>>>>>>>>>> zookeeper quorum members and the machines with more resources as
>>>>>>>>>>>>>> regionservers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From the log, I still see the pause of around 1 minute.
>>>>>>>>>>>>>> I enabled GC logging, please see
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://pastebin.com/m1d4ce0f1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> for details. In general I don't see many pauses in the GC.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What is more interesting, 1 hour after the 1st regionserver
>>>>>>>>>>>>>> disconnected, the master log complained about too many open files.
>>>>>>>>>>>>>> This didn't happen before. I checked the system OS setup as well as
>>>>>>>>>>>>>> limits.conf. I also checked the running processes. The total open
>>>>>>>>>>>>>> files don't reach the limit. I am a bit confused.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please see the following master log.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 4658 row(s) of meta region {server: 192.168.100.128:60021, regionname: .META.,,1, startKey: <>} complete
>>>>>>>>>>>>>> 2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>>>>>>>>>>>>> 2009-11-13 20:06:07,677 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 3ms
>>>>>>>>>>>>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.100.123:50010
>>>>>>>>>>>>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2808245019291145247_5478039
>>>>>>>>>>>>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
>>>>>>>>>>>>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1074853606841896259_5478048
>>>>>>>>>>>>>> 2009-11-13 20:06:10,334 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 1ms
>>>>>>>>>>>>>> 2009-11-13 20:06:21,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 0ms
>>>>>>>>>>>>>> 2009-11-13 20:06:23,674 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 0ms
>>>>>>>>>>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.100.123:50010
>>>>>>>>>>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6642544517082142289_5478063
>>>>>>>>>>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>>>>>>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2057511041109796090_5478063
>>>>>>>>>>>>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>>>>>>>>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_8219260302213892894_5478064
>>>>>>>>>>>>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>>>>>>>>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1669205542853067709_5478235
>>>>>>>>>>>>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>>>>>>>>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_9128897691346270351_5478237
>>>>>>>>>>>>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>>>>>>>>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1116845144864123018_5478240
>>>>>>>>>>>>>> 2009-11-13 20:06:34,372 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 0ms
>>>>>>>>>>>>>> 2009-11-13 20:06:37,034 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 0ms
>>>>>>>>>>>>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Too many open files
>>>>>>>>>>>>>>         at sun.nio.ch.IOUtil.initPipe(Native Method)
>>>>>>>>>>>>>>         at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
>>>>>>>>>>>>>>         at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
>>>>>>>>>>>>>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
>>>>>>>>>>>>>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
>>>>>>>>>>>>>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>>>>>>>>>>>>>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>>>>>>>>>>>>>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>>>>>>>>>>>>>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>>>>>>>>>>>>>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>>>>>>>>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
>>>>>>>>>>>>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8148813491785406356_5478475 bad datanode[0] 192.168.100.123:50010
>>>>>>>>>>>>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8148813491785406356_5478475 in pipeline 192.168.100.123:50010, 192.168.100.134:50010, 192.168.100.122:50010: bad datanode 192.168.100.123:50010
>>>>>>>>>>>>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>>>>>>>>>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2119727700857186236_5478498
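Since the master is the one complaining, it is also worth counting what it actually has open at the moment the errors appear, not just the system-wide totals (run as the user that owns the daemons, or root):

  for p in $(jps | egrep 'HMaster|HRegionServer' | awk '{print $1}'); do
    echo "pid $p: $(ls /proc/$p/fd | wc -l) fds currently open"
  done

If that count is sitting just under 1024 when the DFSClient starts throwing 'Too many open files', the per-process limit is the whole story, whatever fs.file-max says.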
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Will do.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Enable GC logging too on this next run (see hbase-env.sh). Let's
>>>>>>>>>>>>>>>> try and get to the bottom of what's going on.
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> St.Ack
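For anyone following along, the GC logging bit is just a matter of adding (or uncommenting) something like this in hbase-env.sh and restarting the RS; the log path is only an example:

  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"

The date stamps matter; they are what let you line a long pause up against the ZooKeeper session expiry in the RS log.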
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can switch the boxes that run zookeeper with the ones that run
>>>>>>>>>>>>>>>>> regionservers. I will see the results later.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> FYI, the node does have the 10-minute zookeeper.session.timeout
>>>>>>>>>>>>>>>>> value in place.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In my cluster, half of the machines have 2 disks of 400GB each,
>>>>>>>>>>>>>>>>>>> and half have 6 disks per machine. Maybe we should run zookeeper
>>>>>>>>>>>>>>>>>>> on the machines with 2 disks and RS on the machines with 6 disks?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That would make the most sense, only in the below it looks like the
>>>>>>>>>>>>>>>>>> RS that had the issue had 4 disks?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> BTW, the 10-minute zookeeper.session.timeout has been set during
>>>>>>>>>>>>>>>>>>> the experiment.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And for sure this node had it in place?
>>>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Though I experienced 2 regionserver disconnections this morning,
>>>>>>>>>>>>>>>>>>>>> it looks better from the regionserver log (please see the
>>>>>>>>>>>>>>>>>>>>> following log):
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> http://pastebin.com/m496dbfae
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I checked diskIO for one of the regionservers (192.168.100.116)
>>>>>>>>>>>>>>>>>>>>> during the time it disconnected.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Time: 03:04:58 AM
>>>>>>>>>>>>>>>>>>>>> Device:   tps      Blk_read/s   Blk_wrtn/s   Blk_read     Blk_wrtn
>>>>>>>>>>>>>>>>>>>>> sda       105.31   5458.83      19503.64     9043873239   32312473676
>>>>>>>>>>>>>>>>>>>>> sda1      2.90     6.64         99.25        10993934     164433464
>>>>>>>>>>>>>>>>>>>>> sda2      1.72     23.76        30.25        39365817     50124008
>>>>>>>>>>>>>>>>>>>>> sda3      0.30     0.38         3.58         624930       5923000
>>>>>>>>>>>>>>>>>>>>> sda4      100.39   5428.06      19370.56     8992888270   32091993204
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Is this high for you? 20k blocks/second would seem to be high, but
>>>>>>>>>>>>>>>>>>>> it's one disk only and it's not being shared by zk anymore, so it
>>>>>>>>>>>>>>>>>>>> shouldn't matter?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I also checked the zookeeper quorum server that the regionserver
>>>>>>>>>>>>>>>>>>>>> tried to connect to, according to the log. However, I don't see
>>>>>>>>>>>>>>>>>>>>> 192.168.100.116 in the client list of that zookeeper quorum
>>>>>>>>>>>>>>>>>>>>> server.
>>>>>>>>>>>>>>>>>>>>> Would that be a problem?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Is that because the ephemeral node for the regionserver had
>>>>>>>>>>>>>>>>>>>> evaporated? Lost its lease w/ zk by the time you went to look?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thu Nov 12 15:04:52 UTC 2009
>>>>>>>>>>>>>>>>>>>>> Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
>>>>>>>>>>>>>>>>>>>>> Clients:
>>>>>>>>>>>>>>>>>>>>>  /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
>>>>>>>>>>>>>>>>>>>>>  /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
>>>>>>>>>>>>>>>>>>>>>  /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
>>>>>>>>>>>>>>>>>>>>>  /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
>>>>>>>>>>>>>>>>>>>>>  /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
>>>>>>>>>>>>>>>>>>>>>  /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
>>>>>>>>>>>>>>>>>>>>>  /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
>>>>>>>>>>>>>>>>>>>>>  /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Latency min/avg/max: 0/17/6333
>>>>>>>>>>>>>>>>>>>>> Received: 47111
>>>>>>>>>>>>>>>>>>>>> Sent: 0
>>>>>>>>>>>>>>>>>>>>> Outstanding: 0
>>>>>>>>>>>>>>>>>>>>> Zxid: 0x77000083f4
>>>>>>>>>>>>>>>>>>>>> Mode: leader
>>>>>>>>>>>>>>>>>>>>> Node count: 23
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> That 6 second maximum latency is pretty bad but should be well
>>>>>>>>>>>>>>>>>>>> within the zk session timeout.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So, the problem is likely on the zk client side of the session,
>>>>>>>>>>>>>>>>>>>> i.e. in the RS. You could enable GC logging as J-D suggested to see
>>>>>>>>>>>>>>>>>>>> if you have any big pauses, pauses > zk session timeout.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> When the RS went down, it didn't look too heavily loaded:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1. 2009-11-12 15:04:52,830 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=1.5166667, regions=322, stores=657, storefiles=631, storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, maxHeap=4079, blockCacheSize=658110960, blockCacheFree=197395984, blockCacheCount=9903, blockCacheHitRatio=99
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> It's serving a few reads? The number of store files seems fine. Not
>>>>>>>>>>>>>>>>>>>> too much memory used.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Looking at the logs, I see the Lease Still Held exception. This
>>>>>>>>>>>>>>>>>>>> happens when the RS does its regular report to the master but the
>>>>>>>>>>>>>>>>>>>> master thinks the RS has since restarted. It'll think this probably
>>>>>>>>>>>>>>>>>>>> because it noticed that the RS's znode in zk had gone away and it
>>>>>>>>>>>>>>>>>>>> considered the RS dead.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Looking too at your logs, I see this gap in the zk pinging:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1. 2009-11-12 15:03:39,325 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x224e55436ad0004 after 0ms
>>>>>>>>>>>>>>>>>>>> 2. 2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24e55436a0007d after 0ms
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Whereas in the lines above it, it's reporting about every ten
>>>>>>>>>>>>>>>>>>>> seconds; here there is a big gap.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Do you have ganglia or something that will let you look more into
>>>>>>>>>>>>>>>>>>>> what was happening on this machine around this time? Is the machine
>>>>>>>>>>>>>>>>>>>> OK? It looks lightly loaded and you have your cluster nicely laid
>>>>>>>>>>>>>>>>>>>> out. Something odd is going on. What about things like the write
>>>>>>>>>>>>>>>>>>>> speed to disk? In the past, strange issues have been explained by an
>>>>>>>>>>>>>>>>>>>> incorrectly set BIOS which made disks run at 1/100th of their proper
>>>>>>>>>>>>>>>>>>>> speed.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> St.Ack
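On the write-speed question, a quick-and-dirty sequential write test is enough to catch a disk running at a fraction of its proper speed (the path below is just an example on the data disk; it writes and then removes a 1GB file):

  dd if=/dev/zero of=/data/ddtest bs=1M count=1024 oflag=direct
  rm /data/ddtest

Anything wildly below what the drive should do (tens of MB/s for a single SATA disk of that era) would point at the BIOS/controller rather than at HBase or ZooKeeper.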
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Stack,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I really appreciate your comments.
>>>>>>>>>>>>>>>>>>>>>> I will use the zookeeper monitoring script on my cluster and let
>>>>>>>>>>>>>>>>>>>>>> it run overnight to see the result.
>>>>>>>>>>>>>>>>>>>>>> I will follow up on the thread when I get anything.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>>>>>>>>> zhenyu
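If the monitoring script isn't to hand, a bare-bones stand-in is just to poll the stat four-letter word on each quorum member overnight (hostnames below are placeholders; assumes nc is available):

  while true; do
    for h in zk1 zk2 zk3; do
      echo "== $h $(date -u)"
      echo stat | nc $h 2181
    done
    sleep 60
  done >> zk-stat.log

That gives you the same Latency/Received/Mode block pasted earlier in this thread, with a timestamp per sample.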
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM, stack <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I see these in your log too:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424dfd908c50009 after 4544ms
>>>>>>>>>>>>>>>>>>>>>>> 2. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24dfd90c810002 after 4548ms
>>>>>>>>>>>>>>>>>>>>>>> 3. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424dfd908c50009 after 9030ms
>>>>>>>>>>>>>>>>>>>>>>> 4. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.
