Re: regionserver disconnection

Zhenyu Zhong Mon, 16 Nov 2009 13:17:26 -0800

Here is the diskIO and CPU around the time we had RS disconnection on one
machine that runs RegionServer. It doesn't seem to be high. Similar disk and
cpu usage have been seen before and the HBase was running fine.



So far I haven't found why my 10 minutes session timeout doesn't apply.
Still digging.

Many thanks!

Time: 03:01:08 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.82    0.00    0.89    3.97    0.00   91.32

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              24.30      1605.39      6679.35 6238623388 25956251720
sda1              0.51         1.94        15.71    7547265   61067904
sda2              0.13         2.11         5.20    8202079   20189232
sda3              0.13         0.25         1.63     987846    6323336
sda4              0.00         0.00         0.00          4          0
sda5              1.78         0.06        24.14     220010   93817208
sda6              0.38         0.42         8.41    1630584   32688152
sda7             21.38      1600.61      6624.26 6220035272 25742165888
sdb               4.52       767.33       565.91 2981868690 2199132380
sdb1              4.52       767.33       565.91 2981866755 2199132372
sdc               4.42       742.95       563.00 2887151482 2187823092
sdc1              4.42       742.95       563.00 2887149547 2187823084
sdd               4.49       750.78       557.25 2917554074 2165513500
sdd1              4.49       750.78       557.25 2917552139 2165513492
sde               4.52       758.51       569.46 2947593394 2212964236
sde1              4.52       758.51       569.46 2947591459 2212964228
sdf               4.51       747.22       571.78 2903740266 2221972652
sdf1              4.51       747.22       571.78 2903738331 2221972644

Best,
zhenyu


On Mon, Nov 16, 2009 at 4:07 PM, stack <[email protected]> wrote:

> On Mon, Nov 16, 2009 at 12:05 PM, Zhenyu Zhong <[email protected]
> >wrote:
>
> > I just realized that there was a MapReduce job running during the time
> the
> > regionserver disconnected from the zookeeper.
> > That MapReduce Job was processing 500GB data and took about 8 minutes to
> > finish. It launched over 2000 map tasks.
>
>
> There was a tasktracker running on the RegionServer that disconnected?  Is
> the map i/o or cpu heavy?  Do you think it could have stole life from the
> datanode/regionserver?
>
>
>
> > I doubt that this introduced
> > resource contention between DataNode and RegionServer Node.
> > Also during the time that MapReduce job ran, I saw a few errors
> indicating
> > that
> >
> > java.io.IOException: Could not obtain block:
> > blk_7661247556283580226_4450462 file=/data/xxx.txt.200910141159
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1787)
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1615)
> >        at
> > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1742)
> >        at java.io.DataInputStream.read(DataInputStream.java:83)
> >        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> >        at
> > org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
> >        at
> > org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
> >        at
> >
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> >        at
> >
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> >        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >
>
> This is interesting.  Your map task is running into hdfs-127.  Yeah, you
> should patch your cluster if you want to get rid of these (hdfs-127 has
> been
> applied to 0.21 hadoop and will be in the next hadoop release on 0.20.x
> branch, hadoop 0.20.2).
>
>
>
> >
> > Possibly it is related to HDFS-127, but I don't see any datanodes went
> > down. Does that affect Regionserver? Shall we apply the patch?
> >
> >
> Read up on hdfs-127 for explaination of whats going on.
>
>
>
> > Now I start to keep tracking of the virtual memory stats to see if the
> same
> > issue happens tomorrow around the same time period.
> >
>
> Excellent.  Thanks for doing this.
>
>
>
> > And I highly suspect that this particular MapReduce job hurts HBase
> > Regionserver.
> >
> > PS. I also use gcviewer to parse the GC-log, I only see around 30 seconds
> > pauses maximum.
> >
>
>
> Did you figure why your ten minute session is not applying?
>
> St.Ack
>
>
>
> >
> > Thanks a lot.
> > zhenyu
> >
> >
> >
> > On Sun, Nov 15, 2009 at 12:52 AM, Zhenyu Zhong <[email protected]
> > >wrote:
> >
> > > J-D,
> > >
> > > Thank you very much for your comments.
> > > My company block the IRC port, so I have trouble to connect to IRC
> > channel.
> > > I have been trying to ask the IT to open the IRC port for me, it might
> > take
> > > a while.
> > >
> > > Best,
> > > zhenyu
> > >
> > >
> > > On Sat, Nov 14, 2009 at 2:21 PM, Jean-Daniel Cryans <
> [email protected]
> > >wrote:
> > >
> > >> The error you are getting is a disconnection from a zookeeper server
> > >> and is very generic.
> > >>
> > >> ZK-86 is still opened and the last comment refers to ZK-111 saying
> > >> that the bug (in unit tests) was probably fixed in release 3.0.0 last
> > >> year.
> > >>
> > >> To figure the hang you have, you can try to jstack the process pid and
> > >> see exactly what's holding the RS from shutting down.
> > >>
> > >> Would it be possible for you to drop by the IRC channel? This way we
> > >> can debug this a at much faster pace.
> > >>
> > >> Thx!
> > >>
> > >> J-D
> > >>
> > >> On Sat, Nov 14, 2009 at 9:57 AM, Zhenyu Zhong <
> [email protected]>
> > >> wrote:
> > >> > I found this.
> > >> > http://issues.apache.org/jira/browse/ZOOKEEPER-86
> > >> >
> > >> > It looks like the same error I had. Is it a zookeeper bug? When will
> > >> HBase
> > >> > take the zookeeper version 3.3.0?
> > >> >
> > >> > thanks
> > >> > zhenyu
> > >> >
> > >> > On Sat, Nov 14, 2009 at 11:31 AM, Zhenyu Zhong <
> > [email protected]
> > >> >wrote:
> > >> >
> > >> >> Now I really doubt about the zookeeper, from the log I saw errors
> > like
> > >> >> IOException Read Error, while the zookeeper client (regionserver )
> > >> tried to
> > >> >> read. But it got disconneted status from zookeeper.
> > >> >>
> > >> >> I don't see any load on any zookeeper quorum servers. DiskIO is
> fine.
> > >> >>
> > >> >> Also when the regionserver decides to exit due to the disconnect
> > status
> > >> >> from zookeeper, sometimes the regionserver hangs during the
> exiting.
> > We
> > >> can
> > >> >> still see the HRegionServer process even we don't see it in the
> > Master
> > >> web
> > >> >> interface.
> > >> >>
> > >> >> I also notice that there is an zookeeper.retries and
> zookeeper.pauses
> > >> >> settings in hbase-default.xml, would that matter ?
> > >> >>
> > >> >> thanks
> > >> >>
> > >> >> You may see the IOException error in the following log when the
> > >> >> Regionserver lost connection to the zookeeper.
> > >> >>
> > >> >> 2009-11-14 15:58:31,993 INFO
> > org.apache.hadoop.hbase.regionserver.HLog:
> > >> >> Roll
> > >> >>
> > >>
> >
> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258179106921,
> > >> >> entries=1062964, calcsize=255014936, filesize=139566093. New hlog
> > >> >>
> > >>
> >
> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258214311993
> > >> >> 2009-11-14 15:58:39,295 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> 2009-11-14 15:58:52,648 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> 2009-11-14 15:59:06,007 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> 2009-11-14 15:59:19,365 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> 2009-11-14 15:59:32,867 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x424f0ed01ea00a6 after 186ms
> > >> >> 2009-11-14 15:59:46,070 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> 2009-11-14 15:59:59,423 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> 2009-11-14 16:00:12,775 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> 2009-11-14 16:00:28,141 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x424f0ed01ea00a6 after 2010ms
> > >> >> 2009-11-14 16:01:10,378 WARN org.apache.zookeeper.ClientCnxn:
> > Exception
> > >> >> closing session 0x424f0ed01ea00a6 to
> > >> sun.nio.ch.selectionkeyi...@39fbb2d6
> > >> >>
> > >> >> java.io.IOException: Read error rc = -1
> > java.nio.DirectByteBuffer[pos=0
> > >> >> lim=4 cap=4]
> > >> >>         at
> > >> >>
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> > >> >>         at
> > >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > >> >> 2009-11-14 16:01:10,478 INFO
> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> > >> event,
> > >> >> state: Disconnected, type: None, path: null
> > >> >> 2009-11-14 16:01:11,333 WARN
> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> > >> >>
> > >> >> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> > >> >>         at
> > >> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > >> >> Method)
> > >> >>         at
> > >> >>
> > >>
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > >> >>         at
> > >> >>
> > >>
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > >> >>         at
> > >> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
> > >> >>         at java.lang.Thread.run(Thread.java:619)
> > >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn:
> > >> Attempting
> > >> >> connection to server superpyxis0001/192.168.100.115:2181
> > >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn:
> Priming
> > >> >> connection to java.nio.channels.SocketChannel[connected local=/
> > >> >> 192.168.100.122:59575 remote=superpyxis0001/192.168.100.115:2181]
> > >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn:
> Server
> > >> >> connection successful
> > >> >> 2009-11-14 16:01:11,433 INFO
> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> > >> event,
> > >> >> state: Expired, type: None, path: null
> > >> >> 2009-11-14 16:01:11,433 WARN org.apache.zookeeper.ClientCnxn:
> > Exception
> > >> >> closing session 0x424f0ed01ea00a6 to
> > >> sun.nio.ch.selectionkeyi...@6b93d343
> > >> >> java.io.IOException: Session Expired
> > >> >>         at
> > >> >>
> > >>
> >
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
> > >> >>         at
> > >> >>
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
> > >> >>
> > >> >>         at
> > >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > >> >> 2009-11-14 16:01:11,433 ERROR
> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper
> session
> > >> >> expired
> > >> >> 2009-11-14 16:01:13,513 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > >> Server
> > >> >> handler 3 on 60021, call put([...@73deb204,
> > >> >> [Lorg.apache.hadoop.hbase.client.Put;@2179600a) from
> > >> 192.168.100.132:40728:
> > >> >> error: java.io.IOException: Server not running, aborting
> > >> >> java.io.IOException: Server not running, aborting
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >> On Sat, Nov 14, 2009 at 1:01 AM, Zhenyu Zhong <
> > [email protected]
> > >> >wrote:
> > >> >>
> > >> >>> Stack,
> > >> >>>
> > >> >>> Thanks a lot!
> > >> >>>
> > >> >>> I found out that the reason why HBase doesn't take the system file
> > >> >>> descriptor value I set before. I started the HBase using root
> > instead
> > >> of the
> > >> >>> normal hadoop user, while my system configures hadoop with higher
> > file
> > >> >>> descriptor value but configures root with 1024 default value.
> > >> >>>
> > >> >>>
> > >> >>> Now my system has a clean start. However, it seems that the
> > >> >>> zookeeper.session.timeout value doesn't take into effect. I still
> > >> found
> > >> >>> around 60 seconds pauses from the disconnected regionserver. I
> > really
> > >> don't
> > >> >>> know why regionserver only times out after 60 seconds instead of
> 10
> > >> minutes
> > >> >>> which I set for zookeeper.session.timeout.
> > >> >>>
> > >> >>> Is there any other timeout value coming into play before the
> actual
> > >> >>> session times out?
> > >> >>>
> > >> >>> zhenyu
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>> On Fri, Nov 13, 2009 at 7:08 PM, stack <[email protected]> wrote:
> > >> >>>
> > >> >>>> Ok.  Lack of file descriptors manifests in all kinds of weird
> ways.
> > >> >>>> Hopefully thats it.  If not, lets keep digging.
> > >> >>>> St.Ack
> > >> >>>>
> > >> >>>> On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong <
> > >> [email protected]
> > >> >>>> >wrote:
> > >> >>>>
> > >> >>>> > Stack,
> > >> >>>> >
> > >> >>>> > You are right, the master started with ulimit -n 1024. It
> doesn't
> > >> take
> > >> >>>> the
> > >> >>>> > system value.
> > >> >>>> >
> > >> >>>> > Regarding to the too many open files, it looks like the same as
> > the
> > >> one
> > >> >>>> J-D
> > >> >>>> > put up. But I will get the Master start with higher value first
> > and
> > >> see
> > >> >>>> if
> > >> >>>> > this kind of error goes away.
> > >> >>>> >
> > >> >>>> > thanks a lot!
> > >> >>>> > zhenyu
> > >> >>>> >
> > >> >>>> > On Fri, Nov 13, 2009 at 6:02 PM, stack <[email protected]>
> wrote:
> > >> >>>> >
> > >> >>>> > > Does it say
> > >> >>>> > >
> > >> >>>> > > ulimit -n 32768
> > >> >>>> > >
> > >> >>>> > > ...as the second line in your log file on start of the
> master?
> > >> >>>> > >
> > >> >>>> > > Is the stack trace that complains about too many open files
> > same
> > >> as
> > >> >>>> the
> > >> >>>> > one
> > >> >>>> > > in the blog post J-D put up?
> > >> >>>> > >
> > >> >>>> > > St.Ack
> > >> >>>> > >
> > >> >>>> > >
> > >> >>>> > > On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <
> > >> >>>> [email protected]
> > >> >>>> > > >wrote:
> > >> >>>> > >
> > >> >>>> > > > The ulimit file descriptors was set to fs.file-max
> =1578334,
> > >> also
> > >> >>>> in
> > >> >>>> > the
> > >> >>>> > > > limits.conf the value was set to 32768.
> > >> >>>> > > > So these are way higher than the open descriptors for the
> > >> running
> > >> >>>> > > > processes.
> > >> >>>> > > >
> > >> >>>> > > > thanks
> > >> >>>> > > > zhenyu
> > >> >>>> > > >
> > >> >>>> > > >
> > >> >>>> > > > On Fri, Nov 13, 2009 at 4:33 PM, Stack <
> [email protected]>
> > >> >>>> wrote:
> > >> >>>> > > >
> > >> >>>> > > > > You upped the ulimit file descriptors as per the getting
> > >> started
> > >> >>>> doc?
> > >> >>>> > > > >
> > >> >>>> > > > >
> > >> >>>> > > > >
> > >> >>>> > > > > On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <
> > >> >>>> [email protected]>
> > >> >>>> > > > wrote:
> > >> >>>> > > > >
> > >> >>>> > > > >  Thanks a lot.
> > >> >>>> > > > >>
> > >> >>>> > > > >>
> > >> >>>> > > > >> Bad news is my kernel is still 2.6.26.
> > >> >>>> > > > >> But it was not a problem before.
> > >> >>>> > > > >>
> > >> >>>> > > > >> Very strange.
> > >> >>>> > > > >>
> > >> >>>> > > > >> zhenyu
> > >> >>>> > > > >>
> > >> >>>> > > > >> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <
> > >> >>>> > > > [email protected]
> > >> >>>> > > > >> >wrote:
> > >> >>>> > > > >>
> > >> >>>> > > > >>  Looks like
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>
> > >> >>>> > > >
> > >> >>>> > >
> > >> >>>> >
> > >> >>>>
> > >>
> >
> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
> > >> >>>> > > > >>>
> > >> >>>> > > > >>> J-D
> > >> >>>> > > > >>>
> > >> >>>> > > > >>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <
> > >> >>>> > > [email protected]
> > >> >>>> > > > >
> > >> >>>> > > > >>> wrote:
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> Hi,
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>> After I re-organize the cluster, the experiment ran
> into
> > >> >>>> problem
> > >> >>>> > > > faster
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> than
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> before.
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>> Basically the changes are to use machines with less
> > >> resources
> > >> >>>> as
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> zookeeper
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> quorums and machines with more resources as
> > regionserver.
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>> From the log, I still see the pause around 1 minute.
> > >> >>>> > > > >>>> I enabled the GC logging,  please see
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>> http://pastebin.com/m1d4ce0f1
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>> for details.
> > >> >>>> > > > >>>> In general I don't see many pauses in the GC.
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>> What is more interesting, 1 hour after the 1st
> > >> regionserver
> > >> >>>> > > > >>>> disconnected,
> > >> >>>> > > > >>>> the master log complained about too many open files.
> > This
> > >> >>>> didn't
> > >> >>>> > > > happen
> > >> >>>> > > > >>>> before.
> > >> >>>> > > > >>>> I checked the system OS setup as well as the
> > limits.conf.
> > >> I
> > >> >>>> also
> > >> >>>> > > > checked
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> the
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> running processes. The total open files don't reach
> the
> > >> limit.
> > >> >>>> I
> > >> >>>> > am
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> confused
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> a bit.
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>> Please see the following master log.
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> > >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
> > >> >>>> > > > >>>> RegionManager.metaScanner scan of 4658 row(s) of meta
> > >> region
> > >> >>>> > > {server:
> > >> >>>> > > > >>>> 192.168.100.128:60021, regionname: .META.,,1,
> startKey:
> > >> <>}
> > >> >>>> > > complete
> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> > >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> All
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> 1 .META. region(s) scanned
> > >> >>>> > > > >>>> 2009-11-13 20:06:07,677 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >>>> Got
> > >> >>>> > > > ping
> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after 3ms
> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > Exception
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> in
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> createBlockOutputStream java.io.IOException: Bad
> connect
> > >> ack
> > >> >>>> with
> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > >>>> Abandoning
> > >> >>>> > > > >>>> block blk_-2808245019291145247_5478039
> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > Exception
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> in
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> createBlockOutputStream java.io.EOFException
> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > >>>> Abandoning
> > >> >>>> > > > >>>> block blk_1074853606841896259_5478048
> > >> >>>> > > > >>>> 2009-11-13 20:06:10,334 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >>>> Got
> > >> >>>> > > > ping
> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after 1ms
> > >> >>>> > > > >>>> 2009-11-13 20:06:21,018 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >>>> Got
> > >> >>>> > > > ping
> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after 0ms
> > >> >>>> > > > >>>> 2009-11-13 20:06:23,674 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >>>> Got
> > >> >>>> > > > ping
> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after 0ms
> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > Exception
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> in
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> createBlockOutputStream java.io.IOException: Bad
> connect
> > >> ack
> > >> >>>> with
> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > >>>> Abandoning
> > >> >>>> > > > >>>> block blk_-6642544517082142289_5478063
> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > Exception
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> in
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> > many
> > >> >>>> open
> > >> >>>> > > files
> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > >>>> Abandoning
> > >> >>>> > > > >>>> block blk_2057511041109796090_5478063
> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > Exception
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> in
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> > many
> > >> >>>> open
> > >> >>>> > > files
> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > >>>> Abandoning
> > >> >>>> > > > >>>> block blk_8219260302213892894_5478064
> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > Exception
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> in
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> > many
> > >> >>>> open
> > >> >>>> > > files
> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > >>>> Abandoning
> > >> >>>> > > > >>>> block blk_1669205542853067709_5478235
> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > Exception
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> in
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> > many
> > >> >>>> open
> > >> >>>> > > files
> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > >>>> Abandoning
> > >> >>>> > > > >>>> block blk_9128897691346270351_5478237
> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > Exception
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> in
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> > many
> > >> >>>> open
> > >> >>>> > > files
> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > >>>> Abandoning
> > >> >>>> > > > >>>> block blk_1116845144864123018_5478240
> > >> >>>> > > > >>>> 2009-11-13 20:06:34,372 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >>>> Got
> > >> >>>> > > > ping
> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after 0ms
> > >> >>>> > > > >>>> 2009-11-13 20:06:37,034 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >>>> Got
> > >> >>>> > > > ping
> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after 0ms
> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> DataStreamer
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> Exception: java.io.IOException: Too many open files
> > >> >>>> > > > >>>>      at sun.nio.ch.IOUtil.initPipe(Native Method)
> > >> >>>> > > > >>>>      at
> > >> >>>> > >
> sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
> > >> >>>> > > > >>>>      at
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > >
> > >> >>>> > >
> > >> >>>> >
> > >> >>>>
> > >>
> >
> sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>      at
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > >
> > >> >>>> > >
> > >> >>>> >
> > >> >>>>
> > >>
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>      at
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > >
> > >> >>>> > >
> > >> >>>> >
> > >> >>>>
> > >>
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>      at
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > >
> > >> >>>> > >
> > >> >>>> >
> > >> >>>>
> > >>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>      at
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > >
> > >> >>>> > >
> > >> >>>> >
> > >> >>>>
> > >>
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>      at
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > >
> > >> >>>> > >
> > >> >>>> >
> > >> >>>>
> > >>
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>      at
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>
> > >> >>>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>      at
> > >> >>>> java.io.DataOutputStream.write(DataOutputStream.java:90)
> > >> >>>> > > > >>>>      at
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > >
> > >> >>>> > >
> > >> >>>> >
> > >> >>>>
> > >>
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > Error
> > >> >>>> > > > >>>> Recovery for block blk_8148813491785406356_5478475 bad
> > >> >>>> datanode[0]
> > >> >>>> > > > >>>> 192.168.100.123:50010
> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > Error
> > >> >>>> > > > >>>> Recovery for block blk_8148813491785406356_5478475 in
> > >> pipeline
> > >> >>>> > > > >>>> 192.168.100.123:50010, 192.168.100.134:50010,
> > >> >>>> > 192.168.100.122:50010
> > >> >>>> > > :
> > >> >>>> > > > >>>> bad
> > >> >>>> > > > >>>> datanode 192.168.100.123:50010
> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > Exception
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>> in
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> > many
> > >> >>>> open
> > >> >>>> > > files
> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >>>> > > > >>>> Abandoning
> > >> >>>> > > > >>>> block blk_2119727700857186236_5478498
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <
> > >> >>>> > > > [email protected]
> > >> >>>> > > > >>>> wrote:
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>  Will do.
> > >> >>>> > > > >>>>>
> > >> >>>> > > > >>>>> thanks
> > >> >>>> > > > >>>>> zhenyu
> > >> >>>> > > > >>>>>
> > >> >>>> > > > >>>>>
> > >> >>>> > > > >>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <
> > [email protected]
> > >> >
> > >> >>>> wrote:
> > >> >>>> > > > >>>>>
> > >> >>>> > > > >>>>>  Enable GC logging too on this next run (see
> > >> hbase-env.sh).
> > >> >>>>  Lets
> > >> >>>> > > try
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>> and
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> get
> > >> >>>> > > > >>>>>> to the bottom of whats going on.
> > >> >>>> > > > >>>>>> Thanks,
> > >> >>>> > > > >>>>>> St.Ack
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>> [email protected]
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> wrote:
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>  I can switch the boxes that run zookeeper with the
> > ones
> > >> >>>> that
> > >> >>>> > run
> > >> >>>> > > > >>>>>>> regionservers.
> > >> >>>> > > > >>>>>>> I will see the results later.
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>> FYI. The node does have the 10 minutes
> > >> >>>> > zookeeper.session.timeout
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>> value
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> in
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> place.
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>> thanks
> > >> >>>> > > > >>>>>>> zhenyu
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <
> > >> [email protected]>
> > >> >>>> > wrote:
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>  On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>> [email protected]
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> wrote:
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> In my cluster, half of the cluster have 2 disks
> > 400GB
> > >> >>>> each
> > >> >>>> > per
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>> machine,
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> and
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> half of the cluster have 6 disks per machine.
> >  Maybe
> > >> we
> > >> >>>> > should
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>> run
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> zookeeper
> > >> >>>> > > > >>>>>>>>> on the machines with 2 disks and RS on machines
> > with
> > >> 6
> > >> >>>> disks?
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>> That would make most sense only in the below, it
> > >> looks
> > >> >>>> like
> > >> >>>> > the
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>> RS
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> that
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> had
> > >> >>>> > > > >>>>>>>> issue had 4 disks?
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>  BTW, the 10 minutes zookeeper.session.timeout has
> > >> been
> > >> >>>> set
> > >> >>>> > > during
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>> the
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> experiment.
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>> And for sure this node had it in place?
> > >> >>>> > > > >>>>>>>> St.Ack
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> thanks
> > >> >>>> > > > >>>>>>>>> zhenyu
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <
> > >> [email protected]
> > >> >>>> >
> > >> >>>> > > wrote:
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>  On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> [email protected]
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> wrote:
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>  Though I experienced 2 regionserver
> disconnection
> > >> this
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> morning,
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> it
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> looks
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> better from the regionserver log. (please see
> the
> > >> >>>> following
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> log)
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>> http://pastebin.com/m496dbfae
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>> I checked diskIO for one of the
> > >> >>>> > regionserver(192.168.100.116)
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> during
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> the
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> time it disconnected.
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>> Time: 03:04:58 AM
> > >> >>>> > > > >>>>>>>>>>> Device:            tps   Blk_read/s
> Blk_wrtn/s
> > >> >>>> Blk_read
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> Blk_wrtn
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> sda             105.31      5458.83     19503.64
> > >> >>>> 9043873239
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> 32312473676
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> sda1              2.90         6.64        99.25
> > >> >>>> 10993934
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> 164433464
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> sda2              1.72        23.76        30.25
> > >> >>>> 39365817
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> 50124008
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> sda3              0.30         0.38         3.58
> > >> >>>> 624930
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> 5923000
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> sda4            100.39      5428.06     19370.56
> > >> >>>> 8992888270
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> 32091993204
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>  Is this high for you?  20k blocks/second would
> > >> seem to
> > >> >>>> be
> > >> >>>> > > high
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> but
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> its
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> one
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> disk only and its not being shared by zk anymore
> > so
> > >> >>>> > shouldn't
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> matter?
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>  I also checked the zookeeper quorum server that
> > the
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> regionserver
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> tried
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> to
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> connect according to the log. However, I don't
> see
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> 192.168.100.116
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> in
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> the
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> client list of the zookeeper quorum server that
> > the
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> regionserver
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> tried
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> to
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> connect.
> > >> >>>> > > > >>>>>>>>>>> Would that be a problem?
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>  Is that because the ephemeral node for the
> > >> >>>> regionserver
> > >> >>>> > had
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> evaporated?
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> Lost
> > >> >>>> > > > >>>>>>>>>> its lease w/ zk by the time you went to look?
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>  Thu Nov 12 15:04:52 UTC 2009
> > >> >>>> > > > >>>>>>>>>>> Zookeeper version: 3.2.1-808558, built on
> > >> 08/27/2009
> > >> >>>> 18:48
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> GMT
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> Clients:
> > >> >>>> > > > >>>>>>>>>>> /192.168.100.127:43045
> > >> [1](queued=0,recved=26,sent=0)
> > >> >>>> > > > >>>>>>>>>>> /192.168.100.131:39091
> > >> [1](queued=0,recved=964,sent=0)
> > >> >>>> > > > >>>>>>>>>>> /192.168.100.124:35961
> > >> [1](queued=0,recved=3266,sent=0)
> > >> >>>> > > > >>>>>>>>>>> /192.168.100.123:47935
> > >> [1](queued=0,recved=233,sent=0)
> > >> >>>> > > > >>>>>>>>>>> /192.168.100.125:46931
> > >> [1](queued=0,recved=2,sent=0)
> > >> >>>> > > > >>>>>>>>>>> /192.168.100.118:54924
> > >> [1](queued=0,recved=295,sent=0)
> > >> >>>> > > > >>>>>>>>>>> /192.168.100.118:41390
> > >> [1](queued=0,recved=2290,sent=0)
> > >> >>>> > > > >>>>>>>>>>> /192.168.100.136:42243
> > >> [1](queued=0,recved=0,sent=0)
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>> Latency min/avg/max: 0/17/6333
> > >> >>>> > > > >>>>>>>>>>> Received: 47111
> > >> >>>> > > > >>>>>>>>>>> Sent: 0
> > >> >>>> > > > >>>>>>>>>>> Outstanding: 0
> > >> >>>> > > > >>>>>>>>>>> Zxid: 0x77000083f4
> > >> >>>> > > > >>>>>>>>>>> Mode: leader
> > >> >>>> > > > >>>>>>>>>>> Node count: 23
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> That 6 second maximum latency is pretty bad but
> > >> should
> > >> >>>> be
> > >> >>>> > well
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> within
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> the
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> zk
> > >> >>>> > > > >>>>>>>>>> session timeout.
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> So, problem is likely on the zk client side of
> the
> > >> >>>> session;
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> i.e.
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> in
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> the
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> RS.
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> You could enable GC logging as J-D suggested to
> > see
> > >> if
> > >> >>>> you
> > >> >>>> > > have
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> any
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> big
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> pauses, pauses > zk session timeout.
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> When the RS went down, it didn't look too
> heavily
> > >> >>>> loaded:
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:04:52,830 INFO
> > >> >>>> > > > >>>>>>>>>>
> >  org.apache.hadoop.hbase.regionserver.HRegionServer:
> > >> >>>> Dump of
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> metrics:
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>>  request=1.5166667, regions=322, stores=657,
> > >> >>>> storefiles=631,
> > >> >>>> > > > >>>>>>>>>>  storefileIndexSize=61, memstoreSize=1472,
> > >> >>>> usedHeap=2819,
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> maxHeap=4079,
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>>  blockCacheSize=658110960,
> > blockCacheFree=197395984,
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> blockCacheCount=9903,
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>  blockCacheHitRatio=99
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> Its serving a few reads?  The number of store
> > files
> > >> >>>> seems
> > >> >>>> > > fine.
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> Not
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> too
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> much memory used.
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> Looking at the logs, I see the Lease Still Held
> > >> >>>> exception.
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> This
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> happens
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> when the RS does its regular report to the master
> > but
> > >> the
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> master
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> thinks
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> the
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> RS has since restarted.  It'll think this
> probably
> > >> >>>> because
> > >> >>>> > it
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> noticed
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> that
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> the RS's znode in zk had gone away and it
> > considered
> > >> the
> > >> >>>> RS
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> dead.
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>
> > >> >>>> > > > >>>>>>>>>> Looking too at your  logs I see this gap in the
> zk
> > >> >>>> pinging:
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:03:39,325 DEBUG
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> Got
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>>  ping response for sessionid:0x224e55436ad0004
> after
> > >> 0ms
> > >> >>>> > > > >>>>>>>>>>  2. 2009-11-12 15:03:43,113 DEBUG
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> Got
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>>  ping response for sessionid:0x24e55436a0007d
> after
> > >> 0ms
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> Where in the lines above it, its reporting about
> > >> every
> > >> >>>> ten
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> seconds,
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> here
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> there is a big gap.
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> Do you have ganglia or something that will let
> you
> > >> look
> > >> >>>> more
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> into
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> what
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> was
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> happening on this machine around this time?  Is
> > the
> > >> >>>> machine
> > >> >>>> > > OK?
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> It
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> looks
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> lightly loaded and you have your cluster nicely
> > laid
> > >> out.
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> Something
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> odd
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> is
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> going on.  What about things like the write
> speed
> > to
> > >> >>>> disk?
> > >> >>>> >  In
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> the
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> past
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> strange issues have been explained by incorrectly
> > set
> > >> BIOS
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>> which
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> made
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> disks
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> run at 1/100th of their proper speed.
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> St.Ack
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>> Best,
> > >> >>>> > > > >>>>>>>>>>> zhenyu
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> [email protected]
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> wrote:
> > >> >>>> > > > >>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>  Stack
> > >> >>>> > > > >>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>> I am very appreciated for your comments.
> > >> >>>> > > > >>>>>>>>>>>> I will use the zookeeper monitoring script on
> my
> > >> >>>> cluster
> > >> >>>> > > > >>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>> and
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>> let
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>> it
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>> run
> > >> >>>> > > > >>>>>>>>>
> > >> >>>> > > > >>>>>>>>>> overnight to see the result.
> > >> >>>> > > > >>>>>>>>>>>> I will follow up the thread when I get
> anything.
> > >> >>>> > > > >>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>> thanks
> > >> >>>> > > > >>>>>>>>>>>> zhenyu
> > >> >>>> > > > >>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM, stack <
> > >> >>>> [email protected]>
> > >> >>>> > > > >>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>> wrote:
> > >> >>>> > > > >>>>>>
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > > >>>>>>>>>>>>  I see these in your log too:
> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>>>  1. 2009-11-11 04:27:19,018 DEBUG
> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> Got
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>  ping response for sessionid:0x424dfd908c50009
> > >> after
> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>> 4544ms
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>  2. 2009-11-11 04:27:19,018 DEBUG
> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> Got
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>  ping response for sessionid:0x24dfd90c810002
> > after
> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>> 4548ms
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>  3. 2009-11-11 04:27:43,960 DEBUG
> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>>> Got
> > >> >>>> > > > >>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>  ping response for sessionid:0x424dfd908c50009
> > >> after
> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>> 9030ms
> > >> >>>> > > > >>>
> > >> >>>> > > > >>>>  4. 2009-11-11 04:27:43,960 DEBUG
> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.
> > >> >>>> > > > >>>>>>>>
> > >> >>>> > > > >>>>>>>
> > >> >>>> > > >
> > >> >>>> > >
> > >> >>>> >
> > >> >>>>
> > >> >>>
> > >> >>>
> > >> >>
> > >> >
> > >>
> > >
> > >
> >
>

Re: regionserver disconnection

Reply via email to