You can restart the regionserver and it will rejoin the cluster; the master will assign it regions again. We used to autorestart, but that was taken out; the thinking is that a baby-sitter process watching the regionserver would be better at this.
St.Ack
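For reference, a single regionserver is restarted by hand with the daemon script shipped in HBase's bin/ directory; a sketch under the assumption of the 0.20-era scripts, run on the affected node:

```sh
# On the node whose regionserver died (paths relative to $HBASE_HOME).
# The restarted process re-registers with the master, which then
# assigns it regions again, as described above.
bin/hbase-daemon.sh start regionserver
```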
On Thu, Oct 29, 2009 at 3:35 PM, Zhenyu Zhong <[email protected]> wrote:
> BTW, if a regionserver disconnects, would a manual restart of that
> particular regionserver help?
>
> Best,
> zhenyu

On Thu, Oct 29, 2009 at 6:26 PM, Zhenyu Zhong <[email protected]> wrote:
> Stack,
>
> I really appreciate your help.
> Yes, I am using cacti to monitor the loads etc. I also upped my zk
> session timeout to 600 sec.
> May I ask what the default connection timeout is for a zookeeper client
> connecting to a quorum?
>
> thanks
> zhenyu

On Thu, Oct 29, 2009 at 6:06 PM, stack <[email protected]> wrote:
> If it stole machine resources, yeah, it could. Do you have anything in
> place to watch your cluster with? Ganglia or some such, so you can watch
> the loadings? Is the machine with the RS that is going down swapping?
> You could try upping your zk session timeout in your hbase cluster.
> St.Ack

On Thu, Oct 29, 2009 at 3:00 PM, Zhenyu Zhong <[email protected]> wrote:
> The only other thing that could have started is another MR job working
> on a different dataset at the same time this test was running, so some
> nodes might have been under heavy load. I am wondering whether that
> would cause the connection timeout.
>
> thanks
> zhenyu

On Thu, Oct 29, 2009 at 5:32 PM, stack <[email protected]> wrote:
> On Thu, Oct 29, 2009 at 2:23 PM, Zhenyu Zhong <[email protected]> wrote:
> > I have 19 quorum members now.
>
> That's too many. Have 3 or maybe 5. See the zk site for the rationale.
>
> > When I ran a test loading data into two column families of one table
> > in HBase using two separate MR jobs, I lost my regionserver and the
> > test failed.
> >
> > Does HBase allow such a table update operation?
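The session-timeout change discussed above is a client/regionserver-side setting in hbase-site.xml; a minimal sketch (the property name is the standard `zookeeper.session.timeout`, value in milliseconds; 600000 ms matches the 600 sec mentioned in the thread):

```xml
<!-- hbase-site.xml: ZooKeeper session timeout, in milliseconds.
     600000 ms = 600 s. Note ZooKeeper negotiates the actual timeout
     with the server, so the quorum's tickTime/maxSessionTimeout must
     allow a value this large. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>600000</value>
</property>
```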
> > The errors I got when I lost my regionserver are:
> > 2009-10-29 21:09:34,705 INFO org.apache.hadoop.hbase.regionserver.HLog:
> > Roll /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256849620029,
> > entries=271911, calcsize=63754142, filesize=33975611. New hlog
> > /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256850574705
> > 2009-10-29 21:09:50,322 WARN
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> > org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>
> You have read the 'Getting Started' and made the necessary changes to
> file descriptors and the xceiver count?
>
> You will see the above message after a regionserver has restarted and
> tries to go back to the master (what hbase is this? I think you said
> 0.20.x).
>
> > java.io.IOException: TIMED OUT
> >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > 2009-10-29 21:09:50,873 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> > event, state: Disconnected, type: None, path: null
>
> This is a timeout against zk. You've lost your session. The RS will go
> down. The connection to zk is basic to hbase. Something is up. In the
> past others have reported things like incorrect bios settings on disks
> that have made the disks run slow, or just something up with the
> networking. Can you check all is healthy? You seem to be having too
> many issues for such a small loading with such a large cluster.
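The file-descriptor and xceiver changes referenced above are OS- and HDFS-level settings; a sketch of the usual two pieces (the exact values below are illustrative, in the range the HBase docs of that era recommended; note the HDFS property name really is spelled "xcievers"):

```
# /etc/security/limits.conf -- raise the open-file limit for the
# account running Hadoop/HBase (user name and value illustrative):
hadoop  -  nofile  32768
```

and on each datanode:

```xml
<!-- hdfs-site.xml: raise the datanode's concurrent-xceiver cap so
     HBase's many open store files don't exhaust it. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```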
> St.Ack

On Thu, Oct 29, 2009 at 2:51 PM, stack <[email protected]> wrote:
> On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong <[email protected]> wrote:
> > FYI
> > It looks like increasing the number of Zookeeper quorum members can
> > solve the following error:
> > org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out
> > trying to locate root region
>
> You mean quorum members? How many do you have now?
>
> > Now I am running a Zookeeper quorum peer on each node I have.
> > However, I am still having issues with losing regionservers.
>
> What's in the logs?
>
> > Is there a way to browse the znodes in zookeeper?
>
> Type 'zk' in the hbase shell. You can get to the zk shell from the
> hbase shell. You do things like:
>
>   zk "ls /"
>
> (Yes, quotes needed.)
>
> St.Ack
>
> > thanks
> > zhenyu

On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong <[email protected]> wrote:
> JG,
>
> Thanks a lot for the tips.
> I set the HEAP to 4GB and the GC options to -XX:ParallelGCThreads=8
> -XX:+UseConcMarkSweepGC.
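The heap and GC settings above map onto conf/hbase-env.sh; a sketch, assuming the 0.20-era startup scripts (HBASE_HEAPSIZE is in megabytes):

```sh
# conf/hbase-env.sh -- 4 GB heap plus CMS, as discussed in the thread.
# Per Ryan's later note, ParallelGCThreads is optional; the important
# flag for avoiding long stop-the-world pauses is UseConcMarkSweepGC.
export HBASE_HEAPSIZE=4000
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8"
```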
> I checked the logs in my master and RS and found the following errors.
> Basically, the master got an exception while scanning ROOT, then the
> ROOT region was offlined and unset, and thus the regionserver threw
> NotServingRegion errors.
>
> In the master:
> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server: x.x.x.x:60021,
> regionname: -ROOT-,,0, startKey: <>}
> 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner:
> Scan ROOT region
> java.io.IOException: Call to /x.x.x.x:60021 failed on local exception:
> java.io.EOFException
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>         at $Proxy1.openScanner(Unknown Source)
>         at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>         at org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54)
>         at org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79)
>         at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448)
> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scanning meta region {server: x.x.x.x:60021,
> regionname: .META.,,1, startKey: <>}
> 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner:
> Scan one META region: {server: x.x.x.x:60021, regionname: .META.,,1,
> startKey: <>}
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>         at $Proxy1.openScanner(Unknown Source)
>         at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>         at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>         at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>         at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
> All 1 .META. region(s) scanned
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.ServerManager:
> Removing server's info YYYY,60021,1256755470570
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> Offlined ROOT server: x.x.x.x:60021
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> -ROOT- region unset (but not set to be reassigned)
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> ROOT inserted into regionsInTransition
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> Offlining META region: {server: x.x.x.x:60021, regionname: .META.,,1,
> startKey: <>}
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> META region removed from onlineMetaRegions
>
> On the regionserver:
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> MSG_REGION_OPEN: test,,1256755871065
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> Worker: MSG_REGION_OPEN: test,,1256755871065
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> region test,,1256755871065/796855017 available; sequence id is 10013291
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Starting compaction on region test,,1256755871065
> 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x249c76021d0001 after 0ms
> 2009-10-28 18:51:19,341 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784)
>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 0 on 60021, call get([...@21fefd80, row=1053508149, maxVersions=1,
> timeRange=[0,9223372036854775807), families={(family=email_ip_activity,
> columns=ALL}) from x.x.x.x:54669: error:
> org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503

On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <[email protected]> wrote:
> These client error messages are not particularly descriptive as to the
> root cause (they are fatal errors, or close to it).
>
> What is going on in your regionservers when these errors happen? Check
> the master and RS logs.
>
> Also, you definitely do not want 19 zookeeper nodes. Reduce that to 3,
> or 5 max.
>
> What is the hardware you are using for these nodes, and what settings
> do you have for heap/GC?
>
> JG

Zhenyu Zhong wrote:
> Stack,
>
> Thank you very much for your comments.
> I am running a cluster with 20 nodes. I set 19 nodes up as both
> regionservers and zookeeper quorum members.
> The versions I am using are Hadoop 0.20.1 and HBase 0.20.1.
> I started with an empty table and tried to load 200 million records
> into it. There is a key in each record. In my MR program, during setup
> I open an HTable; in my mapper I fetch the record from the HTable via
> the key in the record, then make some changes to the columns and
> update that row back to the HTable through TableOutputFormat by
> passing a Put. There are no reduce tasks involved here.
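JG's "3 or 5, not 19" advice comes down to quorum arithmetic: a ZooKeeper ensemble of n voting peers needs a majority (n/2 + 1) alive and makes every write wait on that majority, so adding voters mostly adds write-coordination cost for little extra fault tolerance. A tiny illustrative calculation (plain Java, not HBase or ZooKeeper code):

```java
// Quorum arithmetic behind the "3 or 5 ZooKeeper nodes" advice:
// an ensemble of n voting peers needs floor(n/2)+1 of them alive and
// survives floor((n-1)/2) failures, but every write must be acked by
// that majority -- 19 peers tolerate 9 failures yet force each write
// through 10 servers, versus 3 for a 5-peer ensemble tolerating 2.
public class QuorumMath {
    static int majority(int n) { return n / 2 + 1; }
    static int toleratedFailures(int n) { return (n - 1) / 2; }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 19}) {
            System.out.println(n + " peers: majority=" + majority(n)
                    + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```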
(Though it is unnecessary to > >> fetch > >> > > row > >> > > > > > from > >> > > > > > >>> an > >> > > > > > >>> empty table, I just intended to do that) > >> > > > > > >>> > >> > > > > > >>> Additionally, when I reduced the number of regionservers > and > >> > > number > >> > > > > of > >> > > > > > >>> zookeeper quorums. > >> > > > > > >>> I had different errors: > >> > > > > > >>> org.apache.hadoop.hbase.client.NoServerForRegionException: > >> > Timed > >> > > > out > >> > > > > > >>> trying > >> > > > > > >>> to locate root region at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > 
>> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556) > >> > > > > > >>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:127) > >> at > >> > > > > > >>> org.apache.hadoop.hbase.client.HTable.(HTable.java:105) at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116) > >> > > > > > >>> at > >> > > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573) > >> > > > at > >> > > > > > >>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at > >> > > > > > >>> org.apache.hadoop.mapred.Child.main(Child.java:170) > >> > > > > > >>> > >> > > > > > >>> Many thanks in advance. > >> > > > > > >>> zhenyu > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > >>> On Wed, Oct 28, 2009 at 12:39 PM, stack <[email protected] > > > >> > > wrote: > >> > > > > > >>> > >> > > > > > >>> Whats your cluster topology? How many nodes involved? > >> When > >> > you > >> > > > see > >> > > > > > the > >> > > > > > >>>> below message, how many regions in your table? How are > you > >> > > > loading > >> > > > > > your > >> > > > > > >>>> table? 
> Thanks,
> St.Ack

On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <[email protected]> wrote:
> Nitay,
>
> I really appreciate it.
>
> As Ryan suggested, I increased the zookeeper session timeout to 40
> seconds, with the GC options -XX:ParallelGCThreads=8
> -XX:+UseConcMarkSweepGC in place. I set the heap size to 4GB. I also
> set vm.swappiness=0.
>
> However it still ran into problems. Please find the following errors:
>
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact region server x.x.x.x:60021 for region
> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10
> attempts.
> Exceptions:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> setting up proxy to /x.x.x.x:60021 after attempts=1
> (the same "Failed setting up proxy" exception listed ten times, once
> per attempt)
>
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001)
>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413)
>
> The input file is about 10GB, around 200 million rows of data. This
> load doesn't seem too large, yet this kind of error keeps popping up.
>
> Does the regionserver need to be deployed on dedicated machines?
> Does Zookeeper need to be deployed on dedicated machines as well?
>
> Best,
> zhenyu

On Wed, Oct 28, 2009 at 1:37 AM, nitay <[email protected]> wrote:
> Hi Zhenyu,
>
> Sorry for the delay. I started working on this a while back, before I
> left my job for another company. Since then I haven't had much time to
> work on HBase, unfortunately :(. I'll try to dig up what I had, see
> what shape it's in, and update you.
>
> Cheers,
> -n

On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote:
> Sorry, I must have mistyped; I meant to say "40 seconds".
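As context for the RetriesExhaustedException quoted above: it is simply the client giving up after a bounded retry loop and reporting every failure it saw. A generic sketch of that pattern (plain Java, not the actual HBase client code; the names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Generic bounded-retry loop of the kind the HBase client runs before
// surfacing a RetriesExhaustedException: try the call, remember each
// failure, give up after maxAttempts and report everything seen.
public class RetryLoop {
    public static <T> T callWithRetries(Callable<T> call, int maxAttempts)
            throws Exception {
        List<Exception> failures = new ArrayList<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                failures.add(e); // a real client would also sleep/back off here
            }
        }
        throw new Exception("failed after " + maxAttempts
                + " attempts; exceptions: " + failures);
    }
}
```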
> You can still see multi-second pauses at times, so you need to give
> yourself a bigger buffer.
>
> The parallel threads argument should not be necessary, but you do need
> the UseConcMarkSweepGC flag as well.
>
> Let us know how it goes!
> -ryan
>
> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong <[email protected]> wrote:
> > Ryan,
> > I really appreciate your feedback.
> > I have set zookeeper.session.timeout to seconds, which is way higher
> > than 40ms.
> > At the same time, -Xms is set to 4GB, which should be sufficient.
> > I also tried GC options like
> >
> > -XX:ParallelGCThreads=8
> > -XX:+UseConcMarkSweepGC
> >
> > I even set vm.swappiness=0.
> >
> > However, I still came across the problem of a RegionServer shutting
> > itself down.
> >
> > Best,
> > zhong

On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <[email protected]> wrote:
> Set the ZK timeout to something like 40ms, and give the GC enough Xmx
> so you never risk entering the much dreaded concurrent-mode failure,
> whereby the entire heap must be GCed.
>
> Consider testing Java 7 and the G1 GC.
>
> We could get a JNI thread to do this, but no one has done so yet. I am
> personally hoping for G1, and in the meantime overprovision our Xmx to
> avoid the concurrent mode failures.
>
> -ryan

On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong <[email protected]> wrote:
> Ryan,
>
> Thank you very much.
> May I ask whether there are any ways to get around this problem to
> make HBase more stable?
>
> best,
> zhong

On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <[email protected]> wrote:
> There isn't any working code yet. Just an idea, and a prototype.
>
> There is some sense that if we can get the G1 GC, we could get rid of
> all long pauses and avoid the need for this.
>
> -ryan

On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong <[email protected]> wrote:
> Hi,
>
> I am very interested in the solution that Joey proposed and would like
> to give it a try.
> Does anyone have any ideas on how to deploy this zk_wrapper with JNI
> integration?
>
> I would really appreciate it.
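On the vm.swappiness=0 tweak mentioned earlier in the thread: it is a Linux kernel knob, and the usual way to set it is via sysctl; a sketch (standard sysctl mechanics, value as in the thread):

```sh
# Discourage the kernel from swapping out the JVM heap: a swapped-out
# regionserver can pause long enough to lose its ZooKeeper session.
sysctl -w vm.swappiness=0                       # takes effect now
echo 'vm.swappiness = 0' >> /etc/sysctl.conf    # persist across reboots
```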
> thanks
> zhong
