You can restart the regionserver and it will rejoin the cluster; the master will assign it regions again. We used to autorestart, but that was taken out; the thinking is that a baby-sitter process watching the regionserver would be better at this.
St.Ack
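For reference, a single regionserver is restarted by hand with the daemon script shipped in HBase's bin/ directory; a sketch under the assumption of the 0.20-era scripts, run on the affected node:

```sh
# On the node whose regionserver died (paths relative to $HBASE_HOME).
# The restarted process re-registers with the master, which then
# assigns it regions again, as described above.
bin/hbase-daemon.sh start regionserver
```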
On Thu, Oct 29, 2009 at 3:35 PM, Zhenyu Zhong <[email protected]> wrote:
> BTW, if a regionserver disconnects, would a manual restart of that
> particular regionserver help?
>
> Best,
> zhenyu

On Thu, Oct 29, 2009 at 6:26 PM, Zhenyu Zhong <[email protected]> wrote:
> Stack,
>
> I really appreciate your help.
> Yes, I am using cacti to monitor the loads etc. I also upped my zk
> session timeout to 600 sec.
> May I ask what the default connection timeout is for a zookeeper client
> connecting to a quorum?
>
> thanks
> zhenyu

On Thu, Oct 29, 2009 at 6:06 PM, stack <[email protected]> wrote:
> If it stole machine resources, yeah, it could. Do you have anything in
> place to watch your cluster with? Ganglia or some such, so you can watch
> the loadings? Is the machine with the RS that is going down swapping?
> You could try upping your zk session timeout in your hbase cluster.
> St.Ack

On Thu, Oct 29, 2009 at 3:00 PM, Zhenyu Zhong <[email protected]> wrote:
> The only other thing that could have started is another MR job working
> on a different dataset at the same time this test was running, so some
> nodes might have been under heavy load. I am wondering whether that
> would cause the connection timeout.
>
> thanks
> zhenyu

On Thu, Oct 29, 2009 at 5:32 PM, stack <[email protected]> wrote:
> On Thu, Oct 29, 2009 at 2:23 PM, Zhenyu Zhong <[email protected]> wrote:
> > I have 19 quorum members now.
>
> That's too many. Have 3 or maybe 5. See the zk site for the rationale.
>
> > When I ran a test loading data into two column families of one table
> > in HBase using two separate MR jobs, I lost my regionserver and the
> > test failed.
> >
> > Does HBase allow such a table update operation?
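The session-timeout change discussed above is a client/regionserver-side setting in hbase-site.xml; a minimal sketch (the property name is the standard `zookeeper.session.timeout`, value in milliseconds; 600000 ms matches the 600 sec mentioned in the thread):

```xml
<!-- hbase-site.xml: ZooKeeper session timeout, in milliseconds.
     600000 ms = 600 s. Note ZooKeeper negotiates the actual timeout
     with the server, so the quorum's tickTime/maxSessionTimeout must
     allow a value this large. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>600000</value>
</property>
```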
> > The errors I got when I lost my regionserver are:
> > 2009-10-29 21:09:34,705 INFO org.apache.hadoop.hbase.regionserver.HLog:
> > Roll /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256849620029,
> > entries=271911, calcsize=63754142, filesize=33975611. New hlog
> > /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256850574705
> > 2009-10-29 21:09:50,322 WARN
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> > org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>
> You have read the 'Getting Started' and made the necessary changes to
> file descriptors and the xceiver count?
>
> You will see the above message after a regionserver has restarted and
> tries to go back to the master (what hbase is this? I think you said
> 0.20.x).
>
> > java.io.IOException: TIMED OUT
> >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > 2009-10-29 21:09:50,873 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> > event, state: Disconnected, type: None, path: null
>
> This is a timeout against zk. You've lost your session. The RS will go
> down. The connection to zk is basic to hbase. Something is up. In the
> past others have reported things like incorrect bios settings on disks
> that have made the disks run slow, or just something up with the
> networking. Can you check all is healthy? You seem to be having too
> many issues for such a small loading with such a large cluster.
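The file-descriptor and xceiver changes referenced above are OS- and HDFS-level settings; a sketch of the usual two pieces (the exact values below are illustrative, in the range the HBase docs of that era recommended; note the HDFS property name really is spelled "xcievers"):

```
# /etc/security/limits.conf -- raise the open-file limit for the
# account running Hadoop/HBase (user name and value illustrative):
hadoop  -  nofile  32768
```

and on each datanode:

```xml
<!-- hdfs-site.xml: raise the datanode's concurrent-xceiver cap so
     HBase's many open store files don't exhaust it. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```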
> St.Ack

On Thu, Oct 29, 2009 at 2:51 PM, stack <[email protected]> wrote:
> On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong <[email protected]> wrote:
> > FYI
> > It looks like increasing the number of Zookeeper quorum members can
> > solve the following error:
> > org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out
> > trying to locate root region
>
> You mean quorum members? How many do you have now?
>
> > Now I am running a Zookeeper quorum peer on each node I have.
> > However, I am still having issues with losing regionservers.
>
> What's in the logs?
>
> > Is there a way to browse the znodes in zookeeper?
>
> Type 'zk' in the hbase shell. You can get to the zk shell from the
> hbase shell. You do things like:
>
>   zk "ls /"
>
> (Yes, quotes needed.)
>
> St.Ack
>
> > thanks
> > zhenyu

On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong <[email protected]> wrote:
> JG,
>
> Thanks a lot for the tips.
> I set the HEAP to 4GB and the GC options to -XX:ParallelGCThreads=8
> -XX:+UseConcMarkSweepGC.
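The heap and GC settings above map onto conf/hbase-env.sh; a sketch, assuming the 0.20-era startup scripts (HBASE_HEAPSIZE is in megabytes):

```sh
# conf/hbase-env.sh -- 4 GB heap plus CMS, as discussed in the thread.
# Per Ryan's later note, ParallelGCThreads is optional; the important
# flag for avoiding long stop-the-world pauses is UseConcMarkSweepGC.
export HBASE_HEAPSIZE=4000
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8"
```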
> I checked the logs in my master and RS and found the following errors.
> Basically, the master got an exception while scanning ROOT, then the
> ROOT region was offlined and unset, and thus the regionserver threw
> NotServingRegion errors.
>
> In the master:
> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server: x.x.x.x:60021,
> regionname: -ROOT-,,0, startKey: <>}
> 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner:
> Scan ROOT region
> java.io.IOException: Call to /x.x.x.x:60021 failed on local exception:
> java.io.EOFException
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>         at $Proxy1.openScanner(Unknown Source)
>         at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>         at org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54)
>         at org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79)
>         at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448)
> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scanning meta region {server: x.x.x.x:60021,
> regionname: .META.,,1, startKey: <>}
> 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner:
> Scan one META region: {server: x.x.x.x:60021, regionname: .META.,,1,
> startKey: <>}
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>         at $Proxy1.openScanner(Unknown Source)
>         at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>         at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>         at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>         at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
> All 1 .META. region(s) scanned
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.ServerManager:
> Removing server's info YYYY,60021,1256755470570
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> Offlined ROOT server: x.x.x.x:60021
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> -ROOT- region unset (but not set to be reassigned)
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> ROOT inserted into regionsInTransition
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> Offlining META region: {server: x.x.x.x:60021, regionname: .META.,,1,
> startKey: <>}
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> META region removed from onlineMetaRegions
>
> On the regionserver:
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> MSG_REGION_OPEN: test,,1256755871065
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> Worker: MSG_REGION_OPEN: test,,1256755871065
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> region test,,1256755871065/796855017 available; sequence id is 10013291
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Starting compaction on region test,,1256755871065
> 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x249c76021d0001 after 0ms
> 2009-10-28 18:51:19,341 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784)
>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 0 on 60021, call get([...@21fefd80, row=1053508149, maxVersions=1,
> timeRange=[0,9223372036854775807), families={(family=email_ip_activity,
> columns=ALL}) from x.x.x.x:54669: error:
> org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503

On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <[email protected]> wrote:
> These client error messages are not particularly descriptive as to the
> root cause (they are fatal errors, or close to it).
>
> What is going on in your regionservers when these errors happen? Check
> the master and RS logs.
>
> Also, you definitely do not want 19 zookeeper nodes. Reduce that to 3,
> or 5 max.
>
> What is the hardware you are using for these nodes, and what settings
> do you have for heap/GC?
>
> JG

Zhenyu Zhong wrote:
> Stack,
>
> Thank you very much for your comments.
> I am running a cluster with 20 nodes. I set 19 nodes up as both
> regionservers and zookeeper quorum members.
> The versions I am using are Hadoop 0.20.1 and HBase 0.20.1.
> I started with an empty table and tried to load 200 million records
> into it. There is a key in each record. In my MR program, during setup
> I open an HTable; in my mapper I fetch the record from the HTable via
> the key in the record, then make some changes to the columns and
> update that row back to the HTable through TableOutputFormat by
> passing a Put. There are no reduce tasks involved here.
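JG's "3 or 5, not 19" advice comes down to quorum arithmetic: a ZooKeeper ensemble of n voting peers needs a majority (n/2 + 1) alive and makes every write wait on that majority, so adding voters mostly adds write-coordination cost for little extra fault tolerance. A tiny illustrative calculation (plain Java, not HBase or ZooKeeper code):

```java
// Quorum arithmetic behind the "3 or 5 ZooKeeper nodes" advice:
// an ensemble of n voting peers needs floor(n/2)+1 of them alive and
// survives floor((n-1)/2) failures, but every write must be acked by
// that majority -- 19 peers tolerate 9 failures yet force each write
// through 10 servers, versus 3 for a 5-peer ensemble tolerating 2.
public class QuorumMath {
    static int majority(int n) { return n / 2 + 1; }
    static int toleratedFailures(int n) { return (n - 1) / 2; }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 19}) {
            System.out.println(n + " peers: majority=" + majority(n)
                    + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```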
(Though it is unnecessary to > >> fetch > >> > > row > >> > > > > > from > >> > > > > > >>> an > >> > > > > > >>> empty table, I just intended to do that) > >> > > > > > >>> > >> > > > > > >>> Additionally, when I reduced the number of regionservers > and > >> > > number > >> > > > > of > >> > > > > > >>> zookeeper quorums. > >> > > > > > >>> I had different errors: > >> > > > > > >>> org.apache.hadoop.hbase.client.NoServerForRegionException: > >> > Timed > >> > > > out > >> > > > > > >>> trying > >> > > > > > >>> to locate root region at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > 
>> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593) > >> > > > > > >>> at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556) > >> > > > > > >>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:127) > >> at > >> > > > > > >>> org.apache.hadoop.hbase.client.HTable.(HTable.java:105) at > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116) > >> > > > > > >>> at > >> > > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573) > >> > > > at > >> > > > > > >>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at > >> > > > > > >>> org.apache.hadoop.mapred.Child.main(Child.java:170) > >> > > > > > >>> > >> > > > > > >>> Many thanks in advance. > >> > > > > > >>> zhenyu > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > >>> > >> > > > > > >>> On Wed, Oct 28, 2009 at 12:39 PM, stack <[email protected] > > > >> > > wrote: > >> > > > > > >>> > >> > > > > > >>> Whats your cluster topology? How many nodes involved? > >> When > >> > you > >> > > > see > >> > > > > > the > >> > > > > > >>>> below message, how many regions in your table? How are > you > >> > > > loading > >> > > > > > your > >> > > > > > >>>> table? 
> Thanks,
> St.Ack

On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <[email protected]> wrote:
> Nitay,
>
> I really appreciate it.
>
> As Ryan suggested, I increased the zookeeper session timeout to 40
> seconds, with the GC options -XX:ParallelGCThreads=8
> -XX:+UseConcMarkSweepGC in place. I set the heap size to 4GB. I also
> set vm.swappiness=0.
>
> However it still ran into problems. Please find the following errors:
>
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact region server x.x.x.x:60021 for region
> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10
> attempts.
> Exceptions:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> setting up proxy to /x.x.x.x:60021 after attempts=1
> (the same "Failed setting up proxy" exception listed ten times, once
> per attempt)
>
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001)
>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413)
>
> The input file is about 10GB, around 200 million rows of data. This
> load doesn't seem too large, yet this kind of error keeps popping up.
>
> Does the regionserver need to be deployed on dedicated machines?
> Does Zookeeper need to be deployed on dedicated machines as well?
>
> Best,
> zhenyu

On Wed, Oct 28, 2009 at 1:37 AM, nitay <[email protected]> wrote:
> Hi Zhenyu,
>
> Sorry for the delay. I started working on this a while back, before I
> left my job for another company. Since then I haven't had much time to
> work on HBase, unfortunately :(. I'll try to dig up what I had, see
> what shape it's in, and update you.
>
> Cheers,
> -n

On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote:
> Sorry, I must have mistyped; I meant to say "40 seconds".
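As context for the RetriesExhaustedException quoted above: it is simply the client giving up after a bounded retry loop and reporting every failure it saw. A generic sketch of that pattern (plain Java, not the actual HBase client code; the names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Generic bounded-retry loop of the kind the HBase client runs before
// surfacing a RetriesExhaustedException: try the call, remember each
// failure, give up after maxAttempts and report everything seen.
public class RetryLoop {
    public static <T> T callWithRetries(Callable<T> call, int maxAttempts)
            throws Exception {
        List<Exception> failures = new ArrayList<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                failures.add(e); // a real client would also sleep/back off here
            }
        }
        throw new Exception("failed after " + maxAttempts
                + " attempts; exceptions: " + failures);
    }
}
```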
> You can still see multi-second pauses at times, so you need to give
> yourself a bigger buffer.
>
> The parallel threads argument should not be necessary, but you do need
> the UseConcMarkSweepGC flag as well.
>
> Let us know how it goes!
> -ryan
>
> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong <[email protected]> wrote:
> > Ryan,
> > I really appreciate your feedback.
> > I have set zookeeper.session.timeout to seconds, which is way higher
> > than 40ms.
> > At the same time, -Xms is set to 4GB, which should be sufficient.
> > I also tried GC options like
> >
> > -XX:ParallelGCThreads=8
> > -XX:+UseConcMarkSweepGC
> >
> > I even set vm.swappiness=0.
> >
> > However, I still came across the problem of a RegionServer shutting
> > itself down.
> >
> > Best,
> > zhong

On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <[email protected]> wrote:
> Set the ZK timeout to something like 40ms, and give the GC enough Xmx
> so you never risk entering the much dreaded concurrent-mode failure,
> whereby the entire heap must be GCed.
>
> Consider testing Java 7 and the G1 GC.
>
> We could get a JNI thread to do this, but no one has done so yet. I am
> personally hoping for G1, and in the meantime overprovision our Xmx to
> avoid the concurrent mode failures.
>
> -ryan

On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong <[email protected]> wrote:
> Ryan,
>
> Thank you very much.
> May I ask whether there are any ways to get around this problem to
> make HBase more stable?
>
> best,
> zhong

On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <[email protected]> wrote:
> There isn't any working code yet. Just an idea, and a prototype.
>
> There is some sense that if we can get the G1 GC, we could get rid of
> all long pauses and avoid the need for this.
>
> -ryan

On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong <[email protected]> wrote:
> Hi,
>
> I am very interested in the solution that Joey proposed and would like
> to give it a try.
> Does anyone have any ideas on how to deploy this zk_wrapper with JNI
> integration?
>
> I would really appreciate it.
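On the vm.swappiness=0 tweak mentioned earlier in the thread: it is a Linux kernel knob, and the usual way to set it is via sysctl; a sketch (standard sysctl mechanics, value as in the thread):

```sh
# Discourage the kernel from swapping out the JVM heap: a swapped-out
# regionserver can pause long enough to lose its ZooKeeper session.
sysctl -w vm.swappiness=0                       # takes effect now
echo 'vm.swappiness = 0' >> /etc/sysctl.conf    # persist across reboots
```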
> thanks
> zhong
