On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong <[email protected]> wrote:

> FYI
> It looks like increasing the number of Zookeeper quorums can solve the
> following error message:
> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out
> trying to locate root region at org.apache.hadoop.hbase.

You mean quorum members?  How many do you have now?

> Now I am running a ZooKeeper quorum member on each node I have.
> However, I am still having issues with losing regionservers.

What's in the logs?

> Is there a way to browse the znodes in ZooKeeper?

Type 'zk' in the hbase shell.  You can get to the zk shell from the hbase
shell.  You do things like:

  zk "ls /"

(Yes, quotes needed.)

St.Ack
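A short illustration of the zk command St.Ack describes above: it hands its
argument to the ZooKeeper shell, so you can walk the znodes HBase registers
itself under. The child znodes shown here are the usual ones for 0.20 and are
assumptions, not output captured from this cluster:

  hbase> zk "ls /"
  hbase> zk "ls /hbase"                       # default zookeeper.znode.parent
  hbase> zk "get /hbase/root-region-server"   # which server is carrying -ROOT-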
> thanks
> zhenyu
>
> On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong <[email protected]> wrote:
>
>> JG,
>>
>> Thanks a lot for the tips.
>> I set the heap to 4GB and the GC options to -XX:ParallelGCThreads=8
>> -XX:+UseConcMarkSweepGC.
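For reference, the heap and GC flags described above would normally be
exported from conf/hbase-env.sh in HBase 0.20. A minimal sketch; the two GC
flags and the 4GB figure come from this thread, everything else is assumed:

  # conf/hbase-env.sh (sketch)
  export HBASE_HEAPSIZE=4000   # heap in MB, roughly the 4GB mentioned above
  export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8"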
>>
>> I checked the logs on my master and RS and found the following errors.
>> Basically, the master got an exception while scanning the ROOT region,
>> then the -ROOT- region was offlined and unset, and the regionserver
>> started returning NotServingRegionException errors.
>>
>> In the master:
>>
>> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: x.x.x.x:60021, regionname: -ROOT-,,0, startKey: <>}
>> 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan ROOT region
>> java.io.IOException: Call to /x.x.x.x:60021 failed on local exception: java.io.EOFException
>>     at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>>     at $Proxy1.openScanner(Unknown Source)
>>     at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>>     at org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54)
>>     at org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79)
>>     at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>>     at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>> Caused by: java.io.EOFException
>>     at java.io.DataInputStream.readInt(DataInputStream.java:375)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448)
>> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: x.x.x.x:60021, regionname: .META.,,1, startKey: <>}
>> 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan one META region: {server: x.x.x.x:60021, regionname: .META.,,1, startKey: <>}
>> java.net.ConnectException: Connection refused
>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>>     at $Proxy1.openScanner(Unknown Source)
>>     at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>>     at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>     at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>     at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>>     at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.ServerManager: Removing server's info YYYY,60021,1256755470570
>> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: Offlined ROOT server: x.x.x.x:60021
>> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
>> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
>> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: Offlining META region: {server: x.x.x.x:60021, regionname: .META.,,1, startKey: <>}
>> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: META region removed from onlineMetaRegions
>>
>> On the regionserver:
>>
>> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test,,1256755871065
>> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test,,1256755871065
>> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion: region test,,1256755871065/796855017 available; sequence id is 10013291
>> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region test,,1256755871065
>> 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x249c76021d0001 after 0ms
>> 2009-10-28 18:51:19,341 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
>> org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503
>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307)
>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784)
>>     at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60021, call get([...@21fefd80, row=1053508149, maxVersions=1, timeRange=[0,9223372036854775807), families={(family=email_ip_activity, columns=ALL}) from x.x.x.x:54669: error: org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503
>>
>> On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <[email protected]> wrote:
>>
>>> These client error messages are not particularly descriptive as to the
>>> root cause (they are fatal errors, or close to it).
>>>
>>> What is going on in your regionservers when these errors happen?  Check
>>> the master and RS logs.
>>>
>>> Also, you definitely do not want 19 zookeeper nodes.  Reduce that to 3
>>> or 5 max.
>>>
>>> What is the hardware you are using for these nodes, and what settings
>>> do you have for heap/GC?
>>>
>>> JG
>>>
>>> Zhenyu Zhong wrote:
>>>
>>>> Stack,
>>>>
>>>> Thank you very much for your comments.
>>>> I am running a cluster with 20 nodes.  I set up 19 of them as both
>>>> regionservers and ZooKeeper quorum members.
>>>> The versions I am using are Hadoop 0.20.1 and HBase 0.20.1.
>>>> I started with an empty table and tried to load 200 million records
>>>> into that empty table.
>>>> There is a key in each record.  Logically, in my MR program, I open an
>>>> HTable during setup; in my mapper, I fetch the record from the HTable
>>>> via the key in the record, then make some changes to the columns and
>>>> update that row back to the HTable through TableOutputFormat by
>>>> passing a Put.  There are no reduce tasks involved.  (Though it is
>>>> unnecessary to fetch rows from an empty table, I just intended to do
>>>> that.)
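For concreteness, the read-modify-write job described above corresponds
roughly to a mapper like the one below, written against the HBase 0.20
client and mapreduce APIs. This is only an illustrative sketch: the table
name "test" and family "email_ip_activity" appear in the logs earlier in
this thread, but the qualifier, the recompute step, and the class name are
invented placeholders, not the poster's actual code.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UpdateRowMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("email_ip_activity");
  private static final byte[] QUALIFIER = Bytes.toBytes("value");  // placeholder

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // Open the table once per map task, as described above.
    table = new HTable(new HBaseConfiguration(context.getConfiguration()), "test");
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Each input line carries the row key.
    byte[] row = Bytes.toBytes(line.toString().trim());

    // Fetch the current row by key (a no-op while the table is still empty).
    Result existing = table.get(new Get(row));
    byte[] current = existing.getValue(FAMILY, QUALIFIER);

    // Modify the column and emit a Put; TableOutputFormat writes it back
    // to the table configured on the job.
    Put put = new Put(row);
    put.add(FAMILY, QUALIFIER, recompute(current));
    context.write(new ImmutableBytesWritable(row), put);
  }

  // Placeholder for whatever column rewrite the real job performs.
  private byte[] recompute(byte[] current) {
    return current == null ? Bytes.toBytes(0L) : current;
  }
}

The job itself would set TableOutputFormat as the output format (for example
via TableMapReduceUtil), which matches the TableOutputFormat.getRecordWriter
frame in the stack trace quoted below.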
>>>>
>>>> Additionally, when I reduced the number of regionservers and the
>>>> number of ZooKeeper quorum members, I got different errors:
>>>>
>>>> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region
>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929)
>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580)
>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589)
>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593)
>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556)
>>>>     at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:127)
>>>>     at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:105)
>>>>     at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116)
>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>
>>>> Many thanks in advance.
>>>> zhenyu
>>>>
>>>> On Wed, Oct 28, 2009 at 12:39 PM, stack <[email protected]> wrote:
>>>>
>>>>> What's your cluster topology?  How many nodes involved?  When you see
>>>>> the below message, how many regions in your table?  How are you
>>>>> loading your table?
>>>>> Thanks,
>>>>> St.Ack
>>>>>
>>>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <[email protected]> wrote:
>>>>>
>>>>>> Nitay,
>>>>>>
>>>>>> I really appreciate it.
>>>>>>
>>>>>> As Ryan suggested, I increased the ZooKeeper session timeout to 40
>>>>>> seconds, with the GC options -XX:ParallelGCThreads=8
>>>>>> -XX:+UseConcMarkSweepGC in place.  I set the heap size to 4GB.  I
>>>>>> also set vm.swappiness=0.
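For anyone reproducing this, the 40-second session timeout mentioned above is
the zookeeper.session.timeout property, given in milliseconds in
hbase-site.xml. A minimal sketch; 40000 simply mirrors the 40 seconds from
this thread:

  <property>
    <name>zookeeper.session.timeout</name>
    <value>40000</value> <!-- 40 seconds -->
  </property>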
>>>>>>
>>>>>> However, it still ran into problems.  Please see the following
>>>>>> errors:
>>>>>>
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server x.x.x.x:60021 for region YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10 attempts.
>>>>>> Exceptions:
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>>>
>>>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001)
>>>>>>     at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413)
>>>>>>
>>>>>> The input file is about 10GB, around 200 million rows of data.
>>>>>> This load doesn't seem too large, but this kind of error keeps
>>>>>> popping up.
>>>>>>
>>>>>> Do regionservers need to be deployed on dedicated machines?
>>>>>> Does ZooKeeper need to be deployed on dedicated machines as well?
>>>>>>
>>>>>> Best,
>>>>>> zhenyu
>>>>>>
>>>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Zhenyu,
>>>>>>>
>>>>>>> Sorry for the delay.  I started working on this a while back, before
>>>>>>> I left my job for another company.  Since then I haven't had much
>>>>>>> time to work on HBase, unfortunately :(.  I'll try to dig up what I
>>>>>>> had, see what shape it's in, and update you.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> -n
>>>>>>>
>>>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote:
>>>>>>>
>>>>>>>> Sorry, I must have mistyped; I meant to say "40 seconds".  You can
>>>>>>>> still see multi-second pauses at times, so you need to give
>>>>>>>> yourself a bigger buffer.
>>>>>>>>
>>>>>>>> The parallel threads argument should not be necessary, but you do
>>>>>>>> need the UseConcMarkSweepGC flag as well.
>>>>>>>>
>>>>>>>> Let us know how it goes!
>>>>>>>> -ryan
>>>>>>>>
>>>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Ryan,
>>>>>>>>>
>>>>>>>>> I really appreciate your feedback.
>>>>>>>>> I have set zookeeper.session.timeout to a value in seconds, which
>>>>>>>>> is way higher than 40ms.  At the same time, -Xms is set to 4GB,
>>>>>>>>> which should be sufficient.
>>>>>>>>> I also tried the GC options
>>>>>>>>>
>>>>>>>>> -XX:ParallelGCThreads=8
>>>>>>>>> -XX:+UseConcMarkSweepGC
>>>>>>>>>
>>>>>>>>> I even set vm.swappiness=0.
>>>>>>>>>
>>>>>>>>> However, I still came across the problem that a regionserver shut
>>>>>>>>> itself down.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> zhong
>>>>>>>>>
>>>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Set the ZK timeout to something like 40ms, and give the GC enough
>>>>>>>>>> Xmx so you never risk entering the much dreaded
>>>>>>>>>> concurrent-mode-failure, whereby the entire heap must be GCed.
>>>>>>>>>>
>>>>>>>>>> Consider testing Java 7 and the G1 GC.
>>>>>>>>>>
>>>>>>>>>> We could get a JNI thread to do this, but no one has done so yet.
>>>>>>>>>> I am personally hoping for G1, and in the meantime we
>>>>>>>>>> overprovision our Xmx to avoid the concurrent mode failures.
>>>>>>>>>>
>>>>>>>>>> -ryan
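To make Ryan's two suggestions concrete (illustrative only, not settings
taken from this thread): give CMS more headroom and start its concurrent
cycle earlier, or opt in to the then-experimental G1 collector on an early
JDK 6u14+/JDK 7 build. The -Xmx value and occupancy fraction below are
assumptions:

  # conf/hbase-env.sh (sketch)
  # CMS with extra headroom, starting the concurrent cycle earlier:
  export HBASE_OPTS="-Xmx6g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
  # Or, to experiment with G1:
  # export HBASE_OPTS="-Xmx6g -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC"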
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ryan,
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much.
>>>>>>>>>>> May I ask whether there are any ways to get around this problem
>>>>>>>>>>> to make HBase more stable?
>>>>>>>>>>>
>>>>>>>>>>> best,
>>>>>>>>>>> zhong
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There isn't any working code yet.  Just an idea, and a
>>>>>>>>>>>> prototype.
>>>>>>>>>>>>
>>>>>>>>>>>> There is some sense that if we can get the G1 GC, we could get
>>>>>>>>>>>> rid of all long pauses and avoid the need for this.
>>>>>>>>>>>>
>>>>>>>>>>>> -ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am very interested in the solution that Joey proposed and
>>>>>>>>>>>>> would like to give it a try.
>>>>>>>>>>>>> Does anyone have any ideas on how to deploy this zk_wrapper
>>>>>>>>>>>>> via JNI integration?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would really appreciate it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks
>>>>>>>>>>>>> zhong