Roman:
> 11/11/08 00:44:31 WARN util.Sleeper: We slept 38891ms instead of
> 3000ms, this is likely due to a long garbage collecting pause and it's
> usually bad, see

3000ms is the default value for hbase.regionserver.msginterval. Obviously it
is too short for the validation scenario. Can you increase its value and
perform another round of testing? Thanks
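For example, something along these lines in hbase-site.xml on the
regionservers should do it (the 30000ms value below is only an illustration,
not a tuned recommendation; pick whatever fits your validation run):

  <!-- Interval, in ms, between reports from a regionserver to the master.
       Default is 3000; the 30000 here is just an example value. -->
  <property>
    <name>hbase.regionserver.msginterval</name>
    <value>30000</value>
  </property>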
On Mon, Nov 7, 2011 at 10:37 PM, Roman Shaposhnik <[email protected]> wrote:
> Forgot to add that from a master UI perspective here's where it is
> stuck at:
>
> $ curl http://master:60010/master-status?format=json
> [{"statustimems":-1,"status":"Waiting for distributed tasks to finish.
> scheduled=5 done=0
> error=0","starttimems":1320731070095,"description":"Doing distributed
> log split in
> [hdfs://ip-10-84-202-94.ec2.internal:17020/hbase/.logs/ip-10-114-225-185.ec2.internal,60020,1320726988138-splitting]","state":"RUNNING","statetimems":-1}]
>
> Regionserver finally dies and if I restart it manually the split seems to be
> finishing up as intended.
>
> Hope this helps.
>
> Thanks,
> Roman.
>
> On Mon, Nov 7, 2011 at 10:16 PM, Roman Shaposhnik <[email protected]> wrote:
> > With the HBASE-4754 fix in place I can get further in my testing,
> > but it still fails :-(
> >
> > Here's how it fails this time: it loads OK, but then when it
> > needs to split, here's what happens:
> >
> > 11/11/08 00:44:30 INFO handler.ServerShutdownHandler: Splitting logs
> > for ip-10-114-225-185.ec2.internal,60020,1320726988138
> > 11/11/08 00:44:30 INFO master.SplitLogManager: dead splitlog worker
> > ip-10-114-225-185.ec2.internal,60020,1320726988138
> > 11/11/08 00:44:30 INFO master.SplitLogManager: started splitting logs
> > in [hdfs://ip-10-84-202-94.ec2.internal:17020/hbase/.logs/ip-10-114-225-185.ec2.internal,60020,1320726988138-splitting]
> > 11/11/08 00:44:31 ERROR master.HMaster: Region server
> > ^@^@ip-10-114-225-185.ec2.internal,60020,1320726988138 reported a
> > fatal error:
> > ABORTING region server
> > ip-10-114-225-185.ec2.internal,60020,1320726988138: Unhandled
> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > rejected; currently processing
> > ip-10-114-225-185.ec2.internal,60020,1320726988138 as dead server
> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:222)
> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:148)
> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:750)
> >         at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1306)
> >
> > That's on the master side. On the regionserver side, it looks really
> > weird: it basically hums along doing the split and then at some point,
> > there's this:
> >
> > 11/11/08 00:43:40 INFO regionserver.Store: Added
> > hdfs://ip-10-84-202-94.ec2.internal:17020/hbase/TestLoadAndVerify_1320729464658/8bd8387431feec2b09983693dfac950b/f1/4fc67a93e580402190b5c8a72820f665,
> > entries=82049, sequenceid=142942, memsize=18.1m, filesize=4.4m
> > 11/11/08 00:43:40 INFO regionserver.HRegion: Finished memstore flush
> > of ~18.4m for region
> > TestLoadAndVerify_1320729464658,<\xA1\xAF(k\xCA\x1A\xEA,1320729465485.8bd8387431feec2b09983693dfac950b.
> > in 829ms, sequenceid=142942, compaction requested=false
> > 11/11/08 00:44:31 INFO zookeeper.ClientCnxn: Unable to read additional
> > data from server sessionid 0x133817270190001, likely server has closed
> > socket, closing socket connection and attempting reconnect
> > 11/11/08 00:44:31 INFO zookeeper.ClientCnxn: Unable to read additional
> > data from server sessionid 0x133817270190004, likely server has closed
> > socket, closing socket connection and attempting reconnect
> > 11/11/08 00:44:31 WARN util.Sleeper: We slept 38891ms instead of
> > 3000ms, this is likely due to a long garbage collecting pause and it's
> > usually bad, see
> > http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> > 11/11/08 00:44:31 FATAL regionserver.HRegionServer: ABORTING region
> > server ip-10-114-225-185.ec2.internal,60020,1320726988138: Unhandled
> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > rejected; currently processing
> > ip-10-114-225-185.ec2.internal,60020,1320726988138 as dead server
> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:222)
> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:148)
> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:750)
> >         at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1306)
> >
> >
> > Thanks,
> > Roman.
> >
>
