[
https://issues.apache.org/jira/browse/HBASE-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018791#comment-13018791
]
gaojinchao commented on HBASE-3722:
-----------------------------------
In my cluster:
1. The HDFS cluster has an HA namenode pair (ANN and BNN).
2. HBase version 0.90.1:
Active HMaster: C4C1
Backup HMaster: C4C2
Region servers: C4C3, C4C4, C4C5, ...
Operation:
1. The ANN crashed and the BNN became active (this takes some time).
2. Some region servers crashed (e.g. C4C3, which was hosting the meta table)
while an HBase client was putting data; other region servers stayed up.
3. The HMaster failed to split the HLog and skipped it.
4. The BNN became active and the HMaster finished processing the shutdown event.
5. A lot of the data on the crashed region servers was lost.
Log:
14:57:58 C4C3 shut itself down because the ANN crashed.
The master skipped the log split and reassigned the meta table.
2011-04-12 14:57:58,782 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs
for C4C3.site,60020,1302590910433
2011-04-12 14:57:59,790 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
....
2011-04-12 14:58:08,793 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: C4C1/157.5.100.1:9000. Already tried 9 time(s).
2011-04-12 14:58:08,795 ERROR org.apache.hadoop.hbase.master.MasterFileSystem:
Failed splitting hdfs://C4C1:9000/hbase/.logs/C4C3.site,60020,1302590910433
java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on connection
exception: java.net.ConnectException: Connection refused
2011-04-12 14:58:08,805 INFO
org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting ROOT region
location in ZooKeeper
2011-04-12 14:58:08,880 INFO org.apache.hadoop.hbase.catalog.CatalogTracker:
Failed verification of .META.,,1 at address=C4C3.site:60020;
java.net.ConnectException: Connection refused
2011-04-12 14:58:08,880 INFO org.apache.hadoop.hbase.catalog.CatalogTracker:
Current cached META location is not valid, resetting
The HMaster finished processing the shutdown event once the BNN became active
and the meta table was reassigned:
2011-04-12 15:00:31,681 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
2011-04-12 15:00:32,682 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: C4C1/157.5.100.1:9000. Already tried 1 time(s).
2011-04-12 15:00:40,698 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out: .META.,,1.1028785192 state=OPENING,
ts=1302591600701
2011-04-12 15:00:40,699 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2011-04-12 15:00:40,709 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Successfully transitioned region=.META.,,1.1028785192 into OFFLINE and forcing
a new assignment
2011-04-12 15:00:40,712 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out: -ROOT-,,0.70236052 state=OPENING,
ts=1302591600718
2011-04-12 15:00:40,712 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been OPENING for too long, reassigning region=-ROOT-,,0.70236052
2011-04-12 15:00:40,725 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Successfully transitioned region=-ROOT-,,0.70236052 into OFFLINE and forcing a
new assignment
2011-04-12 15:00:40,892 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker:
Detected completed assignment of META, notifying catalog tracker
2011-04-12 15:00:45,870 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 0
region(s) that C4C3.site,60020,1302590910433 was carrying (skipping 0
regions(s) that are already in transition)
2011-04-12 15:00:45,870 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished
processing of shutdown of C4C3.site,60020,1302590910433
The data in the skipped HLog is lost unless the HMaster is restarted after the
NN recovers.
So I think the HMaster should shut itself down when the NN crashes, just as a
region server shuts itself down when it catches any IO exception while rolling
the HLog.
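
The proposal above can be sketched roughly as follows. This is only a minimal
illustration of "abort instead of skipping the split", not the attached patch
and not actual HBase code: LogSplitter, Abortable and splitLogsOrAbort are
hypothetical names made up for this example.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types are real HBase APIs. */
public class SplitOrAbortSketch {

    /** Stand-in for the log-splitting step that talks to HDFS. */
    interface LogSplitter {
        void split(String serverName) throws IOException;
    }

    /** Stand-in for whatever the master uses to stop itself. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /**
     * Returns true only if the dead server's logs were split, so the caller
     * knows it is safe to go on and reassign regions. On an IOException the
     * master is aborted instead of silently skipping the split.
     */
    static boolean splitLogsOrAbort(String serverName, LogSplitter splitter, Abortable master) {
        try {
            splitter.split(serverName);
            return true;
        } catch (IOException e) {
            master.abort("Failed splitting logs for " + serverName, e);
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> aborts = new ArrayList<>();
        Abortable master = (why, cause) -> aborts.add(why);

        // Namenode reachable: the split succeeds and nothing is aborted.
        boolean ok = splitLogsOrAbort("C4C3", name -> { }, master);

        // Namenode down: the split throws and the master is aborted.
        boolean failed = splitLogsOrAbort("C4C3", name -> {
            throw new ConnectException("Connection refused");
        }, master);

        System.out.println(ok + " " + failed + " " + aborts.size()); // true false 1
    }
}
```

With this shape, the shutdown handler would reassign regions only when the
method returns true, so an unsplit HLog is never treated as processed.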
> A lot of data is lost when name node crashed
> ---------------------------------------------
>
> Key: HBASE-3722
> URL: https://issues.apache.org/jira/browse/HBASE-3722
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.1
> Reporter: gaojinchao
> Attachments: HmasterFilesystem_PatchV1.patch
>
>
> I'm not sure exactly what caused it. There are some failed log splits.
> The master should shut itself down when HDFS crashes.
> The logs are:
> 2011-03-22 13:21:55,056 WARN
> org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the
> logs
> java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on
> connection exception: java.net.ConnectException: Connection refused
> at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
> at org.apache.hadoop.ipc.Client.call(Client.java:820)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
> at $Proxy5.getListing(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> at $Proxy5.getListing(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:614)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:252)
> at
> org.apache.hadoop.hbase.master.LogCleaner.chore(LogCleaner.java:121)
> at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> at
> org.apache.hadoop.hbase.master.LogCleaner.run(LogCleaner.java:154)
> Caused by: java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
> at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:332)
> at
> org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:202)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:943)
> at org.apache.hadoop.ipc.Client.call(Client.java:788)
> ... 13 more
> 2011-03-22 13:21:56,056 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
> 2011-03-22 13:21:57,057 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 1 time(s).
> 2011-03-22 13:21:58,057 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 2 time(s).
> 2011-03-22 13:21:59,057 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 3 time(s).
> 2011-03-22 13:22:00,058 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 4 time(s).
> 2011-03-22 13:22:01,058 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 5 time(s).
> 2011-03-22 13:22:02,059 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 6 time(s).
> 2011-03-22 13:22:03,059 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 7 time(s).
> 2011-03-22 13:22:04,059 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 8 time(s).
> 2011-03-22 13:22:05,060 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 9 time(s).
> 2011-03-22 13:22:05,060 ERROR
> org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting
> hdfs://C4C1:9000/hbase/.logs/C4C9.site,60020,1300767633398
> java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on
> connection exception: java.net.ConnectException: Connection refused
> at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
> at org.apache.hadoop.ipc.Client.call(Client.java:820)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
> at $Proxy5.getFileInfo(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> at $Proxy5.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:623)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:461)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:690)
> at
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
> at
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
> at
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:95)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
> at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:332)
> at
> org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:202)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:943)
> at org.apache.hadoop.ipc.Client.call(Client.java:788)
> ... 18 more
> 2011-03-22 13:22:45,600 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
> 2011-03-22 13:22:46,600 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 1 time(s).
> 2011-03-22 13:22:47,601 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 2 time(s).
> 2011-03-22 13:22:48,601 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 3 time(s).
> 2011-03-22 13:22:49,601 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 4 time(s).
> 2011-03-22 13:22:50,602 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 5 time(s).
> 2011-03-22 13:22:51,602 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 6 time(s).
> 2011-03-22 13:22:52,602 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 7 time(s).
> 2011-03-22 13:22:53,603 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 8 time(s).
> 2011-03-22 13:22:54,603 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: C4C1/157.5.100.1:9000. Already tried 9 time(s).
> 2011-03-22 13:22:54,603 WARN
> org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the
> logs
> java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on
> connection exception: java.net.ConnectException: Connection refused
> at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
> at org.apache.hadoop.ipc.Client.call(Client.java:820)
> at org.apache.hadoop.ipc.RPC$Invok