[ https://issues.apache.org/jira/browse/HBASE-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731061#action_12731061 ]
Jean-Daniel Cryans commented on HBASE-1439: ------------------------------------------- This log from a recent trunk makes me think we should still do that special watcher (the master split logs around 14:11:58): {code} 2009-07-13 14:10:44,219 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region t3,,1247502416090 in 0sec 2009-07-13 14:10:46,735 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE: t3,******************,1247507770509: Overloaded 2009-07-13 14:10:46,743 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_CLOSE: t3******************,1247507770509: Overloaded 2009-07-13 14:10:46,743 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing t3,******************,1247507770509: compactions & flushes disabled 2009-07-13 14:10:46,743 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region, no outstanding scanners on t3,**************,1247507770509 2009-07-13 14:10:46,744 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: No more row locks outstanding on region t3,U**************9,1247507770509 2009-07-13 14:10:46,744 DEBUG org.apache.hadoop.hbase.regionserver.Store: closed block 2009-07-13 14:10:46,744 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed t3,,******************,1247507770509 2009-07-13 14:11:54,283 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=336.81897MB (353180272), Free=59.856064MB (62763632), Max=396.67502MB (415943904), Counts: Blocks=5251, Access=259404, Hit=5232, Miss=2 54172, Evictions=13, Evicted=8223, Ratios: Hit Ratio=2.016931027173996%, Miss Ratio=97.98306822776794%, Evicted/Run=632.5384521484375 2009-07-13 14:11:54,375 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x12274aa40060001 to sun.nio.ch.selectionkeyi...@2b20bf2c java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858) 2009-07-13 14:11:54,375 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22274aa40700002 to sun.nio.ch.selectionkeyi...@12746ad0 java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858) 2009-07-13 14:11:58,996 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Got ZooKeeper event, state: Disconnected, type: None, path: null 2009-07-13 14:11:59,002 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 0) org.apache.hadoop.hbase.Leases$LeaseStillHeldException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) ... 2009-07-13 14:11:59,097 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Exceeded max retries: 10 org.apache.hadoop.hbase.Leases$LeaseStillHeldException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94) at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48) at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:561) at java.lang.Thread.run(Thread.java:619) 2009-07-13 14:11:59,262 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server **************/**************:2181 2009-07-13 14:11:59,262 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/**************53663 remote=**************/**************:2181] 2009-07-13 14:11:59,262 INFO org.apache.zookeeper.ClientCnxn: Server connection successful 2009-07-13 14:11:59,635 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server **************/**************:2181 2009-07-13 14:11:59,635 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/**************:44760 remote=**************/**************:2181] 2009-07-13 14:11:59,635 INFO org.apache.zookeeper.ClientCnxn: Server connection successful 2009-07-13 14:12:00,103 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Expired, type: None, path: null 2009-07-13 14:12:00,104 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x12274aa40060001 to sun.nio.ch.selectionkeyi...@5e297e83 java.io.IOException: Session Expired at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:548) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:661) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:897) 2009-07-13 14:12:00,104 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired 2009-07-13 14:12:00,104 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Restarting Region Server 2009-07-13 14:12:00,105 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Exceeded max retries: 10 org.apache.hadoop.hbase.Leases$LeaseStillHeldException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94) at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48) at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:561) at java.lang.Thread.run(Thread.java:619) 2009-07-13 14:12:00,106 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=4, stores=4, storefiles=7, storefileIndexSize=4, memstoreSize=96, usedHeap=1153, maxHeap=1983, blockCacheSize=353180272, blockCacheFree=62763632, blockCacheCount=5251, blockCacheHitRatio=2 2009-07-13 14:12:00,110 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 9 on 60020 took 1033ms appending an edit to hlog; editcount=60781 2009-07-13 14:12:04,256 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 9 on 60020 took 4128ms appending an edit to hlog; editcount=62375 2009-07-13 14:12:04,257 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stop was requested, clearing the toDo despite of the exception 2009-07-13 14:12:04,257 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22274aa40700002 to sun.nio.ch.selectionkeyi...@54f015d5 java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858) 2009-07-13 14:12:04,257 DEBUG org.apache.hadoop.hbase.RegionHistorian: Offlined 2009-07-13 14:12:04,257 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020 2009-07-13 14:12:04,257 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60020: exiting 2009-07-13 14:12:04,257 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting 2009-07-13 14:12:04,258 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020 2009-07-13 14:12:04,282 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer 2009-07-13 14:12:04,286 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver/**************:60020.cacheFlusher exiting 2009-07-13 14:12:04,286 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting. 2009-07-13 14:12:04,286 INFO org.apache.hadoop.hbase.Leases: regionserver/**************:60020.leaseChecker closing leases 2009-07-13 14:12:04,286 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting 2009-07-13 14:12:04,286 INFO org.apache.hadoop.hbase.Leases: regionserver/**************:60020.leaseChecker closed leases 2009-07-13 14:12:04,318 INFO org.apache.hadoop.hbase.regionserver.LogFlusher: regionserver/**************:60020.logFlusher exiting 2009-07-13 14:12:04,484 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder 2009-07-13 14:12:04,494 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020: exiting 2009-07-13 14:12:04,512 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60020: exiting 2009-07-13 14:12:04,512 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60020: exiting 2009-07-13 14:12:04,512 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020: exiting 2009-07-13 14:12:04,512 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020: exiting 2009-07-13 14:12:04,512 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020: exiting 2009-07-13 14:12:04,513 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60020: exiting 2009-07-13 14:12:05,076 DEBUG org.apache.hadoop.hbase.regionserver.HLog: closing hlog writer in hdfs://**************:50000/hbase/.logs/**************,60020,1247497897306 2009-07-13 14:12:05,076 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: java.io.IOException: Cannot append; log is closed 2009-07-13 14:12:05,076 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver/**************:60020.compactor exiting 2009-07-13 14:12:05,076 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: regionserver/**************:60020.majorCompactionChecker exiting 2009-07-13 14:12:05,138 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to close log in abort java.io.IOException: java.io.IOException: Could not complete write to file /hbase/.logs/**************60020,1247497897306/hlog.dat.1247508029747 by DFSClient_51605845 at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:449) at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94) at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:626) at java.lang.Thread.run(Thread.java:619) 2009-07-13 14:12:05,138 INFO org.apache.hadoop.fs.FileSystem: Could not cancel cleanup thread, though no FileSystems are open 2009-07-13 14:12:05,138 ERROR org.apache.hadoop.hbase.util.FSUtils: file system close failed: java.lang.NullPointerException at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeThreads(DFSClient.java:3127) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3170) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3115) at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:1002) at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:208) at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:269) at org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:119) at org.apache.hadoop.hbase.regionserver.HRegionServer.checkFileSystem(HRegionServer.java:861) at org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:809) at org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:792) at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1793) at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:913) 2009-07-13 14:12:05,138 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: closing region t3,************** 2009-07-13 14:12:05,138 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down HRegionServer: file system not available java.io.IOException: File system is not available at org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:123) at org.apache.hadoop.hbase.regionserver.HRegionServer.checkFileSystem(HRegionServer.java:861) at org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:809) at org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:792) at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1793) at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:913) Caused by: java.io.IOException: Call to **************1/**************:50000 failed on local exception: java.nio.channels.ClosedChannelException at org.apache.hadoop.ipc.Client.wrapException(Client.java:774) at org.apache.hadoop.ipc.Client.call(Client.java:742) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy1.getFileInfo(Unknown Source) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:587) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643) at org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:112) ... 9 more Caused by: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:113) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:156) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) 2009-07-13 14:12:05,138 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing t3,**************: compactions & flushes disabled 2009-07-13 14:12:05,143 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region, no outstanding scanners on t3,************** 2009-07-13 14:12:05,143 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=4, stores=4, storefiles=7, storefileIndexSize=4, memstoreSize=96, usedHeap=1156, maxHeap=1983, blockCacheSize=353180272, blockCacheFree=62763632, blockCacheCount=5251, blockCacheHitRatio=2 2009-07-13 14:12:05,143 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: No more row locks outstanding on region t3,************** 2009-07-13 14:12:05,144 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020, call put([...@1df68388, [Lorg.apache.hadoop.hbase.client.Put;@23356516) from **************:48126: error: java.io.IOException: Cannot append; log is closed java.io.IOException: Cannot append; log is closed at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:584) at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1446) at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1246) at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1209) at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1785) at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:913) 2009-07-13 14:12:05,151 DEBUG org.apache.hadoop.hbase.regionserver.Store: closed block 2009-07-13 14:12:05,152 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed t3,************** 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: closing region t3,************** 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing t3,**************: compactions & flushes disabled 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region, no outstanding scanners on t3************** 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: No more row locks outstanding on region t3,************** 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.Store: closed block 2009-07-13 14:12:05,152 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed t3,************** 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: closing region t3,,1247502416090 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing t3,,1247502416090: compactions & flushes disabled 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region, no outstanding scanners on t3,,1247502416090 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: No more row locks outstanding on region t3,,1247502416090 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.Store: closed block 2009-07-13 14:12:05,152 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed t3,,1247502416090 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: closing region t3,************** 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing t3,**************: compactions & flushes disabled 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region, no outstanding scanners on t3,************** 2009-07-13 14:12:05,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: No more row locks outstanding on region t3,************** -07-13 14:12:05,153 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([...@1df68388, [Lorg.apache.hadoop.hbase.client.Put;@23356516) from **************:48126: output error 2009-07-13 14:12:05,156 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1122) at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:613) at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:677) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:941) 2009-07-13 14:12:05,156 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020: exiting 2009-07-13 14:12:05,156 DEBUG org.apache.hadoop.hbase.regionserver.Store: closed block 2009-07-13 14:12:05,156 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed t3,************** 2009-07-13 14:12:05,156 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: **************:60020 2009-07-13 14:12:05,157 INFO org.apache.zookeeper.ZooKeeper: Closing session: 0x12274aa40060001 2009-07-13 14:12:05,157 INFO org.apache.zookeeper.ClientCnxn: Closing ClientCnxn for session: 0x12274aa40060001 2009-07-13 14:12:05,157 INFO org.apache.zookeeper.ClientCnxn: Disconnecting ClientCnxn for session: 0x12274aa40060001 2009-07-13 14:12:05,157 INFO org.apache.zookeeper.ZooKeeper: Session: 0x12274aa40060001 closed 2009-07-13 14:12:05,157 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper 2009-07-13 14:12:05,157 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/**************:60020 exiting 2009-07-13 14:12:05,172 INFO org.apache.hadoop.hbase.ipc.HBaseRpcMetrics: Initializing RPC Metrics with hostName=HRegionServer, port=60020 2009-07-13 14:12:05,181 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: globalMemStoreLimit=793.4m, globalMemStoreLimitLowMark=495.8m, maxHeap=1.9g 2009-07-13 14:12:05,181 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread. 2009-07-13 14:12:05,195 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete 2009-07-13 14:12:05,195 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Runs every 10000000ms 2009-07-13 14:12:05,219 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server **************/**************:2181 2009-07-13 14:12:05,219 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/**************:33151 remote=**************/**************:2181] 2009-07-13 14:12:05,219 INFO org.apache.zookeeper.ClientCnxn: Server connection successful 2009-07-13 14:12:05,380 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, host=*****************:2181 sessionTimeout=30000 watcher=org.apache.hadoop.hbase.regionserver.hregionser...@47315d34 2009-07-13 14:12:10,636 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2009-07-13 14:12:10,636 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server **************/**************:2181 2009-07-13 14:12:10,636 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:710) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:425) at java.lang.Thread.run(Thread.java:619) 2009-07-13 14:12:10,636 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Unhandled exception. Aborting... java.io.IOException: Region server startup failed at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:829) at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:748) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:425) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:710) ... 2 more 2009-07-13 14:12:10,636 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=4, stores=4, storefiles=7, storefileIndexSize=4, memstoreSize=96, usedHeap=1173, maxHeap=1983, blockCacheSize=353180272, blockCacheFree=62763632, blockCacheCount=5251, blockCacheHitRatio=2 2009-07-13 14:12:10,636 DEBUG org.apache.hadoop.hbase.RegionHistorian: Offlined 2009-07-13 14:12:11,084 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/**************:52015 remote=**************/**************:2181] 2009-07-13 14:12:11,084 INFO org.apache.zookeeper.ClientCnxn: Server connection successful 2009-07-13 14:12:11,086 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020 2009-07-13 14:12:11,086 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22274aa40700002 to sun.nio.ch.selectionkeyi...@411cac1d java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858) 2009-07-13 14:12:11,090 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer 2009-07-13 14:12:11,135 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: **************:60020 2009-07-13 14:12:11,135 INFO org.apache.zookeeper.ZooKeeper: Closing session: 0x0 2009-07-13 14:12:11,135 INFO org.apache.zookeeper.ClientCnxn: Closing ClientCnxn for session: 0x0 2009-07-13 14:12:11,365 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: None, path: null 2009-07-13 14:12:11,365 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Ignoring ZooKeeper event while shutting down 2009-07-13 14:12:11,417 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server **************/**************:2181 2009-07-13 14:12:11,421 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/**************52016 remote=**************/**************:2181] 2009-07-13 14:12:11,421 INFO org.apache.zookeeper.ClientCnxn: Server connection successful 2009-07-13 14:12:11,422 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22274aa40700002 to sun.nio.ch.selectionkeyi...@44a45d2c java.io.IOException: Session Expired at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:548) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:661) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:897) 2009-07-13 14:12:11,422 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Got ZooKeeper event, state: Expired, type: None, path: null 2009-07-13 14:12:11,422 INFO org.apache.zookeeper.ZooKeeper: Closing session: 0x22274aa40700002 2009-07-13 14:12:11,422 INFO org.apache.zookeeper.ClientCnxn: Closing ClientCnxn for session: 0x22274aa40700002 2009-07-13 14:12:11,422 INFO org.apache.zookeeper.ClientCnxn: Disconnecting ClientCnxn for session: 0x22274aa40700002 2009-07-13 14:12:11,422 INFO org.apache.zookeeper.ZooKeeper: Session: 0x22274aa40700002 closed 2009-07-13 14:12:11,422 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper 2009-07-13 14:12:11,422 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2009-07-13 14:12:11,997 INFO org.apache.zookeeper.ClientCnxn: Exception while closing send thread for session 0x12274aa40060005 : Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] 2009-07-13 14:12:12,101 INFO org.apache.zookeeper.ClientCnxn: Disconnecting ClientCnxn for session: 0x12274aa40060005 2009-07-13 14:12:12,101 INFO org.apache.zookeeper.ZooKeeper: Session: 0x12274aa40060005 closed 2009-07-13 14:12:12,101 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2009-07-13 14:12:12,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper 2009-07-13 14:12:12,149 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/**************:60020 exiting {code} > race between master and regionserver after missed heartbeat > ----------------------------------------------------------- > > Key: HBASE-1439 > URL: https://issues.apache.org/jira/browse/HBASE-1439 > Project: Hadoop HBase > Issue Type: Bug > Affects Versions: 0.19.1 > Environment: CentOS 5.2 x86_64, HBase 0.19.1, Hadoop 0.19.1 > Reporter: Andrew Purtell > Priority: Blocker > Fix For: 0.20.1 > > > Seen on one of our 0.19.1 clusters: > {code} > java.io.FileNotFoundException: File does not exist: > hdfs://jdc2-atr-dc-2.atr.trendmicro.com:50000 > /data/hbase/log_10.3.134.207_1242286427894_60020/hlog.dat.1242528291898 > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:415) > at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:679) > at > org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1431) > at > org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1426) > at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:753) > at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:716) > at > org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:249) > at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:442) > at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:377) > 2009-05-17 04:05:55,481 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: process > shutdown of server 10.3.134.207:60020: logSplit: false, rootRescanned: false, > numberOfMetaRegions: 1, > onlineMetaRegions.size(): 1 > {code} > I do not have the region server log yet, but here is my conjecture: > Here, the write ahead log for 10.3.134.207 is missing in DFS: > java.io.FileNotFoundException: > hdfs://jdc2-atr-dc-2.atr.trendmicro.com:50000/data/hbase/log_10.3.134.207_1242286427894_60020/hlog.dat.1242528291898 > when the master tries to split it after declaring the region server crashed. > There have been recent trouble reports on this cluster that indicate severe > memory stress, e.g. kernel panics due to OOM. Based on that I think it is > likely that the region server here missed a heartbeat so the master declared > it crashed and began to split the log. But, the log was then deleted out from > underneath the master's split thread. I think the region server was actually > still running but partially swapped out or the node was otherwise overloaded > so it missed its heartbeat. Then, when the region server came back after > being swapped in, it realized it missed its heartbeat and shut down, deleting > the log as is normally done. > Even if the above is not the actual cause in this case, I think the scenario > is plausible. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.