[ https://issues.apache.org/jira/browse/HBASE-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar resolved HBASE-7385.
----------------------------------
Resolution: Duplicate
Duplicate of HBASE-7507.
> Do not abort regionserver if StoreFlusher.flushCache() fails
> ------------------------------------------------------------
>
> Key: HBASE-7385
> URL: https://issues.apache.org/jira/browse/HBASE-7385
> Project: HBase
> Issue Type: Improvement
> Components: io, regionserver
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
>
> A rare NN failover may cause an RS abort, in the following sequence of events:
> - The RS tries to flush the memstore.
> - It creates a file, starts a block, and acquires a lease.
> - The block is completed and the lease removed, but before the NN sends the RPC response back to the client, the NN is killed.
> - The new NN comes up, the client retries the block complete, and the new NN throws a lease-expired exception since the block was already complete.
> - The RS receives the exception and aborts.
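> In RPC terms, something like the following happens (a hypothetical, simplified sketch, not actual HDFS code; complete() stands in for the ClientProtocol.complete() call visible in the trace below):
> {code}
> // Hypothetical sketch. complete() is not idempotent: the old NN applies
> // it but dies before replying, so the transparently retried call fails
> // against the new NN.
> namenode.complete(src, clientName);  // old NN closes the file; response lost
> // RetryInvocationHandler retries the same call against the new NN:
> namenode.complete(src, clientName);  // throws LeaseExpiredException
> {code}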
> This is actually an NN+DFSClient issue: the DFS client in the RS does not
> receive the RPC response for the block close, and upon retrying against the
> new NN it gets the exception, since the file was already closed. However,
> although this is DFS-client specific, we can also make the RS more resilient
> by not aborting it upon an exception from flushCache(). We can change
> StoreFlusher so that:
> - StoreFlusher.prepare() becomes idempotent (and so does Memstore.snapshot()).
> - StoreFlusher.flushCache() throws an IOException upon a DFS exception, but
> we catch the IOException and abort only the flush request (not the RS).
> - StoreFlusher.commit() still causes an RS abort on exception. This is also
> debatable: if DFS is alive and we can undo the flush changes, then we should
> not abort. A sketch of the resulting flush path follows this list.
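> Roughly (a hypothetical sketch of the proposed handling, not the actual
> HRegion code; names and signatures simplified):
> {code}
> // Sketch: only commit() failures remain fatal to the RS.
> boolean flushRegion(StoreFlusher flusher) throws DroppedSnapshotException {
>   flusher.prepare();        // idempotent: safe to re-run on a retried flush
>   try {
>     flusher.flushCache();   // may throw IOException on a DFS error
>   } catch (IOException ioe) {
>     LOG.warn("Flush failed; aborting the flush request only", ioe);
>     return false;           // give up this flush attempt, keep the RS alive
>   }
>   try {
>     flusher.commit();
>   } catch (IOException ioe) {
>     // A failed commit may leave the store inconsistent, so still abort.
>     throw new DroppedSnapshotException("commit failed: " + ioe.getMessage());
>   }
>   return true;
> }
> {code}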
> logs:
> {code}
> org.apache.hadoop.hbase.DroppedSnapshotException: region: loadtest_ha,e6666658,1355820729877.298bcbd550b80507a379fe67eefbe5ea.
> at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1485)
> at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1364)
> at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:896)
> at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:845)
> at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:119)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /apps/hbase/data/loadtest_ha/298bcbd550b80507a379fe67eefbe5ea/.tmp/5cf8951ee12449ce8e4e6dd0bf1645c2 File is not open for writing. [Lease. Holder: DFSClient_hb_rs_XXX,60020,1355813552066_203591774_25, pendingcreates: 1]
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1724)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1707)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1762)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1750)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:779)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
> at org.apache.hadoop.ipc.Client.call(Client.java:1107)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
> at $Proxy10.complete(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
> at $Proxy10.complete(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:4087)
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3988)
> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
> at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
> at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:255)
> at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:432)
> at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:1214)
> at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:762)
> at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:674)
> at org.apache.hadoop.hbase.regionserver.Store.access$400(Store.java:109)
> at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:2286)
> at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1460)
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira