[ https://issues.apache.org/jira/browse/HBASE-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128532#comment-16128532 ]

ramkrishna.s.vasudevan commented on HBASE-17406:
------------------------------------------------

After spending a considerable amount of time on this, I have tried different 
cases here.
First case
=======
A region undergoes compaction. There are no active scans. It goes for a split, 
and in parallel the CompactedFileDischarger also runs.
Say we had 6 files initially and 3 of them got compacted into 1 new file. So 
now under the StoreFileManager (SFM) we will have 4 active files (6 - 3 + 1), 
and 3 files in the compacted files list.
When the CompactedFileDischarger runs first, it will archive the 3 compacted 
files. When this region is then closed as part of split(), the close waits 
until the CompactedFileDischarger completes (due to the archiveLock). When the 
split parent region closes it has only those 4 files, on which it creates the 
references, and only those files are opened by the daughter regions.

Second case
=========
Same as above, but the close() of the parent region happens first. The close 
completes and creates references for the 4 files, and the 3 compacted files 
are then moved to the archive (the CompactedFileDischarger waits due to the 
archiveLock). So the new daughter regions work on the references created over 
the 4 files.

Third case
=======
Now consider that, as in the case above, a scanner was still active on those 3 
compacted files while the region was being closed. If the 
CompactedFileDischarger tries to archive those 3 files, it won't happen. When 
the parent close() happens, the scanner might or might not have been closed, 
so the 3 compacted files may still be present in the parent region directory 
(as they were not archived).
But HStore#close() is such that it only splits the store files that are in the 
SFM's active store file list when creating references. So the split happens 
only over the 4 active files, and the daughter regions seem to work fine, as 
there is no way for them to know about the compacted files in the parent.
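
A small sketch of that last point, again with hypothetical simplified types 
rather than the real HStore/HRegionFileSystem code: reference creation 
iterates only the SFM's active file list, so un-archived compacted files 
sitting in the parent directory never turn into daughter references:
{code:java}
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: daughters get one reference per *active* file;
// compacted files left behind by a pinned scanner are simply invisible.
public class SplitReferenceSketch {

  record StoreFiles(List<String> active, List<String> compacted) {}

  static Map<String, List<String>> createDaughterReferences(StoreFiles sfm) {
    Map<String, List<String>> refs = new TreeMap<>();
    for (String daughter : List.of("daughterA", "daughterB")) {
      refs.put(daughter, sfm.active().stream()
          .map(f -> f + ".ref-" + daughter)
          .toList());
    }
    return refs;
  }

  public static void main(String[] args) {
    StoreFiles sfm = new StoreFiles(
        List.of("f4", "f5", "f6", "c1"),  // 4 active files
        List.of("f1", "f2", "f3"));       // 3 compacted, not yet archived
    // The compacted files never appear in the output, so the daughters
    // cannot reference a file the archiver later removes.
    createDaughterReferences(sfm).forEach(
        (daughter, refs) -> System.out.println(daughter + " -> " + refs));
  }
}
{code}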

In none of the above cases do I see a clear way to get a split failure due to 
a file not found.

However, there is an extension to the above cases: a scan on the parent region 
runs so long that, after the daughter regions open, even the compaction of the 
daughter regions completes, so all the references to the parent region are 
removed. The Catalog Janitor then sees that the parent has no references and 
goes ahead and removes the region itself. (I have not reproduced this yet; 
just saying it theoretically.) But in this case I feel it could have happened 
even before HBASE-13082.
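
For that theoretical extension, a hedged sketch of the janitor decision as 
described above (hypothetical simplified code, not the actual CatalogJanitor): 
the parent becomes garbage once neither daughter holds reference files back to 
it, which is exactly when a still-running scan on the parent would start 
hitting FileNotFoundException:
{code:java}
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the parent is cleanable once no daughter holds
// reference files against it (e.g. after the daughters' compactions
// rewrote the referenced data into their own files).
public class CatalogJanitorSketch {

  static boolean parentIsCleanable(Map<String, Set<String>> daughterRefs) {
    return daughterRefs.values().stream().allMatch(Set::isEmpty);
  }

  public static void main(String[] args) {
    Map<String, Set<String>> refs = Map.of(
        "daughterA", Set.of(),
        "daughterB", Set.of());
    if (parentIsCleanable(refs)) {
      System.out.println("janitor: parent has no references, removing it");
      // A long-running scan still open on the parent's files would now
      // fail with FileNotFoundException once those files are deleted.
    }
  }
}
{code}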

Will dig in more and keep updating here. Will be back.


> Occasional failed splits on branch-1.3
> --------------------------------------
>
>                 Key: HBASE-17406
>                 URL: https://issues.apache.org/jira/browse/HBASE-17406
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Compaction, regionserver
>    Affects Versions: 1.3.0
>            Reporter: Mikhail Antonov
>             Fix For: 1.3.2
>
>
> We observed occasional (rare) failed splits on branch-1.3 builds that might 
> be another echo of HBASE-13082.
> Essentially here's what seems to be happening:
> First, the RS hosting the to-be-split parent sees some 
> FileNotFoundExceptions in the logs. It could be a simple file-not-found on 
> some scanner path:
> {quote}
> 16/11/21 07:19:28 WARN hdfs.DFSClient: DFS Read
> java.io.FileNotFoundException: File does not exist: <path to HFile>
>       at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>       at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> ....
>       at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:733)
>       at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1461)
>       at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1715)
>       at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1560)
>       at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:454)
>       at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:271)
>       at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:651)
>       at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:631)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:292)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:201)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:412)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.requestSeek(StoreFileScanner.java:375)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:310)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:889)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekToNextRow(StoreScanner.java:867)
>       at org.apache.hadoop.hbase.regionserver.Stor
> {quote}
> Or it could be a warning from HFileArchiver:
> {quote}
> 16/11/21 07:20:44 WARN backup.HFileArchiver: Failed to archive class 
> org.apache.hadoop.hbase.backup.HFileArchiver$FileableStoreFile, <HFile path> 
> because it does not exist! Skipping and continuing on.
> java.io.FileNotFoundException: File/Directory <HFile Path> does not exist.
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setTimes(FSDirAttrOp.java:121)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimes(FSNamesystem.java:1910)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setTimes(NameNodeRpcServer.java:1223)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setTimes(ClientNamenodeProtocolServerSideTranslatorPB.java:915)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>       at or
> {quote}
> Then on the RS hosting the parent I'm seeing:
> {quote}16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all 
> stores for the region=<region name>
> 16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores 
> for the <region name>
> 16/11/21 18:03:17 WARN ipc.Client: interrupted waiting to send rpc request to 
> server
> java.lang.InterruptedException
>       at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:191)
>       at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1060)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1455)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1413)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>       at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
>       at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>       at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
>       at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:302)
>       at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2112)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>       at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>       at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>       at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionFileSystem.createStoreDir(HRegionFileSystem.java:171)
>       at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:224)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5185)
>       at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:926)
>       at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:923)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> ...
> 16/11/21 18:03:17 FATAL regionserver.HRegionServer: ABORTING region server 
> <server name>: Abort; we got an error after point-of-no-return
> {quote}
> So we've got past the PONR and aborted; then on the RSs where the daughters 
> are to be opened I'm seeing:
> {quote}
> 16/11/21 18:03:43 ERROR handler.OpenRegionHandler: Failed open of region= 
> <region name>, starting to roll back the global memstore size.
> java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File 
> does not exist: < HFile name>
>       at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>       at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:952)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:827)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:802)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6708)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6669)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6640)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6596)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6547)
>       at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
>       at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {quote}
> And the regions remain offline. There's no data loss here, as the daughters 
> never open up, and the failed split can be recovered manually using the 
> following procedure:
>  - manually remove daughters from hbase:meta
>  - move daughter region HDFS directories out of the way
>  - delete the parent region from hbase:meta
>  - hbck -fixMeta to add the parent region back
>  - failover the active master
>  - hbck -fixAssignments after master startup


