[
https://issues.apache.org/jira/browse/HBASE-18771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156880#comment-16156880
]
Abhishek Singh Chouhan commented on HBASE-18771:
------------------------------------------------
Didn't see your comment [~ram_krish] before i submitted mine :)
I didn't look much into how the scanner got FNFE, but it was definitely there
in the logs. The way i reproduced the scenario was by running a load and also
running a chaos monkey that did splits,merges,compactions,snapshot and also
full table scans in parallel. I'll try to dig deeper into that too.
> So who is doing this incorrect refresh? Again is that due to scan failure?
Yes in the logs i had
RpcServer.FifoWFPBQ.default.handler=87,queue=7,port=16020: callId: 12290
service: ClientService methodName: Scan size: 47 connection: 10.231.90.15:50148
deadline: 1504528119024
org.apache.hadoop.hbase.UnknownScannerException: Throwing
UnknownScannerException to reset the client scanner state for clients older
than 1.3.
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3030)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:35072)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2370)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
Caused by: java.io.IOException: unable to read store file
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6394)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6021)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6164)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5938)
at
org.apache.phoenix.iterate.RegionScannerFactory$1.nextRaw(RegionScannerFactory.java:223)
at
org.apache.phoenix.coprocessor.DelegateRegionScanner.nextRaw(DelegateRegionScanner.java:77)
at
org.apache.phoenix.coprocessor.DelegateRegionScanner.nextRaw(DelegateRegionScanner.java:77)
at
org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:259)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2767)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2967)
... 5 more
Looks like with HBASE-17712 we no longer call refreshstorefiles in
handlefilenot found, so this might not happen.
> Incorrect StoreFileRefresh leading to split and compaction failures
> -------------------------------------------------------------------
>
> Key: HBASE-18771
> URL: https://issues.apache.org/jira/browse/HBASE-18771
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.3.1
> Reporter: Abhishek Singh Chouhan
> Assignee: Abhishek Singh Chouhan
> Priority: Blocker
> Fix For: 3.0.0, 1.4.0, 1.3.2, 1.5.0, 2.0.0-alpha-3
>
> Attachments: HBASE-18771.branch-1.3.001.patch,
> HBASE-18771.branch-1.3.002.patch
>
>
> We ran into issues of compaction and split failures with 1.3 similar to
> HBASE-18186 and HBASE-17406. Here's what i believe is happening -
> Lets say we have 4 store files that are compacted to form a new one. At this
> point we now have 5 store files, however only 1(the newly formed) is open now
> for the store and rest are waiting to get archived by HFileArchiver
> Now before the files are archived we get a FNFE in a scanner. This results in
> HRegion.RegionScannerImpl.handleFileNotFound(FileNotFoundException fnfe)
> being called which results in region.refreshStoreFiles(true) ->
> HStore.refreshStoreFiles()
> HStore.refreshStoreFiles now checks the hdfs dir and adds the previously
> compacted files back to the store, however these files are also present in
> StoreFileManager's compactedFiles list. Now at this point HFileArchiver runs,
> checks compactedFiles list and moves these files into the archive directory.
> Now when compaction runs it gets:
> 2017-09-04 12:30:13,899 ERROR [ctions-1504505399609]
> regionserver.CompactSplitThread - Compaction selection failed regionName =
> xxxx, storeName = 0, priority = 26, time = 1504528213899
> java.io.FileNotFoundException: File does not exist: hdfs://xxxx
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1337)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1329)
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1329)
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:422)
> at
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.getReferencedFileStatus(StoreFileInfo.java:342)
> at
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.getFileStatus(StoreFileInfo.java:355)
> at
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.getModificationTime(StoreFileInfo.java:360)
> at
> org.apache.hadoop.hbase.regionserver.StoreFile.getModificationTimeStamp(StoreFile.java:325)
> at
> org.apache.hadoop.hbase.regionserver.StoreUtils.getLowestTimestamp(StoreUtils.java:63)
> at
> org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy.shouldPerformMajorCompaction(RatioBasedCompactionPolicy.java:65)
> at
> org.apache.hadoop.hbase.regionserver.compactions.SortedCompactionPolicy.selectCompaction(SortedCompactionPolicy.java:82)
> at
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.select(DefaultStoreEngine.java:107)
> at
> org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1679)
> Similarly if a split happens after archival we fail after PONR while opening
> daughter regions due to FNFE. This results in parent offline and daughters
> also in a limbo since they're unable to open. Since we get the error after
> PONR we also end up aborting the RS.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)