[
https://issues.apache.org/jira/browse/HBASE-27484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wellington Chevreuil resolved HBASE-27484.
------------------------------------------
Resolution: Fixed
Merged to master, branch-2, branch-2.5 and branch-2.4. Thanks for reviewing it,
[~psomogyi] !
> FNFE on StoreFileScanner after a flush followed by a compaction
> ---------------------------------------------------------------
>
> Key: HBASE-27484
> URL: https://issues.apache.org/jira/browse/HBASE-27484
> Project: HBase
> Issue Type: Bug
> Reporter: Wellington Chevreuil
> Assignee: Wellington Chevreuil
> Priority: Major
>
> One of our customers was running SyncTable from a 1.2-based cluster, where
> SyncTable map tasks were opening scanners on a 2.4-based cluster to compare
> the two clusters. A few of the map tasks failed with a DoNotRetryException
> caused by a FileNotFoundException propagating all the way up to the client:
> {noformat}
> Error: org.apache.hadoop.hbase.DoNotRetryIOException:
> org.apache.hadoop.hbase.DoNotRetryIOException: java.io.FileNotFoundException:
> open
> s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0
> at 7225 on
> s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0:
> com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does
> not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey;
> Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID:
> wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=;
> Proxy: null), S3 Extended Request ID:
> wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3712)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45819)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:392)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:140)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:359)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:339)
> Caused by: java.io.FileNotFoundException: open
> s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0
> at 7225 on
> s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0:
> com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does
> not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey;
> Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID:
> wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=;
> Proxy: null), S3 Extended Request ID:
> wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
> ...
> at
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:632)
> at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:315)
> at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:216)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:417)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.reopenAfterFlush(StoreScanner.java:1018)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:552)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:155)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:7399)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:7567)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:7331)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3373)
> {noformat}
> We can see in the RS logs that the above file had recently been created as
> the outcome of a memstore flush, with a compaction triggered shortly after:
> {noformat}
> 2022-11-11 22:16:50,322 INFO
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed memstore
> data size=208.15 KB at sequenceid=4949703 (bloomFilter=false),
> to=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0
> 2022-11-11 22:16:50,513 INFO org.apache.hadoop.hbase.regionserver.HStore:
> Added
> s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0,
> entries=951, sequenceid=4949703, filesize=26.2 K
> ...
> 2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Starting compaction of 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 in
> xxxxxx,IT001E90506702\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1618275339031.4c53da8c2ab9b7d7a0d6046ef3bb701c.
> 2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HStore:
> Starting compaction of
> [s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/9a865e5e3ba74ab6b9597fe0e295fd8d,
>
> s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/adbe118d8c76482d8faed3170437be0b,
>
> s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0]
> into
> tmpdir=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/.tmp,
> totalSize=27.1 M
> ...
> 2022-11-11 22:16:54,469 INFO org.apache.hadoop.hbase.regionserver.HStore:
> Completed compaction of 3 file(s) in 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 of
> 4c53da8c2ab9b7d7a0d6046ef3bb701c into
> bc44bfd06a03429fb331ab4dc10dbb8d(size=27.1 M), total size for store is 4.9 G.
> This selection was in queue for 0sec, and took 3sec to execute.
> {noformat}
> I believe this is an unlucky scenario where the compaction discharger moved
> the compacted-away files while the StoreFileScanner was being created but
> had not yet updated the refCounter on the file reader. We couldn't reproduce
> this on a real cluster, but I could emulate it with a UT by artificially
> inducing a delay in StoreFileScanner creation when creating scans not for
> compactions. One possible fix is to update the reader refCounter as soon as
> we get the files for the StoreFileScanner instances we are creating.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)