Wellington Chevreuil created HBASE-27484:
--------------------------------------------
Summary: FNFE on StoreFileScanner after a flush followed by a
compaction
Key: HBASE-27484
URL: https://issues.apache.org/jira/browse/HBASE-27484
Project: HBase
Issue Type: Bug
Reporter: Wellington Chevreuil
Assignee: Wellington Chevreuil
One of our customers was running SyncTable from a 1.2-based cluster, where
SyncTable map tasks were opening scanners on a 2.4-based cluster to compare the
two clusters. A few of the map tasks failed with a DoNotRetryIOException caused
by a FileNotFoundException that blew all the way up to the client:
{noformat}
Error: org.apache.hadoop.hbase.DoNotRetryIOException: org.apache.hadoop.hbase.DoNotRetryIOException: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3712)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45819)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:392)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:140)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:359)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:339)
Caused by: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
	...
	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:632)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:315)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:216)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:417)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.reopenAfterFlush(StoreScanner.java:1018)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:552)
	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:155)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:7399)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:7567)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:7331)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3373)
{noformat}
We can see in the RS logs that the above file had recently been created as the
outcome of a memstore flush, with a compaction triggered shortly after:
{noformat}
2022-11-11 22:16:50,322 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed memstore data size=208.15 KB at sequenceid=4949703 (bloomFilter=false), to=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0
2022-11-11 22:16:50,513 INFO org.apache.hadoop.hbase.regionserver.HStore: Added s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0, entries=951, sequenceid=4949703, filesize=26.2 K
...
2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction of 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 in xxxxxx,IT001E90506702\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1618275339031.4c53da8c2ab9b7d7a0d6046ef3bb701c.
2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HStore: Starting compaction of [s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/9a865e5e3ba74ab6b9597fe0e295fd8d, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/adbe118d8c76482d8faed3170437be0b, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0] into tmpdir=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/.tmp, totalSize=27.1 M
...
2022-11-11 22:16:54,469 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed compaction of 3 file(s) in 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 of 4c53da8c2ab9b7d7a0d6046ef3bb701c into bc44bfd06a03429fb331ab4dc10dbb8d(size=27.1 M), total size for store is 4.9 G. This selection was in queue for 0sec, and took 3sec to execute.
{noformat}
I believe this is an unlucky scenario where the compaction discharger archived
the compacted-away files while the StoreFileScanner was being created but had
not yet updated the refCount on the file reader. We couldn't reproduce this on
a real cluster, but I could emulate it with a UT, artificially inducing a delay
in StoreFileScanner creation for scans that are not part of a compaction. One
possible fix is to bump the reader refCount as soon as we get the files for the
StoreFileScanner instances being created.
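The race and the proposed fix can be sketched as follows. This is a minimal,
self-contained simulation: the class and method names (StoreFileRef,
tryArchive, openScannerLate, openScannerEarly) are illustrative stand-ins, not
HBase's actual classes, and the discharger is modeled as a callback run at the
worst possible moment rather than a real background chore.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for a compacted-away store file tracked by a refCount.
class StoreFileRef {
    final AtomicInteger refCount = new AtomicInteger(0);
    volatile boolean archived = false;

    // Compaction discharger: archives the file only if no scanner holds a ref.
    boolean tryArchive() {
        if (refCount.get() == 0) {
            archived = true; // file removed from the filesystem
            return true;
        }
        return false;
    }
}

public class RefCountRace {
    // Buggy ordering: the scanner selects the file first and only bumps the
    // refCount later, leaving a window in which the discharger sees zero refs.
    static boolean openScannerLate(StoreFileRef f, Runnable discharger) {
        discharger.run();             // discharger wins the race, archives file
        f.refCount.incrementAndGet(); // too late: the file is already gone
        return !f.archived;           // false => seek would hit FileNotFoundException
    }

    // Proposed fix: bump the refCount as soon as the file is selected for the
    // scanner, so tryArchive() sees a non-zero count and skips the file.
    static boolean openScannerEarly(StoreFileRef f, Runnable discharger) {
        f.refCount.incrementAndGet(); // pin the file before anything else
        discharger.run();             // discharger now leaves the file alone
        return !f.archived;           // true => scanner can read safely
    }

    public static void main(String[] args) {
        StoreFileRef a = new StoreFileRef();
        System.out.println("buggy ordering survives: "
            + openScannerLate(a, a::tryArchive));   // prints false

        StoreFileRef b = new StoreFileRef();
        System.out.println("fixed ordering survives: "
            + openScannerEarly(b, b::tryArchive));  // prints true
    }
}
```

The unit test mentioned above follows the same idea: inject a delay between
file selection and the refCount update for non-compaction scans, and let the
discharger run inside that window.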
--
This message was sent by Atlassian Jira
(v8.20.10#820010)