[
https://issues.apache.org/jira/browse/HBASE-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Stack resolved HBASE-24454.
-----------------------------------
Fix Version/s: 2.2.6
1.7.0
2.3.0
3.0.0-alpha-1
Hadoop Flags: Reviewed
Assignee: Michael Stack (was: Jacob LeBlanc)
Resolution: Fixed
Pushed on branch-2.2+ and to branch-1.
> BucketCache disabled instantly before error duration toleration is reached
> due to timing issue
> ----------------------------------------------------------------------------------------------
>
> Key: HBASE-24454
> URL: https://issues.apache.org/jira/browse/HBASE-24454
> Project: HBase
> Issue Type: Bug
> Components: BucketCache
> Affects Versions: 1.4.10
> Environment: We saw this in HBase 1.4.10 (EMR 5.28.1), but I believe
> the newest code still has this problem.
> Reporter: Jacob LeBlanc
> Assignee: Michael Stack
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0, 1.7.0, 2.2.6
>
>
> We saw an instance where BucketCache was disabled after only two IO error
> were encountered at nearly the same time. The following shows all errors from
> a region server log for the 2020-05-26 17:00 hour. Notice that there are no
> other errors until the 17:14:50 at which time the BucketCache is disabled
> because it claims duration time has exceeded 60000 ms:
> $ grep ERROR hbase-hbase-regionserver-ip-172-20-113-147.log.2020-05-26-17
> 2020-05-26 17:14:50,396 ERROR [hfile-prefetch-1589117924884]
> bucket.BucketCache: Failed reading block
> 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache
> 2020-05-26 17:14:50,397 ERROR
> [regionserver/ip-172-20-113-147.us-west-2.compute.internal/172.20.113.147:16020-BucketCacheWriter-0]
> bucket.BucketCache: Failed syncing IO engine
> 2020-05-26 17:14:50,400 ERROR [hfile-prefetch-1589117924884]
> bucket.BucketCache: Failed reading block
> 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache
> 2020-05-26 17:14:50,401 ERROR [hfile-prefetch-1589117924884]
> bucket.BucketCache: IO errors duration time has exceeded 60000ms, disabling
> cache, please check your IOEngine
> The region server is very busy so it should be constantly getting successful
> reads and writes in the bucket cache (it is not as though there was some long
> ago error and then no successful IO to clear the error flag).
> We are running a busy EMR cluster backed by S3. A bucketcache getting
> disabled is a huge performance issue because all reads must go to S3.
> Looking at the code, I believe I've found a timing issue. Here is the code
> for checking and setting the ioErrorStartTime (taken from BucketCache.java):
>
> {code:java}
> /**
> * Check whether we tolerate IO error this time. If the duration of IOEngine
> * throwing errors exceeds ioErrorsDurationTimeTolerated, we will disable
> the
> * cache
> */
> private void checkIOErrorIsTolerated() {
> long now = EnvironmentEdgeManager.currentTime();
> if (this.ioErrorStartTime > 0) {
> if (cacheEnabled && (now - ioErrorStartTime) >
> this.ioErrorsTolerationDuration) {
> LOG.error("IO errors duration time has exceeded " +
> ioErrorsTolerationDuration +
> "ms, disabling cache, please check your IOEngine");
> disableCache();
> }
> } else {
> this.ioErrorStartTime = now;
> }
> }
> {code}
>
> And here is the code for clearing the ioErrorStartTime when a successful read
> or write is done:
> {code:java}
> if (this.ioErrorStartTime > 0) {
> ioErrorStartTime = -1;
> }
> {code}
> Notice that that if ioErrorStartTime is set to -1 after the first if
> statement in checkIOErrorIsTolerated but before (now - ioErrorStartTime),
> then (now - ioErrorStartTime) will evaluate to (now + 1) resulting in the
> code thinking it has exceeded ioErrorsTolerationDuration.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)