[jira] [Resolved] (HBASE-24454) BucketCache disabled instantly before error duration toleration is reached due to timing issue

Michael Stack (Jira) Sat, 30 May 2020 21:13:04 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael Stack resolved HBASE-24454.
-----------------------------------
    Fix Version/s: 2.2.6
                   1.7.0
                   2.3.0
                   3.0.0-alpha-1
     Hadoop Flags: Reviewed
         Assignee: Michael Stack  (was: Jacob LeBlanc)
       Resolution: Fixed

Pushed on branch-2.2+ and to branch-1.



> BucketCache disabled instantly before error duration toleration is reached 
> due to timing issue
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-24454
>                 URL: https://issues.apache.org/jira/browse/HBASE-24454
>             Project: HBase
>          Issue Type: Bug
>          Components: BucketCache
>    Affects Versions: 1.4.10
>         Environment: We saw this in HBase 1.4.10 (EMR 5.28.1), but I believe 
> the newest code still has this problem.
>            Reporter: Jacob LeBlanc
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.3.0, 1.7.0, 2.2.6
>
>
> We saw an instance where BucketCache was disabled after only two IO error 
> were encountered at nearly the same time. The following shows all errors from 
> a region server log for the 2020-05-26 17:00 hour. Notice that there are no 
> other errors until the 17:14:50 at which time the BucketCache is disabled 
> because it claims duration time has exceeded 60000 ms:
> $ grep ERROR hbase-hbase-regionserver-ip-172-20-113-147.log.2020-05-26-17
>  2020-05-26 17:14:50,396 ERROR [hfile-prefetch-1589117924884] 
> bucket.BucketCache: Failed reading block 
> 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache
>  2020-05-26 17:14:50,397 ERROR 
> [regionserver/ip-172-20-113-147.us-west-2.compute.internal/172.20.113.147:16020-BucketCacheWriter-0]
>  bucket.BucketCache: Failed syncing IO engine
>  2020-05-26 17:14:50,400 ERROR [hfile-prefetch-1589117924884] 
> bucket.BucketCache: Failed reading block 
> 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache
>  2020-05-26 17:14:50,401 ERROR [hfile-prefetch-1589117924884] 
> bucket.BucketCache: IO errors duration time has exceeded 60000ms, disabling 
> cache, please check your IOEngine
> The region server is very busy so it should be constantly getting successful 
> reads and writes in the bucket cache (it is not as though there was some long 
> ago error and then no successful IO to clear the error flag).
> We are running a busy EMR cluster backed by S3. A bucketcache getting 
> disabled is a huge performance issue because all reads must go to S3.
> Looking at the code, I believe I've found a timing issue. Here is the code 
> for checking and setting the ioErrorStartTime (taken from BucketCache.java):
>  
> {code:java}
>   /**
>    * Check whether we tolerate IO error this time. If the duration of IOEngine
>    * throwing errors exceeds ioErrorsDurationTimeTolerated, we will disable 
> the
>    * cache
>    */
>   private void checkIOErrorIsTolerated() {
>     long now = EnvironmentEdgeManager.currentTime();
>     if (this.ioErrorStartTime > 0) {
>       if (cacheEnabled && (now - ioErrorStartTime) > 
> this.ioErrorsTolerationDuration) {
>         LOG.error("IO errors duration time has exceeded " + 
> ioErrorsTolerationDuration +
>           "ms, disabling cache, please check your IOEngine");
>         disableCache();
>       }
>     } else {
>       this.ioErrorStartTime = now;
>     }
>   }
> {code}
>  
> And here is the code for clearing the ioErrorStartTime when a successful read 
> or write is done:
> {code:java}
>   if (this.ioErrorStartTime > 0) {
>     ioErrorStartTime = -1;
>   }
> {code}
> Notice that that if ioErrorStartTime is set to -1 after the first if 
> statement in checkIOErrorIsTolerated but before (now - ioErrorStartTime), 
> then (now - ioErrorStartTime) will evaluate to (now + 1) resulting in the 
> code thinking it has exceeded ioErrorsTolerationDuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Resolved] (HBASE-24454) BucketCache disabled instantly before error duration toleration is reached due to timing issue

Reply via email to