[ 
https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050665#comment-13050665
 ] 

Jean-Daniel Cryans commented on HBASE-3994:
-------------------------------------------

I'm still digging through my logs, but it appears that the region server took
40 seconds to open a single file from one of the daughters, and that's why the
clients eventually ran out of retries. It seemed at first that the client
didn't retry at all, but now I think we should just have a better error message.

> SplitTransaction has a window where clients can get RegionOfflineException
> --------------------------------------------------------------------------
>
>                 Key: HBASE-3994
>                 URL: https://issues.apache.org/jira/browse/HBASE-3994
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.90.4
>
>
> I just witnessed a job having failed tasks because of RegionOfflineException.
> This should normally happen only because the table is disabled, but it can
> also happen because the parent is offline. Probably 99.999% of the time users
> don't hit it because SplitTransaction is able to offline the parent and add
> the first daughter quickly enough, but in my case the cluster was so slow
> that I was able to see it.
> Maybe we should check in HCM not only if the region is offline but also if
> it's split, in which case we should retry?
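
For illustration only, the check proposed in the description might look
roughly like the sketch below. RegionState, SplitAwareLookup, decide and
locate are hypothetical names made up for this example; this is not the
actual HCM code or a proposed patch. It only shows the idea of treating
"offline and split" as retryable while "offline and not split" still fails.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for the region info a client reads from the catalog.
final class RegionState {
    final boolean offline;
    final boolean split;

    RegionState(boolean offline, boolean split) {
        this.offline = offline;
        this.split = split;
    }
}

// Hypothetical helper; not part of the HBase client.
final class SplitAwareLookup {
    enum Action { USE, RETRY, FAIL }

    // Decide what to do with the region returned by a lookup.
    static Action decide(RegionState r) {
        if (!r.offline) {
            return Action.USE;
        }
        // Offline and split: assumed to be the parent of a split in progress,
        // so the daughters should show up shortly and a retry makes sense.
        if (r.split) {
            return Action.RETRY;
        }
        // Offline and not split: most likely the table is disabled.
        return Action.FAIL;
    }

    // Consume successive lookup results; return the attempt index that found
    // a usable region. A real client would pause between attempts.
    static int locate(Deque<RegionState> lookups, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            RegionState r = lookups.poll();
            if (r == null) {
                break; // no more lookup results to examine
            }
            switch (decide(r)) {
                case USE:
                    return attempt;
                case FAIL:
                    throw new IllegalStateException("region offline, table may be disabled");
                case RETRY:
                    break; // leaves the switch only; the loop tries again
            }
        }
        throw new IllegalStateException("retries exhausted while parent region was split");
    }
}
```

Even with such a check, a daughter that takes longer to open than the total
retry window would still exhaust the retries, which is why a clearer error
message is useful in either case.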

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
