[ https://issues.apache.org/jira/browse/HBASE-25229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225495#comment-17225495 ]

Jeongdae Kim commented on HBASE-25229:
--------------------------------------

I just added a test
(https://github.com/apache/hbase/pull/2602/commits/fd31fb8cdcb5f60fcfe2baa51053c424bcf401eb)
to clarify what the problem is and how this patch works.

This issue can be reproduced with my test by moving block cache creation to
after the region server's ephemeral node has been created.
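To make the ordering problem concrete, here is a hypothetical, simplified Java
model. It is not HBase code: `RegionServer`, `Master`, and their methods are
invented for illustration. The master is assumed to react as soon as the
ephemeral node appears, and opening a system region is modeled as failing while
the block cache is not ready yet. Moving block cache creation to after ephemeral
node creation, as the reproduction does, turns a successful first assignment
into a failed one.

```java
/** Hypothetical model of RS startup ordering; not HBase code. */
public class StartupOrderSketch {

    /** Minimal stand-in for a region server's readiness state. */
    static final class RegionServer {
        boolean blockCacheReady = false;

        void createBlockCache() {
            // Stands in for the slow off-heap BucketCache allocation.
            blockCacheReady = true;
        }

        void createEphemeralNode(Master master) {
            // Once the node exists, the master may assign regions here immediately.
            master.onRegionServerRegistered(this);
        }

        boolean openSystemRegion() {
            // Modeled as failing until the block cache has been created.
            return blockCacheReady;
        }
    }

    /** Stand-in master that moves system regions to a newly registered RS. */
    static final class Master {
        int failedMoves = 0;

        void onRegionServerRegistered(RegionServer rs) {
            if (!rs.openSystemRegion()) {
                failedMoves++;
            }
        }
    }

    /** Order described for branch-1: ephemeral node first, block cache later. */
    static int failedMovesWithOldOrder() {
        Master master = new Master();
        RegionServer rs = new RegionServer();
        rs.createEphemeralNode(master);
        rs.createBlockCache();
        return master.failedMoves;
    }

    /** Proposed order: block cache first, then ephemeral node. */
    static int failedMovesWithNewOrder() {
        Master master = new Master();
        RegionServer rs = new RegionServer();
        rs.createBlockCache();
        rs.createEphemeralNode(master);
        return master.failedMoves;
    }

    public static void main(String[] args) {
        System.out.println("old order failed moves: " + failedMovesWithOldOrder());
        System.out.println("new order failed moves: " + failedMovesWithNewOrder());
    }
}
```

Running `main` prints `old order failed moves: 1` and `new order failed moves: 0`.
In a real cluster the master keeps retrying, so the failures continue until the
cache allocation finishes; this model only counts the first attempt.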

 

Please have a look at my PR and let me know if you have any feedback.

> Instantiate BucketCache before RS creates their ephemeral node during
> rolling upgrade
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-25229
>                 URL: https://issues.apache.org/jira/browse/HBASE-25229
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 1.4.13
>            Reporter: Jeongdae Kim
>            Assignee: Jeongdae Kim
>            Priority: Minor
>
> We observed that many clients could not get region location information for
> tens of seconds during a rolling upgrade from 1.2.x to 1.4.x, and all requests
> to regions moved by graceful restart failed.
>
> The reasons are as follows:
> # Since HBASE-17931, system tables are assigned to the RS with the highest
> version.
> # Since HBASE-12034, bucket cache initialization has moved from RS
> instantiation to the RS initialization process that runs after reporting to
> the master. Moreover, the RS's ephemeral node is created before the bucket
> cache.
> # When an off-heap BucketCache is used, allocating its memory takes a long
> time (18 seconds for 31GB in our case):
> [https://github.com/apache/hbase/blob/branch-1.4/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferArray.java#L52-L72]
> # Once its ephemeral node is created, the master tries to move the system
> regions to the RS with the highest version, which, at the first RS restart of
> the rolling-restart process, is the newly restarted RS. However, because of
> 3), that RS is not yet ready to serve system regions, so moving them keeps
> failing until 3) finishes.
>
> I think this can happen only in branch-1, because in HBase 2.x the ephemeral
> node is created after the block caches. There is no need to create block
> caches after ephemeral node creation at all.
>
> I verified that this issue can be resolved simply by changing the creation
> order.
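The allocation cost in item 3 of the description can be illustrated with a
minimal sketch, assuming a pattern similar to the linked ByteBufferArray code:
the cache capacity is split into fixed-size chunks, each obtained with
`ByteBuffer.allocateDirect`. The class name, the 4 MB chunk size, and the 64 MB
demo capacity are assumptions for illustration, not HBase's actual code or
values. Because the JDK zeroes each direct buffer when it is allocated, the
total time grows roughly with the cache size, which is one plausible
contributor to the long startup delay on large caches.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical, simplified model of chunked off-heap allocation.
 * NOT the HBase ByteBufferArray class; it only mirrors the general pattern.
 */
public class OffHeapAllocationSketch {

    /** Allocates `capacity` bytes as direct buffers of at most `chunkSize` bytes each. */
    static ByteBuffer[] allocateChunks(long capacity, int chunkSize) {
        if (capacity <= 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("capacity and chunkSize must be positive");
        }
        int count = (int) ((capacity + chunkSize - 1) / chunkSize); // ceiling division
        ByteBuffer[] chunks = new ByteBuffer[count];
        long remaining = capacity;
        for (int i = 0; i < count; i++) {
            int size = (int) Math.min(chunkSize, remaining);
            // allocateDirect also zeroes the new memory, so cost scales with size.
            chunks[i] = ByteBuffer.allocateDirect(size);
            remaining -= size;
        }
        return chunks;
    }

    /** Sums the capacities of all chunks. */
    static long totalCapacity(ByteBuffer[] chunks) {
        long total = 0;
        for (ByteBuffer chunk : chunks) {
            total += chunk.capacity();
        }
        return total;
    }

    public static void main(String[] args) {
        long capacity = 64L * 1024 * 1024; // 64 MB demo size, far below 31GB
        int chunkSize = 4 * 1024 * 1024;   // assumed 4 MB chunk size
        long start = System.nanoTime();
        ByteBuffer[] chunks = allocateChunks(capacity, chunkSize);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(chunks.length + " chunks, " + totalCapacity(chunks)
                + " bytes, allocated in " + elapsedMs + " ms");
    }
}
```

Timing varies by machine, but scaling `capacity` up shows the allocation time
growing with the cache size, which is why doing this work before the ephemeral
node is published keeps the RS from being assigned regions it cannot serve yet.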



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
