[ 
https://issues.apache.org/jira/browse/HBASE-11536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu Shaohui updated HBASE-11536:
--------------------------------

    Attachment: HBASE-11536-0.94-v1.diff

A patch for 0.94 that uses the regionserver timestamp as the version of the meta put.

> Puts of region location to Meta may be out of order, which causes inconsistency 
> of region location
> ------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-11536
>                 URL: https://issues.apache.org/jira/browse/HBASE-11536
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Liu Shaohui
>            Priority: Critical
>         Attachments: 10.237.12.13.log, 10.237.12.15.log, 
> HBASE-11536-0.94-v1.diff
>
>
> In a production HBase cluster, we found an inconsistency of region location in the 
> meta table. Region cdfa2ed711bbdf054d9733a92fd43eb5 is online on 
> regionserver 10.237.12.13:11600, but the region location in the meta table is 
> 10.237.12.15:11600.
> This is caused by out-of-order puts to the meta table.
> # HMaster tries to assign the region to 10.237.12.15:11600.
> # RegionServer 10.237.12.15:11600: while opening the region, the put of the 
> region location (10.237.12.15:11600) to the meta table times out (60s) and the 
> htable retries a second time. (The regionserver serving meta did receive the 
> put request. The timeout occurred because there is a bad disk on that 
> regionserver and the hlog sync is very slow.)
> During the retry in htable, the OpenRegionHandler times out (100s) and the 
> PostOpenDeployTasksThread is interrupted. Though the htable is eventually 
> closed in MetaEditor, the shared connection the htable used is not closed, and 
> the put call to the meta table is still in flight on that connection. Call 
> this in-flight put to meta call A.
> # RegionServer 10.237.12.15:11600: because of the OpenRegionHandler timeout, the 
> OpenRegionHandler marks the assignment state of this region as FAILED_OPEN.
> # HMaster watches for this FAILED_OPEN event and assigns the region to another 
> regionserver: 10.237.12.13:11600.
> # RegionServer 10.237.12.13:11600: this regionserver opens the region 
> successfully. Call the put of the region location (10.237.12.13:11600) to 
> the meta table from this regionserver call B.
> There is no ordering guarantee between calls A and B. If call A is processed 
> after call B on the regionserver serving the meta region, the region location 
> in the meta table will be wrong.
> A raw scan of the meta table shows:
> {code}
> scan '.META.', {RAW => true, LIMIT => 1, VERSIONS => 10, STARTROW => 
> 'xxx.adfa2ed711bbdf054d9733a92fd43eb5.'} 
> {code}
> {quote}
> xxx.adfa2ed711bbdf054d9733a92fd43eb5. column=info:server, 
> timestamp=1404885460553(=> Wed Jul 09 13:57:40 +0800 2014), 
> value=10.237.12.15:11600 --> Retry put from 10.237.12.15
> xxx.adfa2ed711bbdf054d9733a92fd43eb5. column=info:server, 
> timestamp=1404885456731(=> Wed Jul 09 13:57:36 +0800 2014), 
> value=10.237.12.13:11600 --> put from 10.237.12.13
>     
> xxx.adfa2ed711bbdf054d9733a92fd43eb5. column=info:server, 
> timestamp=1404885353122(=> Wed Jul 09 13:55:53 +0800 2014), 
> value=10.237.12.15:11600  --> First put from 10.237.12.15
> {quote}
> The related HBase logs are attached to this issue, and discussion is welcome.
> Because there is no ordering guarantee for puts from different htables, one 
> solution is to give each assignment of a region an increasing id and use this 
> id as the timestamp of the put of the region location to the meta table. 
> HBase clients will then read the region location with the largest assign id.
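
The race and the proposed fix can be illustrated with a small stand-alone sketch. It does not use the HBase API: `VersionedCell` is a hypothetical stand-in for the info:server column of one meta row, and the assign ids 1 and 2 are made up for illustration. The first case replays the three timestamps from the raw scan above, assuming they were set when each put was applied; the second case assumes each put carries the id of the assignment that produced it.

```java
import java.util.TreeMap;

public class MetaPutOrdering {

    /** Hypothetical stand-in for one meta column: every version is keyed by its timestamp. */
    static final class VersionedCell {
        private final TreeMap<Long, String> versions = new TreeMap<>();

        void put(long timestamp, String value) {
            versions.put(timestamp, value);
        }

        /** A normal (non-raw) read returns the value with the largest timestamp. */
        String latest() {
            return versions.lastEntry().getValue();
        }
    }

    static final String RS_15 = "10.237.12.15:11600";
    static final String RS_13 = "10.237.12.13:11600";

    /** Timestamps taken from the raw scan: the delayed retry (call A) is applied last. */
    static String withArrivalTimestamps() {
        VersionedCell cell = new VersionedCell();
        cell.put(1404885353122L, RS_15); // first put from 10.237.12.15
        cell.put(1404885456731L, RS_13); // call B from 10.237.12.13
        cell.put(1404885460553L, RS_15); // call A, the retried put
        return cell.latest();
    }

    /** Same arrival order, but each put carries a (hypothetical) assign id as its timestamp. */
    static String withAssignIdTimestamps() {
        long firstAssignId = 1L;  // assignment to 10.237.12.15
        long secondAssignId = 2L; // reassignment to 10.237.12.13
        VersionedCell cell = new VersionedCell();
        cell.put(firstAssignId, RS_15);  // first put
        cell.put(secondAssignId, RS_13); // call B
        cell.put(firstAssignId, RS_15);  // call A arrives last but keeps the older id
        return cell.latest();
    }

    public static void main(String[] args) {
        System.out.println("arrival timestamps:   " + withArrivalTimestamps());
        System.out.println("assign-id timestamps: " + withAssignIdTimestamps());
    }
}
```

With arrival-time versions the stale location 10.237.12.15:11600 wins. With assign ids the late retry only rewrites an older version, so readers still see 10.237.12.13:11600.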



--
This message was sent by Atlassian JIRA
(v6.2#6252)
