[
https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082970#comment-13082970
]
ramkrishna.s.vasudevan commented on HBASE-4015:
-----------------------------------------------
@Stack
Thanks for your review.
{noformat}
The core prob. as per J-D above is that state transitions happen fine out on
the regionserver but the master lags processing them; meantime the timeout
monitor runs and presumes since its not seen the transition (that is likely in
queue to process), it preempts znode setting it OFFLINE.
{noformat}
I would like to clarify one point here:
-> The TimeoutMonitor {color:red}DOES NOT{color} preempt a znode to OFFLINE if the
region is in PENDING_OPEN state.
'assign(e.getKey(), false, e.getValue());'
Here we pass false for setOfflineInZK.
If you see the comments in HBASE-3937, JD had pointed out that making this
{color:red}'true' would lead to double assignment.{color}
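To make the point concrete, here is a minimal sketch of that call (the variable
names around it are illustrative, not the actual AssignmentManager code):
{noformat}
// Sketch only (names are illustrative, not the actual AssignmentManager code).
// For a region that timed out in PENDING_OPEN the TimeoutMonitor re-invokes
// assign() with setOfflineInZK = false, so the unassigned znode is NOT forced
// back to OFFLINE; passing 'true' here is what JD flagged in HBASE-3937 as
// leading to double assignment.
assign(timedOutRegion, false /* setOfflineInZK */, plannedServer);  // hypothetical variables
{noformat}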
Our solution has been drafted after carefully analysing and reproducing the
problem based on JD's comments in HBASE-3937 and the logs from that issue.
I would like to discuss the logs of HBASE-3937 to be more specific about why
the change has been done this way.
{noformat}
RS1 logs
========
2011-05-20 15:48:02,879 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:20020-0x3300c164fe0002c Successfully transitioned node
d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
RS_ZK_REGION_OPENING
2011-05-20 15:48:02,879 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region: REGION => {NAME =>
'ufdr,010066,1305873715825.9361f58931a310a62c15f501ce3261b6.', STARTKEY =>
'010066', ENDKEY => '010068', ENCODED => 9361f58931a310a62c15f501ce3261b6,
TABLE => {{NAME => 'ufdr', FAMILIES => [{NAME => 'value', BLOOMFILTER =>
'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'GZ', TTL =>
'432000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
2011-05-20 15:48:02,879 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region: REGION => {NAME =>
'ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.', STARTKEY =>
'001570', ENDKEY => '001572', ENCODED => d7555a12586e6c788ca55017224b5a51,
TABLE => {{NAME => 'ufdr', FAMILIES => [{NAME => 'value', BLOOMFILTER =>
'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'GZ', TTL =>
'432000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
{noformat}
{noformat}
2011-05-20 15:49:58,134 ERROR
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of
region=ufdr,010066,1305873715825.9361f58931a310a62c15f501ce3261b6.
java.io.IOException: Exception occured while connecting to the server
at
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
at
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
at
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
at $Proxy6.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
at
org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
{noformat}
{noformat}
RS1 logs (Failed to open here)
==============================
2011-05-20 17:00:37,753 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
transition from OFFLINE to OPENING for region=9361f58931a310a62c15f501ce3261b6
2011-05-20 17:00:37,753 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
hijacked? It no longer exists, encodedName=9361f58931a310a62c15f501ce3261b6
{noformat}
{noformat}
RS2 logs (Failed to open here)
=============================
2011-05-20 16:54:41,385 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
transition from OFFLINE to OPENING for region=9361f58931a310a62c15f501ce3261b6
2011-05-20 16:54:41,385 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
hijacked? It no longer exists, encodedName=9361f58931a310a62c15f501ce3261b6
{noformat}
{noformat}
RS3 logs (Failed to open here)
==============================
2011-05-20 16:45:29,477 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:45:29,477 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
{noformat}
Here the state of the znode has been changed from OFFLINE to OPENING.
But before the open operation failed, the TimeoutMonitor detected the timeout
and treated the region as PENDING_OPEN, even though the znode had already
moved to RS_ZK_REGION_OPENING.
Maybe if it had found the region to be OPENING it would have moved the znode
to the OFFLINE state.
Since the master does not preempt to OFFLINE and the master's in-memory state
stays PENDING_OPEN (there is no one to remove the RIT, so the TimeoutMonitor
keeps detecting it as PENDING_OPEN), every time this region is considered to
be hijacked and no one processes it, even if the open request goes to the
same RS.
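To illustrate why every retry dies the same way, here is a conceptual sketch of
the RS-side check (simplified, with illustrative names; it is not the actual
OpenRegionHandler code, and the -1 failure convention is an assumption):
{noformat}
// Sketch only. The RS may only move the unassigned znode from OFFLINE to
// OPENING. Because the znode is already sitting in RS_ZK_REGION_OPENING
// from the earlier failed open, this transition fails on every attempt.
int version = ZKAssign.transitionNodeOpening(server.getZooKeeper(),
    regionInfo, server.getServerName());            // assumed to return -1 on failure
if (version == -1) {
  LOG.warn("Failed transition from OFFLINE to OPENING for region=" + encodedName);
  LOG.warn("Region was hijacked? It no longer exists, encodedName=" + encodedName);
  return;   // nothing resets the znode, so the next assignment attempt ends up here again
}
{noformat}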
I just want to reiterate the comments given by JD in the defect HBASE-3937.
{noformat}
It should not create a list of unassigns and assigns, since by the time the
list is processed the situation probably changed (I witnessed that a lot).
This means the action should be taken as we go through first loop.'
{noformat}
This is what our patch does. No batch processing is done; we try to take
action as and when we detect that a timeout has occurred.
But again this alone may not be foolproof, because there is a chance, as in
the case above, that the timeout detects the region as PENDING_OPEN, but
since we do not move the znode to OFFLINE (in the master's memory the state
is PENDING_OPEN), the RS will say the region is hijacked as it cannot
transition it from OFFLINE to OPENING, and hence the problem prevails.
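A hedged sketch of that per-region handling, with a comment marking where the
residual hole sits (all names and helpers are illustrative, not the patch code):
{noformat}
// Sketch only: act on each timed-out region inside the scan itself rather
// than collecting a batch and replaying it after the situation has changed.
for (RegionState state : regionsInTransition.values()) {
  if (!state.isPendingOpen() || !hasTimedOut(state, now, timeoutMillis)) {  // hypothetical helper
    continue;
  }
  // Residual hole: the master's in-memory state is PENDING_OPEN, but the
  // znode may already be RS_ZK_REGION_OPENING. Since the znode is not moved
  // to OFFLINE here, the next open attempt fails the OFFLINE -> OPENING
  // transition on the RS and the region is again reported as hijacked.
  assign(state.getRegion(), false, plannedServerFor(state));               // hypothetical helper
}
{noformat}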
Coming back to JD's comments again
{noformat}
One of the major issues is the lack of atomicity, so any action taken should
first check the current state, keep the version number, decide of the
corrective measure and update the znode by expecting the version it first got.
{noformat}
Now comes the issue of how to know the exact state the znode is currently in
and what action to take.
If we move it to OFFLINE again there may be the problem of double assignment,
so we need to manage with some versioning as JD said.
Here, instead of versions, we opted for a new state; the reasons are as
follows:
{noformat}
// Initialize the znode version.
this.version =
ZKAssign.transitionNodeOpening(server.getZooKeeper(),
regionInfo, server.getServerName());
{noformat}
-> The RS doesn't have any previous version history here, so comparing the
new OFFLINE state with the previous OFFLINE state may be tricky (needs some
tweaking).
-> Introducing an intermediate state here would bring more clarity to the
code and the system.
That's why we planned to introduce RE_ALLOCATE. Adding the server name is an
additional precaution.
{noformat}
If the updating of the znode is successful, we know for sure that the operation
will be seen by the region servers.
{noformat}
So now what we do is detect the timeout and try moving the state of the
znode to RE_ALLOCATE.
If that really succeeds, all the RSs will know that an update has happened.
Now another RS (or the same RS) has the chance to operate on this new state
and will not report the region as hijacked.
{noformat}
If it's not successful, the situation needs to be reassessed.
{noformat}
If changing the state to RE_ALLOCATE is not successful, then what? Now the
master is aware that the RS has operated on the region and changed it to
another state, maybe OPENING or OPENED.
As we cannot move the state to OFFLINE in the znode, we are forced to have
some mechanism between the RS and the master to handle this problem. Hence
the new state RE_ALLOCATE came into the picture.
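To spell out the intended check-then-update, here is a hedged sketch of the
master-side step using plain ZooKeeper versioned updates (RE_ALLOCATE is the
new state proposed here; the method name, path argument, and data encoding are
illustrative assumptions, not the actual patch code):
{noformat}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch only: zk, path, and the RE_ALLOCATE payload are illustrative.
boolean tryMoveToReAllocate(ZooKeeper zk, String path, byte[] reAllocateData)
    throws KeeperException, InterruptedException {
  Stat stat = new Stat();
  zk.getData(path, false, stat);                 // read current state, remember its version
  try {
    // Conditional update: succeeds only if nobody changed the znode since we read it.
    zk.setData(path, reAllocateData, stat.getVersion());
    return true;   // any RS (the same one or another) now sees RE_ALLOCATE, not "hijacked"
  } catch (KeeperException.BadVersionException e) {
    // The RS got there first (e.g. moved the znode to OPENING/OPENED); the
    // master must reassess instead of forcing the znode to OFFLINE.
    return false;
  }
}
{noformat}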
Thus our current implementation not only addresses the time lag but also
maintains clear atomicity.
Stack, am I clear in my explanation?
Actually, before proposing the solution we went through JD's comments,
analysed the logs, and wanted to take care of all of JD's comments; in fact
we reproduced all the problems.
> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
> Key: HBASE-4015
> URL: https://issues.apache.org/jira/browse/HBASE-4015
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 0.90.3
> Reporter: Jean-Daniel Cryans
> Assignee: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: HBASE-4015_1_trunk.patch, Timeoutmonitor with state
> diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition
> generator, mostly making things worse rather than better. It does its own
> thing for a while without caring for what's happening in the rest of the
> master.
> The first thing that needs to happen is that the regions should not be
> processed in one big batch, because that sometimes can take minutes to
> process (meanwhile a region that timed out opening might have opened, then
> what happens is it will be reassigned by the TimeoutMonitor generating the
> never ending PENDING_OPEN situation).
> Those operations should also be done more atomically, although I'm not sure
> how to do it in a scalable way in this case.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira