[
https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082970#comment-13082970
]
ramkrishna.s.vasudevan commented on HBASE-4015:
-----------------------------------------------
@Stack
Thanks for your review.
{noformat}
The core prob. as per J-D above is that state transitions happen fine out on
the regionserver but the master lags processing them; meantime the timeout
monitor runs and presumes since its not seen the transition (that is likely in
queue to process), it preempts znode setting it OFFLINE.
{noformat}
I would like to clarify one point here:
-> The TimeoutMonitor {color:red}DOES NOT{color} preempt a znode to OFFLINE if the
region is in PENDING_OPEN state.
'assign(e.getKey(), false, e.getValue());'
Here we pass false for setOfflineInZK.
If you see the comments in HBASE-3937, JD had pointed out that making this
{color:red}'true' would lead to double assignment.{color}
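To make the point concrete, here is a minimal sketch of that call (the variable
names around it are illustrative, not the actual AssignmentManager code):
{noformat}
// Sketch only (names are illustrative, not the actual AssignmentManager code).
// For a region that timed out in PENDING_OPEN the TimeoutMonitor re-invokes
// assign() with setOfflineInZK = false, so the unassigned znode is NOT forced
// back to OFFLINE; passing 'true' here is what JD flagged in HBASE-3937 as
// leading to double assignment.
assign(timedOutRegion, false /* setOfflineInZK */, plannedServer);  // hypothetical variables
{noformat}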
Our solution has been drafted after carefully analysing and reproducing the
problem based on JD's comments in HBASE-3937 and the logs from that issue.
I would like to discuss the logs of HBASE-3937 to be more specific about why
the change has been done this way.
{noformat}
RS1 logs
========
2011-05-20 15:48:02,879 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:20020-0x3300c164fe0002c Successfully transitioned node
d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
RS_ZK_REGION_OPENING
2011-05-20 15:48:02,879 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region: REGION => {NAME =>
'ufdr,010066,1305873715825.9361f58931a310a62c15f501ce3261b6.', STARTKEY =>
'010066', ENDKEY => '010068', ENCODED => 9361f58931a310a62c15f501ce3261b6,
TABLE => {{NAME => 'ufdr', FAMILIES => [{NAME => 'value', BLOOMFILTER =>
'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'GZ', TTL =>
'432000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
2011-05-20 15:48:02,879 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region: REGION => {NAME =>
'ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.', STARTKEY =>
'001570', ENDKEY => '001572', ENCODED => d7555a12586e6c788ca55017224b5a51,
TABLE => {{NAME => 'ufdr', FAMILIES => [{NAME => 'value', BLOOMFILTER =>
'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'GZ', TTL =>
'432000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
{noformat}
{noformat}
2011-05-20 15:49:58,134 ERROR
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of
region=ufdr,010066,1305873715825.9361f58931a310a62c15f501ce3261b6.
java.io.IOException: Exception occured while connecting to the server
at
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
at
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
at
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
at $Proxy6.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
at
org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
{noformat}
{noformat}
RS1 logs (Failed to open here)
==============================
2011-05-20 17:00:37,753 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
transition from OFFLINE to OPENING for region=9361f58931a310a62c15f501ce3261b6
2011-05-20 17:00:37,753 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
hijacked? It no longer exists, encodedName=9361f58931a310a62c15f501ce3261b6
{noformat}
{noformat}
RS2 logs (Failed to open here)
=============================
2011-05-20 16:54:41,385 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
transition from OFFLINE to OPENING for region=9361f58931a310a62c15f501ce3261b6
2011-05-20 16:54:41,385 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
hijacked? It no longer exists, encodedName=9361f58931a310a62c15f501ce3261b6
{noformat}
{noformat}
RS3 logs (Failed to open here)
==============================
2011-05-20 16:45:29,477 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:45:29,477 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
{noformat}
Here the state of the znode has been changed from OFFLINE to OPENING.
But before the open operation failed, the TimeoutMonitor detected the timeout
and treated the region as PENDING_OPEN, even though the znode had already
moved to RS_ZK_REGION_OPENING.
Maybe if it had found the region to be OPENING it would have moved the znode
to the OFFLINE state.
Since the master does not preempt to OFFLINE and the master's in-memory state
stays PENDING_OPEN (there is no one to remove the RIT, so the TimeoutMonitor
keeps detecting it as PENDING_OPEN), every time this region is considered to
be hijacked and no one processes it, even if the open request goes to the
same RS.
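To illustrate why every retry dies the same way, here is a conceptual sketch of
the RS-side check (simplified, with illustrative names; it is not the actual
OpenRegionHandler code, and the -1 failure convention is an assumption):
{noformat}
// Sketch only. The RS may only move the unassigned znode from OFFLINE to
// OPENING. Because the znode is already sitting in RS_ZK_REGION_OPENING
// from the earlier failed open, this transition fails on every attempt.
int version = ZKAssign.transitionNodeOpening(server.getZooKeeper(),
    regionInfo, server.getServerName());            // assumed to return -1 on failure
if (version == -1) {
  LOG.warn("Failed transition from OFFLINE to OPENING for region=" + encodedName);
  LOG.warn("Region was hijacked? It no longer exists, encodedName=" + encodedName);
  return;   // nothing resets the znode, so the next assignment attempt ends up here again
}
{noformat}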
I just want to reiterate the comments given by JD in the defect HBASE-3937.
{noformat}
It should not create a list of unassigns and assigns, since by the time the
list is processed the situation probably changed (I witnessed that a lot).
This means the action should be taken as we go through first loop.'
{noformat}
This is what our patch does. No batch processing is done; we try to take
action as and when we detect that a timeout has occurred.
But again this alone may not be foolproof, because there is a chance, as in
the case above, that the timeout detects the region as PENDING_OPEN, but
since we do not move the znode to OFFLINE (in the master's memory the state
is PENDING_OPEN), the RS will say the region is hijacked as it cannot
transition it from OFFLINE to OPENING, and hence the problem prevails.
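A hedged sketch of that per-region handling, with a comment marking where the
residual hole sits (all names and helpers are illustrative, not the patch code):
{noformat}
// Sketch only: act on each timed-out region inside the scan itself rather
// than collecting a batch and replaying it after the situation has changed.
for (RegionState state : regionsInTransition.values()) {
  if (!state.isPendingOpen() || !hasTimedOut(state, now, timeoutMillis)) {  // hypothetical helper
    continue;
  }
  // Residual hole: the master's in-memory state is PENDING_OPEN, but the
  // znode may already be RS_ZK_REGION_OPENING. Since the znode is not moved
  // to OFFLINE here, the next open attempt fails the OFFLINE -> OPENING
  // transition on the RS and the region is again reported as hijacked.
  assign(state.getRegion(), false, plannedServerFor(state));               // hypothetical helper
}
{noformat}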
Coming back to JD's comments again
{noformat}
One of the major issues is the lack of atomicity, so any action taken should
first check the current state, keep the version number, decide of the
corrective measure and update the znode by expecting the version it first got.
{noformat}
Now comes the issue of how to know the exact state the znode is currently in
and what action to take.
If we move it to OFFLINE again there may be the problem of double assignment,
so we need to manage with some versioning as JD said.
Here, instead of versions, we opted for a new state; the reasons are as
follows:
{noformat}
// Initialize the znode version.
this.version =
ZKAssign.transitionNodeOpening(server.getZooKeeper(),
regionInfo, server.getServerName());
{noformat}
-> The RS doesn't have any previous version history here, so comparing the
new OFFLINE state with the previous OFFLINE state may be tricky (needs some
tweaking).
-> Introducing an intermediate state here would bring more clarity to the
code and the system.
That's why we planned to introduce RE_ALLOCATE. Adding the server name is an
additional precaution.
{noformat}
If the updating of the znode is successful, we know for sure that the operation
will be seen by the region servers.
{noformat}
So now what we do is detect the timeout and try moving the state of the
znode to RE_ALLOCATE.
If that really succeeds, all the RSs will know that an update has happened.
Now another RS (or the same RS) has the chance to operate on this new state
and will not report the region as hijacked.
{noformat}
If it's not successful, the situation needs to be reassessed.
{noformat}
If changing the state to RE_ALLOCATE is not successful, then what? Now the
master is aware that the RS has operated on the region and changed it to
another state, maybe OPENING or OPENED.
As we cannot move the state to OFFLINE in the znode, we are forced to have
some mechanism between the RS and the master to handle this problem. Hence
the new state RE_ALLOCATE came into the picture.
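To spell out the intended check-then-update, here is a hedged sketch of the
master-side step using plain ZooKeeper versioned updates (RE_ALLOCATE is the
new state proposed here; the method name, path argument, and data encoding are
illustrative assumptions, not the actual patch code):
{noformat}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch only: zk, path, and the RE_ALLOCATE payload are illustrative.
boolean tryMoveToReAllocate(ZooKeeper zk, String path, byte[] reAllocateData)
    throws KeeperException, InterruptedException {
  Stat stat = new Stat();
  zk.getData(path, false, stat);                 // read current state, remember its version
  try {
    // Conditional update: succeeds only if nobody changed the znode since we read it.
    zk.setData(path, reAllocateData, stat.getVersion());
    return true;   // any RS (the same one or another) now sees RE_ALLOCATE, not "hijacked"
  } catch (KeeperException.BadVersionException e) {
    // The RS got there first (e.g. moved the znode to OPENING/OPENED); the
    // master must reassess instead of forcing the znode to OFFLINE.
    return false;
  }
}
{noformat}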
Thus our current implementation not only addresses the time lag but also
maintains clear atomicity.
Stack, am I clear in my explanation?
Actually, before proposing the solution we went through JD's comments,
analysed the logs, and wanted to take care of all of JD's comments; in fact
we reproduced all the problems.
> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
> Key: HBASE-4015
> URL: https://issues.apache.org/jira/browse/HBASE-4015
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 0.90.3
> Reporter: Jean-Daniel Cryans
> Assignee: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: HBASE-4015_1_trunk.patch, Timeoutmonitor with state
> diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition
> generator, mostly making things worse rather than better. It does its own
> thing for a while without caring for what's happening in the rest of the
> master.
> The first thing that needs to happen is that the regions should not be
> processed in one big batch, because that sometimes can take minutes to
> process (meanwhile a region that timed out opening might have opened, then
> what happens is it will be reassigned by the TimeoutMonitor generating the
> never ending PENDING_OPEN situation).
> Those operations should also be done more atomically, although I'm not sure
> how to do it in a scalable way in this case.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira