[
https://issues.apache.org/jira/browse/HBASE-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990793#comment-12990793
]
stack commented on HBASE-3368:
------------------------------
Talking w/ Jon, we should look into creating a new SPLITTING state, only it's the
RS that sets this state rather than the Master that is orchestrating state in ZK.
We'd need to deal with Master doing an unassign while a split was going on
(We'd need to be able to reject a close). There are probably other edge cases
to consider; e.g. when a region is in RIT, the balancer won't run. Upsides are that
this seems to fit naturally under regions-in-transition umbrella (though the
dir up in zk is called 'unassigned' -- we should change that).
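
To make the "reject a close" idea concrete, here is a minimal sketch, not HBase code. It assumes a single in-memory map of region states; the class `RegionStateSketch` and the methods `startSplit` and `tryUnassign` are hypothetical names made up for illustration. The RS claims SPLITTING with a compare-and-set, and the Master's unassign only succeeds if the region is still OPEN.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; these names are not part of HBase. */
class RegionStateSketch {
    enum State { OPEN, SPLITTING, PENDING_CLOSE }

    private final Map<String, State> regions = new ConcurrentHashMap<>();

    /** Record that a region has been opened. */
    void opened(String region) {
        regions.put(region, State.OPEN);
    }

    /** RS side: claim SPLITTING, but only if the region is currently OPEN. */
    boolean startSplit(String region) {
        return regions.replace(region, State.OPEN, State.SPLITTING);
    }

    /** Master side: the close is rejected unless the region is still OPEN. */
    boolean tryUnassign(String region) {
        return regions.replace(region, State.OPEN, State.PENDING_CLOSE);
    }

    State get(String region) {
        return regions.get(region);
    }
}
```

Because both transitions are atomic replaces from OPEN, whichever side gets there first wins and the other sees a rejection rather than a half-applied state.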
Talking more, the way I'm going could have issues. We can't guarantee that we
won't have the same issue on occasion; e.g. the open state is handled by an
executor and in frantic times, executors may be backed up... whereas handling
of the split would be done up in the zk callback. This would seem to indicate
that split handling too should be done in an executor -- on both sides for the
transaction.
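
A rough sketch of why routing both events through an executor helps, under a strong simplifying assumption: one single-threaded queue shared by the open and split handling (the real executors are pooled, so this shows only the ordering property). `OrderedEventsSketch` and the event names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch only; not how HBase wires its executors. */
class OrderedEventsSketch {
    /** Returns the order in which the two events were handled. */
    static List<String> run() {
        List<String> handled = Collections.synchronizedList(new ArrayList<>());
        // Assumption: a single worker thread, so tasks run in submission order.
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // The open handling is slow (a backed-up executor), but was queued first.
        executor.execute(() -> {
            pause(50);
            handled.add("OPENED");
        });
        // The split is queued behind it instead of being handled inline
        // in the callback, so it cannot overtake the open.
        executor.execute(() -> handled.add("SPLIT"));

        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If the split were instead handled directly on the callback thread while the open sat in the queue, the split could be recorded first, which is the ordering problem described above.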
So, some exploration, and even then the patch is starting to look big.
> Split message can come in before region opened message; results in 'Region
> has been PENDING_CLOSE for too long' cycle
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-3368
> URL: https://issues.apache.org/jira/browse/HBASE-3368
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 0.92.0
>
> Attachments: 3368-v2.txt, 3368.txt
>
>
> Another good one. Look at these excerpts from master log:
> {code}
> 2010-12-16 00:49:45,749 INFO org.apache.hadoop.hbase.master.ServerManager:
> Received REGION_SPLIT:
> TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b.:
> Daughters;
> TestTable,0078922610,1292460584999.c8b95dfc9a671083bafdaa0341279777.,
> TestTable,0078933586,
> 1292460584999.7cc636c9a7274eec4e784df2efebbca3. from
> XXX185,60020,1292460570976
> ....
> 2010-12-16 00:49:46,132 DEBUG
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region
> TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b. on
> XXX185,60020,1292460570976
> {code}
> ... so the split will have cleared the parent from in-memory data structures
> and then the open handler will add them back (though region is offlined,
> split).
> Then the balancer runs... only no one is holding the region that's being
> balanced.
> Over on XXX185 I see the open and then split at these times:
> {code}
> 2010-12-16 00:49:43,740 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened
> TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b.
> .....
> 2010-12-16 00:49:45,003 INFO
> org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of
> region TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b.
> {code}
> So, the fact that it takes the Master a while to get around to the zk watcher
> processing messes us up.
> Root problem is that we're using two different message buses, zk and then
> heartbeat. Intent is to do all over zk and remove heartbeat but looking at
> what to do for 0.90.0.