[
https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090021#comment-13090021
]
stack commented on HBASE-4015:
------------------------------
@Ted In most other state transitions, we have the znode version to hand (we
previously read the znode state). Its this one operation where we are
transferring region ownership from master to regionserver that has the hole. A
new state "seems" to be a patch on the actual problem but I could be convinced
otherwise (Looking at Ram's diagram the new state seemed cleaner but I had
sneaking suspicion that the new state introduces a bunch of new conditions not
dealt with in the diagram).
On the patch (Thanks for digging in here Ram):
Why the special handling of meta regions here:
{code}
- regionsInTransition.put(encodedRegionName, new RegionState(
- regionInfo, RegionState.State.OPENING,
- data.getStamp(), data.getOrigin()));
+ if (regionInfo.isMetaRegion() || regionInfo.isRootRegion()) {
+ regionsInTransition.put(encodedRegionName, new
RegionState(regionInfo,
+ RegionState.State.OPENING, data.getStamp(), data.getOrigin()));
+ processOpeningState(regionInfo);
+ break;
+ }
+ regionsInTransition.put(encodedRegionName, new RegionState(regionInfo,
+ RegionState.State.OPENING, data.getStamp(), data.getOrigin()));
{code}
FYI, IIRC, regionInfo.isMetaRegion() is true if .META. or -ROOT-
This is ok, not setting a watcher here?
{code}
+ RegionTransitionData dataInZNode = ZKAssign.getDataNoWatch(watcher, node,
+ stat);
{code}
Here you might want to add logging whether or not its a reallocate:
{code}
- LOG.info("Table " + tableName + (disabled? " disabled;": " disabling;") +
- " skipping assign of " + region.getRegionNameAsString());
+ LOG.info("Table " + tableName + (disabled ? " disabled;" : " disabling;")
+ + " skipping assign of " + region.getRegionNameAsString());
{code}
FYI, I prefer the previous style where the '+' is at the end of the line; it
flags reader that there is more to come.... (IMO).
What is happening here? Who will have set realloc true? The timeout monitor?
So who then will set it to OFFLINE?
{code}
- LOG.debug("Forcing OFFLINE; was=" + state);
- state.update(RegionState.State.OFFLINE);
+ // If invoked from timeout monitor donot force it to OFFLINE. Based on
the
+ // state
+ // we will decide if to change in-memory state to OFFLINE or not.
+ if (!isReAllocate) {
+ LOG.debug("Forcing OFFLINE; was=" + state);
+ state.update(RegionState.State.OFFLINE);
+ }
{code}
When would this happen:
{code}
+ if (setOfflineInZK && versionOfOfflineNode == -1)
+ return;
{code}
i.e... it would come back w/ -1?
Style: If multiple lines, it needs curly brackets (Would suggest leaving stuff
like this alone)
{code}
- if (plan == null) return; // Should get reassigned later when RIT times
out.
+ if (plan == null)
+ return; // Should get reassigned later when RIT times out.
{code}
At the other extreme, its fine to fix stuff like this where its w/o spaces
around the '+' operator:
{code}
- if(LOG.isDebugEnabled()){
- LOG.debug("The unassigned node "+encodedRegionName+" doesnot
exist.");
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("The unassigned node " + encodedRegionName
+ + " doesnot exist.");
{code}
(though you could separate the 'doesnot' into 'does not'
This kinda change just bloats your patch (making Ted think it does more than it
does -- smile):
{code}
- LOG.warn("Failed assignment of " +
- state.getRegion().getRegionNameAsString() + " to " +
- plan.getDestination() + ", trying to assign elsewhere instead; " +
- "retry=" + i, t);
+ LOG.warn("Failed assignment of "
+ + state.getRegion().getRegionNameAsString() + " to "
+ + plan.getDestination() + ", trying to assign elsewhere instead; "
+ + "retry=" + i, t);
{code}
ditto
{code}
- LOG.warn("Unable to find a viable location to assign region " +
- state.getRegion().getRegionNameAsString());
+ LOG.warn("Unable to find a viable location to assign region "
+ + state.getRegion().getRegionNameAsString());
{code}
Does the param 'isReAllocate' mean method was called from timeout monitor? If
so, it probably warrants a different name?
What is going to happen if we return -1 because of below?
{code}
+ if (versionOfOfflineNode == -1) {
+ LOG.warn("Attempted to create/force node into OFFLINE state before "
+ + "completing assignment but failed to do so for " + state);
+ return -1;
}
{code}
+1 on refactor of timeout breaking out the method actOnTimeout.
Did you change anything in way timeout monitor works?
This is interesting:
{code}
+ public static enum TimeOutOperationType {
+ ASSIGN, UNASSIGN;
+ }
{code}
We have timeout types. How does this work? Is this an in-memory master state
only? What happens if master crashes? Will new master need to know these two
new states?
Why are we passing AssignmentManager a threadPoolExecutorService? Its used by
timeout monitor? Should creation of the executor service just live in the
timeout monitor and be constructed there rather than up in master and passed
in? Or is it that you need to be able to shut it down when master goes down?
Add a comment on TimeOutManagerCallable class saying what its for.
This is great:
{code}
+ if (!transitionZookeeperOfflineToOpening(encodedName,
+ versionOfOfflineNode)) {
{code}
... over on the RS.
Is this code copy paste from the method above it? The version that does not
take a znode version?
{code}
+ public RegionOpeningState openRegion(HRegionInfo region, int expectedVersion)
{code}
If so, we should fix that.
This patch needs closer review and I'd like to know how this executor service
in timeout monitor solve races we've seen (it move operations out of line w/
processing of the actual timeout event?) but I'm thinking this patch looks
good; just what the doctor ordered, at least around passing of the version.
What are your reservations Ram? How hard would it be to add tests?
Good stuff.
> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
> Key: HBASE-4015
> URL: https://issues.apache.org/jira/browse/HBASE-4015
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 0.90.3
> Reporter: Jean-Daniel Cryans
> Assignee: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: HBASE-4015_1_trunk.patch, HBASE-4015_2_trunk.patch,
> Timeoutmonitor with state diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition
> generator, mostly making things worse rather than better. It does it's own
> thing for a while without caring for what's happening in the rest of the
> master.
> The first thing that needs to happen is that the regions should not be
> processed in one big batch, because that sometimes can take minutes to
> process (meanwhile a region that timed out opening might have opened, then
> what happens is it will be reassigned by the TimeoutMonitor generating the
> never ending PENDING_OPEN situation).
> Those operations should also be done more atomically, although I'm not sure
> how to do it in a scalable way in this case.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira