[
https://issues.apache.org/jira/browse/HBASE-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537874#comment-13537874
]
nkeywal commented on HBASE-7390:
--------------------------------
bq. I'm not sure I follow.
{code}
Boolean openAction = regionsInTransitionInRS.get(encodedName);
if (openAction != null) { // If this region is already in
transition
if (openAction.booleanValue()) { // and is opening
regionsInTransitionInRS.replace(encodedName, openAction,
Boolean.FALSE);
// say it's now closing
}
checkIfRegionInTransition(encodedName, CLOSE); // throw an exception
// we will never go there, we always throw an exception at the previous
line
// So if we were opening, we now have a OpenRegionHandler running but
the RIT state says closing.
// What's going to happen to the znode now?
}
// If we're there, it means there is nothing in RIT
requestCount.increment();
LOG.info("Received close region: " + region.getRegionNameAsString() +
". Version of ZK closing node:" + versionOfClosingNode +
". Destination server:" + sn);
HRegionInfo regionInfo = region.getRegionInfo();
checkIfRegionInTransition(encodedName, CLOSE); // We've just checked
0.0001 ms ago!
{code}
I looked at it more, it's actually by design: the OpenRegionHandler checks for
the content of this variable. So the behavior is: 'if we received the request
to close a region we're currently opening, stop the opening but raise an
exception saying we can't close again because we were already closing".
Issues are:
- What's going to happen with the znode? It seems it won't be changed in the
openHandler for example. So it will remain forever.
- It's a little bit racy: if we were at the end of the opening, we can get the
exception but the opening will be done.
- worse, the closeRegion call will be rejected at its very beginning, because
when we try to do the "getRegionByEncodedName", it throws a "RegionIsNotOnline"
=> In many cases, the opening will continue, as if the region is opening it
should not be in the onlineRegions list.
=> It could be an impact of the coprocessor implementation.
=> As a side node, as the closeRegion closes are nested, the coprocessors
"preClose" will be called multiple times.
=> As another side node, the way it's coded in the other closeRegion is that
we don't call the coprocessors when the region is not online
=> So it means that we don"t check the access rigts if the region is not
online, it allows to cancel an opening without the AR.
- As well, we had no test case (that's why my patch, while broken, worked).
I'm trying to solve this in the last version.
bq. + zkw.sync(encoded);
It's not absolutely necessary, but it maximizes corectness probability. It's
done in all other cases as well, so not doing it will would be bad imho, as it
would break consistency.
bq. Review comments. Some error. e.g. + * @param expectedVersion expecpted
version og the znode It looks like klingon.
Done.
bq. This below continue is right?
I've changed the implementation to remove the continue. I found another issue
while doing so. Previously, when the region was already in RIT and closing, we
were saying "OPENED" on a bulkAssign (that is: the query was open, and we were
saying 'opened' while closing the region). This seems wrong. We now say
"FAILED_OPENING" in this case.
bq. On the patch, what Sergey said about different closeRegion returns. Could
the one that throws an exception be private?
I looked at it in more details. Actually, what returns this function is never
checked. So I've renamed it and clean stuff around it.
bq. Would putIfAbsent be easier to read if it took an enum rather than a
boolean to say whether open or close
It's was to be compatible with RIT design. I removed the method. I tried to
change RIT, but its type is public through the getter, so I prefer not doing it
in a patch.
I will publish an updated version soon.
> Add extra test cases for assignement on the region server and fix the related
> issues
> ------------------------------------------------------------------------------------
>
> Key: HBASE-7390
> URL: https://issues.apache.org/jira/browse/HBASE-7390
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment, regionserver
> Affects Versions: 0.96.0
> Reporter: nkeywal
> Assignee: nkeywal
> Fix For: 0.96.0
>
> Attachments: 7390.v1.patch, 7390.v2.patch, 7390.v3.patch,
> 7390.v4.patch, assignment_zk_states.jpg
>
>
> We don't have a lot of tests on the region server itself.
> Here are some.
> Some of them are failing, feedback welcome.
> See as well the attached state diagram for the ZK nodes on assignment.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira