[ 
https://issues.apache.org/jira/browse/HBASE-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537874#comment-13537874
 ] 

nkeywal commented on HBASE-7390:
--------------------------------

bq. I'm not sure I follow.
{code}
      Boolean openAction = regionsInTransitionInRS.get(encodedName);
      if (openAction != null) {         // If this region is already in 
transition
        if (openAction.booleanValue()) { // and is opening         
          regionsInTransitionInRS.replace(encodedName, openAction, 
Boolean.FALSE);
          // say it's now closing
        }
        checkIfRegionInTransition(encodedName, CLOSE); // throw an exception
        // we will never go there, we always throw an exception at the previous 
line
        // So if we were opening, we now have a OpenRegionHandler running but 
the RIT state says closing.
        // What's going to happen to the znode now?
      }
      // If we're there, it means there is nothing in RIT    
      requestCount.increment();
      LOG.info("Received close region: " + region.getRegionNameAsString() +
        ". Version of ZK closing node:" + versionOfClosingNode +
        ". Destination server:" + sn);
      HRegionInfo regionInfo = region.getRegionInfo();         
      checkIfRegionInTransition(encodedName, CLOSE); // We've just checked 
0.0001 ms ago!
{code}

I looked at it more, it's actually by design: the OpenRegionHandler checks for 
the content of this variable. So the behavior is: 'if we received the request 
to close a region we're currently opening, stop the opening but raise an 
exception saying we can't close again because we were already closing".
Issues are:
- What's going to happen with the znode? It seems it won't be changed in the 
openHandler for example. So it will remain forever.
- It's a little bit racy: if we were at the end of the opening, we can get the 
exception but the opening will be done.
- worse, the closeRegion call will be rejected at its very beginning, because 
when we try to do the "getRegionByEncodedName", it throws a "RegionIsNotOnline"
  => In many cases, the opening will continue, as if the region is opening it 
should not be in the onlineRegions list.
  => It could be an impact of the coprocessor implementation.
  => As a side node, as the closeRegion closes are nested, the coprocessors 
"preClose" will be called multiple times.
  => As another side node, the way it's coded in the other closeRegion is that 
we don't call the coprocessors when the region is not online
  => So it means that we don"t check the access rigts if the region is not 
online, it allows to cancel an opening without the AR.
- As well, we had no test case (that's why my patch, while broken, worked).
 
I'm trying to solve this in the last version.


bq. + zkw.sync(encoded);
It's not absolutely necessary, but it maximizes corectness probability. It's 
done in all other cases as well, so not doing it will would be bad imho, as it 
would break consistency.

bq. Review comments. Some error. e.g. + * @param expectedVersion expecpted 
version og the znode It looks like klingon.
Done.

bq. This below continue is right?
I've changed the implementation to remove the continue. I found another issue 
while doing so. Previously, when the region was already in RIT and closing, we 
were saying "OPENED" on a bulkAssign (that is: the query was open, and we were 
saying 'opened' while closing the region). This seems wrong. We now say 
"FAILED_OPENING" in this case.
 

bq. On the patch, what Sergey said about different closeRegion returns. Could 
the one that throws an exception be private?
I looked at it in more details. Actually, what returns this function is never 
checked. So I've renamed it and clean stuff around it.


bq. Would putIfAbsent be easier to read if it took an enum rather than a 
boolean to say whether open or close
It's was to be compatible with RIT design. I removed the method. I tried to 
change RIT, but its type is public through the getter, so I prefer not doing it 
in a patch.


I will publish an updated version soon.
                
> Add extra test cases for assignement on the region server and fix the related 
> issues
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-7390
>                 URL: https://issues.apache.org/jira/browse/HBASE-7390
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment, regionserver
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>             Fix For: 0.96.0
>
>         Attachments: 7390.v1.patch, 7390.v2.patch, 7390.v3.patch, 
> 7390.v4.patch, assignment_zk_states.jpg
>
>
> We don't have a lot of tests on the region server itself.
> Here are some.
> Some of them are failing, feedback welcome.
> See as well the attached state diagram for the ZK nodes on assignment.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to