[ 
https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183205#comment-13183205
 ] 

ramkrishna.s.vasudevan commented on HBASE-5155:
-----------------------------------------------

@Stack
After analysing the code I found one thing. Preventing SSH from running in 
parallel with DisableTableHandler and DeleteTableHandler may be a bigger 
discussion, but the problem above can be solved. 
In SSH:
{code}
  public static boolean processDeadRegion(HRegionInfo hri, Result result,
      AssignmentManager assignmentManager, CatalogTracker catalogTracker)
  throws IOException {
    // If table is not disabled but the region is offlined,
    boolean disabled = assignmentManager.getZKTable().isDisabledTable(
        hri.getTableDesc().getNameAsString());
{code}
we check whether the table is disabled.  But if you look at the logs above, the 
DeleteTableHandler has already deleted the region and also removed the table's 
entry from the ZKTable cache:
{code}
am.getZKTable().setEnabledTable(Bytes.toString(tableName));
{code}
Currently setEnabledTable means removing the entry from the map.  So we cannot 
differentiate between an enabled table and a deleted table, because in both 
cases the entry is removed from the cache map.

So can we use the currently unused TableState.ENABLED in the enable table 
handler, so that only the delete table handler removes the entry?
Then in SSH.processDeadRegion() we can first check whether the table is present 
in the map before proceeding. If it is not present, we know that the table has 
already been deleted.  
Please give your opinion.
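
A minimal sketch of the idea, not the actual HBase code. TableStateCache is a hypothetical stand-in for the in-memory map inside ZKTable, and setDeletedTable, isTablePresent and shouldProcessDeadRegion are names invented here for illustration. It assumes every live table has an entry in the map (ENABLED or DISABLED), so a missing entry can only mean the table was deleted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for ZKTable's in-memory cache; not the real HBase class. */
class TableStateCache {
    enum TableState { ENABLED, DISABLED }

    private final Map<String, TableState> cache = new ConcurrentHashMap<>();

    /** Enable handler: record ENABLED explicitly instead of removing the entry. */
    void setEnabledTable(String tableName) {
        cache.put(tableName, TableState.ENABLED);
    }

    /** Disable handler: record DISABLED. */
    void setDisabledTable(String tableName) {
        cache.put(tableName, TableState.DISABLED);
    }

    /** Delete handler: the only caller that removes the entry (hypothetical name). */
    void setDeletedTable(String tableName) {
        cache.remove(tableName);
    }

    boolean isDisabledTable(String tableName) {
        return cache.get(tableName) == TableState.DISABLED;
    }

    /** Under the assumption above, an absent entry means the table was deleted. */
    boolean isTablePresent(String tableName) {
        return cache.containsKey(tableName);
    }

    /**
     * Simplified guard for the start of processDeadRegion (hypothetical helper):
     * skip regions of deleted tables first, then skip regions of disabled tables.
     */
    static boolean shouldProcessDeadRegion(TableStateCache zkTable, String tableName) {
        if (!zkTable.isTablePresent(tableName)) {
            return false; // table already deleted: do not fix up or reassign
        }
        return !zkTable.isDisabledTable(tableName);
    }
}
```

With this shape, a region whose table was deleted while SSH was still working from its earlier META scan would be skipped before fixupDaughters or assign is reached. One open point is that the map would also have to be populated with ENABLED for tables that already exist when the master starts, otherwise they would look deleted.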
                
> ServerShutDownHandler And Disable/Delete should not happen parallely leading 
> to recreation of regions that were deleted
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5155
>                 URL: https://issues.apache.org/jira/browse/HBASE-5155
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Priority: Blocker
>
> ServerShutdownHandler and the disable/delete table handlers race.  This is 
> not an issue due to TM.
> -> A regionserver goes down.  In our cluster the regionserver holds a lot of 
> regions.
> -> A region R1 has two daughters, D1 and D2.
> -> The ServerShutdownHandler gets called, scans META and gets all the 
> user regions.
> -> In parallel, a table is disabled. (No problem in this step.)
> -> Delete table is done.
> -> The table and its regions are deleted, including R1, D1 and D2. (So META 
> is cleaned.)
> -> Now ServerShutdownHandler starts processDeadRegion:
> {code}
>  if (hri.isOffline() && hri.isSplit()) {
>       LOG.debug("Offlined and split region " + hri.getRegionNameAsString() +
>         "; checking daughter presence");
>       fixupDaughters(result, assignmentManager, catalogTracker);
> {code}
> As part of fixupDaughters, since the daughters D1 and D2 are missing for R1, 
> {code}
>     if (isDaughterMissing(catalogTracker, daughter)) {
>       LOG.info("Fixup; missing daughter " + daughter.getRegionNameAsString());
>       MetaEditor.addDaughter(catalogTracker, daughter, null);
>       // TODO: Log WARN if the regiondir does not exist in the fs.  If its not
>       // there then something wonky about the split -- things will keep going
>       // but could be missing references to parent region.
>       // And assign it.
>       assignmentManager.assign(daughter, true);
> {code}
> we call assign on the daughters.  
> After this we continue with the code below.
> {code}
>         if (processDeadRegion(e.getKey(), e.getValue(),
>             this.services.getAssignmentManager(),
>             this.server.getCatalogTracker())) {
>           this.services.getAssignmentManager().assign(e.getKey(), true);
> {code}
> When the SSH scanned META, it had R1, D1 and D2.
> So as part of the above code, D1 and D2, which were already assigned by 
> fixupDaughters, are assigned again by 
> {code}
> this.services.getAssignmentManager().assign(e.getKey(), true);
> {code}
> This leads to a ZooKeeper bad-version error that kills the master.
> The important part here is that the regions that were deleted are recreated, 
> which I think is more critical.
