Can you come up with a test that shows the problem? Consider opening a JIRA with an anonymized master log, your test, and your proposed solution (if you have one).
Cheers

On Thu, Sep 10, 2015 at 6:09 AM, zhou_shuaif...@sina.com <zhou_shuaif...@sina.com> wrote:

> Hi all,
> I found a situation that may cause a region to stay closed forever. It
> happens often on my cluster (version 0.98.10), and 1.1.2 also has the
> problem:
> 1. The master sends a region open request to the regionserver.
> 2. The RS starts a handler to open the region.
> 3. The RS returns a response to the master.
> 4. The master does not receive the response, or times out, and sends
>    the open region request again.
> 5. The RS has already opened the region.
> 6. The master calls processAlreadyOpenedRegion and updates the region
>    state to open in master memory.
> 7. The master receives the ZK "region opened" message (late for some
>    reason, e.g. the network), which triggers an update of the region
>    state to open, but finds that the region is already opened: ERROR!
> 8. The master sends a close region request, and the region stays closed
>    forever.
>
> Maybe a solution is to change processAlreadyOpenedRegion in class
> AssignmentManager:
>
> private void processAlreadyOpenedRegion(HRegionInfo region, ServerName sn) {
>   // Remove region from in-memory transition and unassigned node from ZK
>   // While trying to enable the table the regions of the table were
>   // already enabled.
>   LOG.debug("ALREADY_OPENED " + region.getRegionNameAsString()
>     + " to " + sn);
>   String encodedName = region.getEncodedName();
>
>   /**
>    * check region state in zk, if already opened, return; leave the
>    * regionStates work to zkStatus change to trigger.
>    **/
>
>   deleteNodeInStates(encodedName, "offline", sn,
>     EventType.M_ZK_REGION_OFFLINE);
>   regionStates.regionOnline(region, sn);
> }
> ------------------------------
> zhou_shuaif...@sina.com
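
As a possible starting point for such a test, the reported sequence can first be modelled without any HBase dependencies. The sketch below is a toy state machine, not HBase code: DuplicateOpenRace, ToyMaster, the State enum, the zkOpenedNodes set, and the checkZkFirst flag are all hypothetical names. The branch that closes a region on an unexpected "opened" event only mirrors the behaviour described in the quoted report; it has not been checked against the AssignmentManager source.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the reported race. NOT HBase code; every name here is hypothetical. */
public class DuplicateOpenRace {
    public enum State { PENDING_OPEN, OPEN, CLOSED }

    /** Minimal stand-in for the master's in-memory region states. */
    static class ToyMaster {
        final Map<String, State> states = new HashMap<>();
        // Stand-in for ZK: regions whose unassigned node is already marked opened.
        final Set<String> zkOpenedNodes = new HashSet<>();
        final boolean checkZkFirst;

        ToyMaster(boolean checkZkFirst) {
            this.checkZkFirst = checkZkFirst;
        }

        /** Master sends (or re-sends) an open request. */
        void sendOpen(String region) {
            states.put(region, State.PENDING_OPEN);
        }

        /** The retried open request reports that the region is already open. */
        void onAlreadyOpened(String region) {
            if (checkZkFirst && zkOpenedNodes.contains(region)) {
                // Proposed change: leave the state update to the pending ZK event.
                return;
            }
            states.put(region, State.OPEN);
        }

        /** The delayed ZK "region opened" event finally reaches the master. */
        void onZkOpened(String region) {
            if (states.get(region) == State.OPEN) {
                // Assumed from the report: the unexpected transition makes the
                // master close the region, and nothing reopens it afterwards.
                states.put(region, State.CLOSED);
            } else {
                states.put(region, State.OPEN);
            }
        }
    }

    /** Replays the reported sequence and returns the final state of the region. */
    public static State run(boolean checkZkFirst) {
        ToyMaster master = new ToyMaster(checkZkFirst);
        master.sendOpen("r1");
        master.zkOpenedNodes.add("r1"); // RS opened the region; ZK event is delayed
        master.sendOpen("r1");          // master timed out and retries
        master.onAlreadyOpened("r1");
        master.onZkOpened("r1");        // late ZK event arrives
        return master.states.get("r1");
    }

    public static void main(String[] args) {
        System.out.println("current behaviour:  " + run(false)); // CLOSED
        System.out.println("proposed behaviour: " + run(true));  // OPEN
    }
}
```

A real reproduction would still need to drive AssignmentManager itself, for example in a mini-cluster test that delays the ZK event; the model only pins down the event ordering that such a test has to force.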