[ https://issues.apache.org/jira/browse/HBASE-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034374#comment-17034374 ]
Hudson commented on HBASE-23693:
--------------------------------

Results for branch branch-1
	[build #1228 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/1228/]: (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/1228//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/1228//JDK7_Nightly_Build_Report/]

(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/1228//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Split failure may cause region hole and data loss when use zk assign
> --------------------------------------------------------------------
>
>                 Key: HBASE-23693
>                 URL: https://issues.apache.org/jira/browse/HBASE-23693
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.8
>            Reporter: tianhang tang
>            Assignee: tianhang tang
>            Priority: Critical
>             Fix For: 1.5.1
>
>         Attachments: HBASE-23693.branch-1.001.patch
>
>
> To reproduce this case, I added a sleep to SplitTransactionImpl.execute after
> the PONR and before openDaughters:
> {code:java}
> public PairOfSameType<Region> execute(final Server server,
>     final RegionServerServices services, User user) throws IOException {
>   this.server = server;
>   this.rsServices = services;
>   useZKForAssignment = server == null ? true :
>       ConfigUtil.useZKForAssignment(server.getConfiguration());
>   if (useCoordinatedStateManager(server)) {
>     std =
>         ((BaseCoordinatedStateManager) server.getCoordinatedStateManager())
>             .getSplitTransactionCoordination().getDefaultDetails();
>   }
>   PairOfSameType<Region> regions = createDaughters(server, services, user);
>   if (this.parent.getCoprocessorHost() != null) {
>     if (user == null) {
>       parent.getCoprocessorHost().preSplitAfterPONR();
>     } else {
>       try {
>         user.getUGI().doAs(new PrivilegedExceptionAction<Void>() {
>           @Override
>           public Void run() throws Exception {
>             parent.getCoprocessorHost().preSplitAfterPONR();
>             return null;
>           }
>         });
>       } catch (InterruptedException ie) {
>         InterruptedIOException iioe = new InterruptedIOException();
>         iioe.initCause(ie);
>         throw iioe;
>       }
>     }
>   }
>
>   // Injected for reproduction only: hang for one hour after the PONR,
>   // before the daughter regions are opened.
>   try {
>     Thread.sleep(1000 * 60 * 60);
>   } catch (InterruptedException e) {
>     e.printStackTrace();
>   }
>   regions = stepsAfterPONR(server, services, regions, user);
>   transition(SplitTransactionPhase.COMPLETED);
>   return regions;
> }
> {code}
> With this change the split transaction hangs after the PONR.
> I then reproduced the problem as follows:
> 1. Create a test table and move it into a test rsgroup that contains only
> one RS.
> 2. Trigger a region split.
> 3. The split transaction passes the PONR and then sleeps; the region info in
> meta has already been updated.
> 4. Kill the RS process to simulate a machine crash.
> 5. ServerCrashProcedure cleans up the SPLITTING_NEW regions, so the daughter
> regions are deleted.
> 6. ServerCrashProcedure tries to assign the parent region. Because the RS is
> down, the assignment fails, and the region state is set to FAILED_OPEN and
> put back into regionsInTransition. At this point, because of the RS crash,
> the region's node under the ZK region-in-transition path no longer exists.
> 7. The CatalogJanitor thread is blocked because of the RIT.
> 8. Switch the active master.
> 9. The CatalogJanitor thread on the new master runs normally and cleans up
> the parent region, because split = true && offline = true in the meta table.
> 10. The test table now has a hole and data is lost.
>
> I modified ServerCrashProcedure so that when it cleans up the daughter
> regions it also updates the parent region info in the meta table. With that
> change the problem no longer reproduces.
> I will upload the patch later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
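The interaction between steps 5 and 9 of the quoted description, and the effect of the described fix, can be sketched with a toy model. This is a minimal illustration, not HBase code: `SplitHoleSketch`, `RegionRow`, and every method below are hypothetical names, the map stands in for the meta table, and the janitor check models only the `split = true && offline = true` condition quoted in step 9 (the real CatalogJanitor performs additional checks). It also assumes that "update the parent regioninfo" means clearing the parent's split and offline flags, which the description does not state explicitly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only; none of these types exist in HBase.
final class SplitHoleSketch {

    // Hypothetical, simplified stand-in for a region row in the meta table.
    static final class RegionRow {
        final String name;
        boolean split;
        boolean offline;

        RegionRow(String name) {
            this.name = name;
        }
    }

    // Meta state after the PONR (step 3): the parent is marked split and
    // offline, and both daughter rows have been written.
    static Map<String, RegionRow> metaAfterPonr() {
        Map<String, RegionRow> meta = new LinkedHashMap<>();
        RegionRow parent = new RegionRow("parent");
        parent.split = true;
        parent.offline = true;
        meta.put(parent.name, parent);
        meta.put("daughterA", new RegionRow("daughterA"));
        meta.put("daughterB", new RegionRow("daughterB"));
        return meta;
    }

    // Crash cleanup (step 5): the daughter rows are deleted. With
    // resetParent set, the parent flags are cleared as well, which is
    // this sketch's assumed reading of the fix.
    static void cleanUpAfterCrash(Map<String, RegionRow> meta, boolean resetParent) {
        meta.remove("daughterA");
        meta.remove("daughterB");
        if (resetParent) {
            RegionRow parent = meta.get("parent");
            parent.split = false;
            parent.offline = false;
        }
    }

    // Janitor pass (step 9): drop every row that is both split and offline.
    static void runJanitor(Map<String, RegionRow> meta) {
        meta.values().removeIf(row -> row.split && row.offline);
    }

    // In this toy table a hole means no region covers the key range.
    static boolean hasHole(Map<String, RegionRow> meta) {
        return meta.isEmpty();
    }

    public static void main(String[] args) {
        Map<String, RegionRow> unfixed = metaAfterPonr();
        cleanUpAfterCrash(unfixed, false);
        runJanitor(unfixed);
        System.out.println("hole without parent reset: " + hasHole(unfixed));

        Map<String, RegionRow> patched = metaAfterPonr();
        cleanUpAfterCrash(patched, true);
        runJanitor(patched);
        System.out.println("hole with parent reset: " + hasHole(patched));
    }
}
```

In the first run the daughters are gone and the parent still carries both flags, so the janitor removes the last row covering the range. In the second run the parent no longer matches the janitor condition and survives.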