[ https://issues.apache.org/jira/browse/HBASE-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612446#comment-14612446 ]
stack commented on HBASE-14012:
-------------------------------

Here is a bit of log:
{code}
2015-06-09 20:06:20,270 INFO [c2020:16000.activeMasterManager] master.ServerManager: AssignmentManager hasn't finished failover cleanup; waiting
2015-06-09 20:06:20,272 INFO [c2020:16000.activeMasterManager] master.HMaster: hbase:meta with replicaId 0 assigned=0, rit=false, location=c2025.halxg.cloudera.com,16020,1433892619022
2015-06-09 20:06:20,295 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/c607a47967fd4873135f38e883156e4d/big
2015-06-09 20:06:20,295 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/1a5a90047a76da6dddebb5aff0acb275/big
2015-06-09 20:06:20,342 DEBUG [hconnection-0x680c3bc0-shared--pool3-t1] ipc.RpcClientImpl: Use SIMPLE authentication for service ClientService, sasl=false
2015-06-09 20:06:20,342 DEBUG [hconnection-0x680c3bc0-shared--pool3-t1] ipc.RpcClientImpl: Connecting to c2025.halxg.cloudera.com/10.20.84.31:16020
2015-06-09 20:06:20,376 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/c607a47967fd4873135f38e883156e4d/tiny
2015-06-09 20:06:20,379 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/1a5a90047a76da6dddebb5aff0acb275/tiny
2015-06-09 20:06:20,383 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/d586e9037f683384411ab2663e31f97b/big
2015-06-09 20:06:20,383 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/ce4ebb9a375a1fe4b5777d2d960c940c/big
2015-06-09 20:06:20,420 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/d586e9037f683384411ab2663e31f97b/tiny
2015-06-09 20:06:20,421 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/6dc837d3ec4e2afd05314472ee17ca80/big
2015-06-09 20:06:20,422 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/ce4ebb9a375a1fe4b5777d2d960c940c/tiny
2015-06-09 20:06:20,423 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/6fbe22ff15c2e5f2b207f79eaf8f382a/big
2015-06-09 20:06:20,453 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/6fbe22ff15c2e5f2b207f79eaf8f382a/tiny
...
2015-06-09 20:06:20,795 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem: No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/0983b02ec079ea8ac2fb2901dbe2a6fb/tiny
2015-06-09 20:06:20,797 INFO [ProcedureExecutorThread-4] master.AssignmentManager: Bulk assigning 9 region(s) across 5 server(s), round-robin=true
....
2015-06-09 20:06:20,909 INFO [c2020:16000.activeMasterManager] master.AssignmentManager: Found regions out on cluster or in RIT; presuming failover
{code}
It's the bulk assign at the end there, run from a ProcedureExecutorThread, that is assigning regions already out on the cluster.
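In the log above, a ProcedureExecutorThread issues the bulk assign while activeMasterManager is still waiting on failover cleanup. One way to picture the "don't run until master is fully up" idea is a guard that is evaluated before every step of a resumed procedure, not only the first. The following is a minimal, self-contained sketch under that assumption; FailoverGuardSketch, CrashProcedure, Step, Outcome and executeOneStep are all hypothetical names for illustration and are not HBase classes or the actual fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: none of these types exist in HBase.
final class FailoverGuardSketch {

    // Illustrative step names only; the real procedure's states may differ.
    enum Step { START, SPLIT_LOGS, ASSIGN, FINISH }

    enum Outcome { YIELDED, ADVANCED, DONE }

    /** Stand-in for a multi-step procedure reloaded from a store at some step. */
    static final class CrashProcedure {
        private Step step;
        private boolean finished;
        private final List<Step> executed = new ArrayList<>();

        CrashProcedure(Step resumeAt) {
            this.step = resumeAt;
        }

        /**
         * Runs at most one step. The guard is checked on every call, so a
         * procedure reloaded at ASSIGN yields just as one at START would.
         */
        Outcome executeOneStep(BooleanSupplier failoverCleanupDone) {
            if (finished) {
                return Outcome.DONE;
            }
            if (!failoverCleanupDone.getAsBoolean()) {
                return Outcome.YIELDED; // master not fully up: do nothing
            }
            executed.add(step);
            if (step == Step.FINISH) {
                finished = true;
                return Outcome.DONE;
            }
            step = Step.values()[step.ordinal() + 1];
            return Outcome.ADVANCED;
        }

        Step currentStep() {
            return step;
        }

        List<Step> executedSteps() {
            return executed;
        }
    }

    public static void main(String[] args) {
        boolean[] cleanupDone = {false};
        // Simulate a procedure persisted at the assign step before the crash.
        CrashProcedure proc = new CrashProcedure(Step.ASSIGN);

        System.out.println(proc.executeOneStep(() -> cleanupDone[0]));
        cleanupDone[0] = true; // failover cleanup completes
        System.out.println(proc.executeOneStep(() -> cleanupDone[0]));
        System.out.println(proc.executedSteps());
    }
}
```

With the guard inside executeOneStep, the resume point no longer matters: the first call yields without touching ASSIGN, and the step only runs once the supplier reports cleanup done. Whether the regions persisted before the crash still make sense afterwards is a separate question that this sketch does not address.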

> Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-14012
>                 URL: https://issues.apache.org/jira/browse/HBASE-14012
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 2.0.0, 1.2.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>
> ITBLL. Master comes up. It is joining a running cluster (all servers up
> except Master, with most regions assigned out on cluster). ProcedureStore has
> two unfinished ServerCrashProcedures (RUNNABLE state). In SCP, we only check
> for failover in the first step, not in every step, which means
> ServerCrashProcedure will run if on reload it is beyond the first step.
> {code}
> // Is master fully online? If not, yield. No processing of servers unless master is up
> if (!services.getAssignmentManager().isFailoverCleanupDone()) {
>   throwProcedureYieldException("Waiting on master failover to complete");
> }
> {code}
> There is no definitive logging, but it looks like we start running at the
> assign step. The regions to assign were persisted before the master crash. The
> regions to assign may not make sense post-crash: i.e. here we double-assign.
> Checking. We shouldn't run until master is fully up regardless.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)