[ https://issues.apache.org/jira/browse/HBASE-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690838#comment-16690838 ]
Duo Zhang commented on HBASE-21490:
-----------------------------------

OK, I think I found the problem...

In ProcedureExecutor.load, we do this in the finally block:
{code}
try {
  // try to cleanup inactive wals and complete the operation
  buildHoldingCleanupTracker();
  tryCleanupLogsOnLoad();
  loading.set(false);
} finally {
  lock.unlock();
}
{code}
Also, in ProcedureExecutor.stop, we close the current log stream and persist the current storeTracker into the file.

And this is the code for loading procedures:
{code}
public static void load(Iterator<ProcedureWALFile> logs, ProcedureStoreTracker tracker,
    Loader loader) throws IOException {
  ProcedureWALFormatReader reader = new ProcedureWALFormatReader(tracker, loader);
  tracker.setKeepDeletes(true);
  try {
    // Ignore the last log which is current active log.
    while (logs.hasNext()) {
      ProcedureWALFile log = logs.next();
      log.open();
      try {
        reader.read(log);
      } finally {
        log.close();
      }
    }
    reader.finish();

    // The tracker is now updated with all the procedures read from the logs
    if (tracker.isPartial()) {
      tracker.setPartialFlag(false);
    }
    tracker.resetModified();
  } finally {
    tracker.setKeepDeletes(false);
  }
}
{code}
For HBASE-21494, we throw an exception at reader.finish, so we do not unset the partial flag and, more importantly, we do not call resetModified. This means the current storeTracker will have all the active procedures marked as modified. So after the first crash, we persist the broken storeTracker into the file. When loading the second time, we load this storeTracker, and since we open another new file, this file is no longer the last one, which means we use its modified bits when building holdingCleanupTracker. Since it contains all the active procedures, we think it is OK to delete all the files before it...
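
To make the failure mode above concrete, here is a minimal Java sketch. It is not HBase code: SketchTracker, StaleTrackerSketch and all of their methods are hypothetical stand-ins, and the real tracker and cleanup logic are considerably more involved. It only assumes what is described above, namely that replaying marks procedures as modified, that resetModified is skipped when finish throws, and that the cleanup treats older files as deletable when the tracker marks every active procedure as modified.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the store tracker; only the "modified" bits are modelled.
class SketchTracker {
    private final Set<Long> modified = new HashSet<>();

    void markModified(long procId) { modified.add(procId); }

    void resetModified() { modified.clear(); }

    boolean isModified(long procId) { return modified.contains(procId); }
}

class StaleTrackerSketch {
    // Mirrors the shape of the load path: replaying marks every procedure as
    // modified, and resetModified() is only reached if the finish step succeeds.
    static void load(SketchTracker tracker, long[] activeProcIds, boolean failOnFinish) {
        for (long id : activeProcIds) {
            tracker.markModified(id);
        }
        if (failOnFinish) {
            throw new IllegalStateException("simulated failure in finish");
        }
        tracker.resetModified();
    }

    // Simplified cleanup decision (an assumption): older files look deletable
    // when the tracker marks every active procedure as modified.
    static boolean olderLogsLookDeletable(SketchTracker tracker, long[] activeProcIds) {
        for (long id : activeProcIds) {
            if (!tracker.isModified(id)) {
                return false;
            }
        }
        return true;
    }

    // Runs one load and reports what the cleanup would conclude afterwards.
    static boolean simulate(boolean failOnFinish) {
        SketchTracker tracker = new SketchTracker();
        long[] active = {1L, 2L, 3L};
        try {
            load(tracker, active, failOnFinish);
        } catch (IllegalStateException e) {
            // The crash is swallowed here; the tracker keeps its stale modified bits,
            // just as the broken storeTracker is persisted on stop.
        }
        return olderLogsLookDeletable(tracker, active);
    }
}
```

In this model, simulate(true) reports that the older files look deletable because the modified bits were never reset, while simulate(false) does not.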
And although we still crash the second time, buildHoldingCleanupTracker and removeInactiveLogs are in the finally block, so the above logic is still executed and we then delete all the proc wal files...

Let me think about how to fix this.

[~stack] [~allan163] FYI.

> WALProcedure may remove proc wal files still with active procedures
> -------------------------------------------------------------------
>
>                 Key: HBASE-21490
>                 URL: https://issues.apache.org/jira/browse/HBASE-21490
>             Project: HBase
>          Issue Type: Sub-task
>          Components: proc-v2
>            Reporter: Duo Zhang
>            Priority: Major
>
> This has happened to me several times. After master restart, all the procedures are
> gone.
> The proc wal files were deleted before restarting; I see this in the
> master's log:
> {noformat}
> 2018-11-16,20:57:40,177 INFO [WALProcedureStoreSyncThread] org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Remove all state logs with ID less than 184, since all the active procedures are in the latest log
> 2018-11-16,20:57:40,177 INFO [WALProcedureStoreSyncThread] org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile: Archiving hdfs://c4tst-xiaomi/hbase/c4tst-sync1/MasterProcWALs/pv2-00000000000000000184.log to hdfs://c4tst-xiaomi/hbase/c4tst-sync1/oldWALs/pv2-00000000000000000184.log
> {noformat}
> Let me dig...

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)