[ 
https://issues.apache.org/jira/browse/HBASE-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690838#comment-16690838
 ] 

Duo Zhang commented on HBASE-21490:
-----------------------------------

OK I think I found the problem...

In ProcedureExecutor.load, we do the following in the finally block:

{code}
      try {
        // try to cleanup inactive wals and complete the operation
        buildHoldingCleanupTracker();
        tryCleanupLogsOnLoad();
        loading.set(false);
      } finally {
        lock.unlock();
      }
{code}

Also, in ProcedureExecutor.stop, we close the current log stream and persist
the current storeTracker into the file.

And this is the code for loading procedures:
{code}
  public static void load(Iterator<ProcedureWALFile> logs, ProcedureStoreTracker tracker,
      Loader loader) throws IOException {
    ProcedureWALFormatReader reader = new ProcedureWALFormatReader(tracker, loader);
    tracker.setKeepDeletes(true);
    try {
      // Ignore the last log which is current active log.
      while (logs.hasNext()) {
        ProcedureWALFile log = logs.next();
        log.open();
        try {
          reader.read(log);
        } finally {
          log.close();
        }
      }
      reader.finish();

      // The tracker is now updated with all the procedures read from the logs
      if (tracker.isPartial()) {
        tracker.setPartialFlag(false);
      }
      tracker.resetModified();
    } finally {
      tracker.setKeepDeletes(false);
    }
  }
{code}

For HBASE-21494, we throw an exception at reader.finish, so we do not unset
the partial flag and, more importantly, we do not call resetModified. This
means the current storeTracker will have all the active procedures marked as
modified.
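
To make the skipped reset concrete, here is a minimal sketch of the same
try/finally shape. SketchTracker and LoadSketch are hypothetical stand-ins, not
the real ProcedureStoreTracker or ProcedureWALFormat, and the sketch assumes
that reading an entry marks the procedure as modified in the tracker.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for ProcedureStoreTracker (not the HBase class).
class SketchTracker {
  private final Set<Long> modified = new HashSet<>();
  private boolean partial = true;
  private boolean keepDeletes;

  void markModified(long procId) { modified.add(procId); }
  void resetModified() { modified.clear(); }
  boolean isPartial() { return partial; }
  void setPartialFlag(boolean value) { partial = value; }
  void setKeepDeletes(boolean value) { keepDeletes = value; }
  boolean isKeepDeletes() { return keepDeletes; }
  Set<Long> modifiedIds() { return modified; }
}

class LoadSketch {
  // Same control flow as the load method quoted above. The finishFails flag
  // simulates reader.finish() throwing.
  static void load(SketchTracker tracker, long[] procIds, boolean finishFails)
      throws IOException {
    tracker.setKeepDeletes(true);
    try {
      for (long id : procIds) {
        // Assumption: reading an entry marks the procedure as modified.
        tracker.markModified(id);
      }
      if (finishFails) {
        throw new IOException("simulated failure in reader.finish()");
      }
      // Everything below is skipped when the exception is thrown.
      if (tracker.isPartial()) {
        tracker.setPartialFlag(false);
      }
      tracker.resetModified();
    } finally {
      // Only this runs on the failure path.
      tracker.setKeepDeletes(false);
    }
  }
}
```

With finishFails set to true, the tracker in this sketch stays partial and
keeps every procedure id in its modified set, which is the state that later
gets persisted.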

So after the first crash, we persist the broken storeTracker into the file.
When loading the second time, we load this storeTracker, and since we open
another new file, this one is no longer the last file, which means we use its
modified bits when building holdingCleanupTracker. It contains all the active
procedures, so we think it is OK to delete all the files before it...
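
A rough sketch of why the broken tracker makes older files look deletable. This
is a deliberate simplification, not the real holdingCleanupTracker logic: it
assumes each file is summarized by the set of procedure ids its tracker marks
as modified, and that files older than the point where the newer files cover
all active procedures may be removed. HoldingCleanupSketch and countDeletable
are hypothetical names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HoldingCleanupSketch {
  // modifiedPerFile is ordered oldest to newest. Returns how many of the
  // oldest files look deletable because the newer files already "cover"
  // every active procedure (simplified model, not the HBase implementation).
  static int countDeletable(Set<Long> activeProcs, List<Set<Long>> modifiedPerFile) {
    Set<Long> covered = new HashSet<>();
    for (int i = modifiedPerFile.size() - 1; i >= 0; i--) {
      covered.addAll(modifiedPerFile.get(i));
      if (covered.containsAll(activeProcs)) {
        return i; // files 0..i-1 are considered safe to delete
      }
    }
    return 0;
  }
}
```

In this model, a healthy layout such as [{1}, {2, 3}, {3}, {}] with active
procedures {1, 2, 3} yields nothing to delete, while a layout where the
second-newest file carries the broken tracker, [{1}, {2, 3}, {1, 2, 3}, {}],
marks both older files as deletable.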

And although we still crash the second time, buildHoldingCleanupTracker and
removeInactiveLogs are in the finally block, so the above logic is still
executed and we delete all the proc wal files...

Let me think about how to fix this.
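
The effect of having the cleanup in a finally block can be sketched as follows.
FinallyCleanupSketch and its parameters are hypothetical; the point is only
that the cleanup runs even when the load itself throws.

```java
import java.util.List;

class FinallyCleanupSketch {
  // walFiles is ordered oldest to newest and must be mutable.
  // trackerLooksComplete models a tracker that claims every active
  // procedure is already covered by the newest file.
  static void loadThenCleanup(List<String> walFiles, boolean loadFails,
      boolean trackerLooksComplete) {
    try {
      if (loadFails) {
        throw new IllegalStateException("simulated load failure");
      }
    } finally {
      // Runs on both the success and the failure path.
      if (trackerLooksComplete && walFiles.size() > 1) {
        // Drop every file except the newest one.
        walFiles.subList(0, walFiles.size() - 1).clear();
      }
    }
  }
}
```

One way to avoid this in the sketch would be to run the cleanup only after a
successful load instead of in finally; whether that matches the eventual fix
is not stated here.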

[~stack] [~allan163] FYI.

> WALProcedure may remove proc wal files still with active procedures
> -------------------------------------------------------------------
>
>                 Key: HBASE-21490
>                 URL: https://issues.apache.org/jira/browse/HBASE-21490
>             Project: HBase
>          Issue Type: Sub-task
>          Components: proc-v2
>            Reporter: Duo Zhang
>            Priority: Major
>
> This has happened to me several times. After a master restart, all the
> procedures are gone.
> The proc wal files were deleted before restarting; I see this in the
> master's log:
> {noformat}
> 2018-11-16,20:57:40,177 INFO [WALProcedureStoreSyncThread] org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Remove all state logs with ID less than 184, since all the active procedures are in the latest log
> 2018-11-16,20:57:40,177 INFO [WALProcedureStoreSyncThread] org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile: Archiving hdfs://c4tst-xiaomi/hbase/c4tst-sync1/MasterProcWALs/pv2-00000000000000000184.log to hdfs://c4tst-xiaomi/hbase/c4tst-sync1/oldWALs/pv2-00000000000000000184.log
> {noformat}
> Let me dig...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
