[ https://issues.apache.org/jira/browse/HBASE-22168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HBASE-22168:
-------------------------------------
Summary: proc WALs with non-corrupted-but-"corrupted" procedures block WAL
archiving forever (was: proc WALs with non-corrupted-but-"corrupted" block WAL
archiving forever)
> proc WALs with non-corrupted-but-"corrupted" procedures block WAL archiving
> forever
> -----------------------------------------------------------------------------------
>
> Key: HBASE-22168
> URL: https://issues.apache.org/jira/browse/HBASE-22168
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Critical
>
> I previously reported a bug where we get messages like this when loading proc WALs:
> {noformat}
> 2019-04-04 14:43:00,424 ERROR [master/...:becomeActiveMaster]
> wal.WALProcedureTree: Missing stack id 43459, max stack id is 43460, root
> procedure is Procedure(pid=43645, ppid=-1,
> class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
> {noformat}
> resulting in
> {noformat}
> 2019-04-04 14:43:16,176 ERROR [...:17000:becomeActiveMaster]
> procedure2.ProcedureExecutor: Corrupt pid=43645,
> state=WAITING:SERVER_CRASH_FINISH, hasLock=false; ServerCrashProcedure
> server=..., splitWal=true, meta=false
> {noformat}
> There is no actual corruption in the file, so it never gets moved to
> corrupted files.
> However, as far as I can tell (I didn't spend a lot of time looking at the
> code), the tracker does not account for this kind of procedure. As a result,
> we get hundreds of proc WALs that are stuck forever because of some ancient
> file containing such procedures, and that makes master startup take a long
> time.
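
The blocking effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not HBase code: the class `ProcWalArchiveSketch` and the method `countArchivable` are hypothetical, and the model assumes that archiving walks the proc WALs oldest-first and stops at the first file that still holds a procedure considered live. Under that assumption, a single procedure that is flagged "Corrupt" but never cleared from the oldest file keeps every newer file from being archived.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: this is NOT HBase code, and every name here is hypothetical.
final class ProcWalArchiveSketch {

    /**
     * Each element of walsOldestFirst is the set of procedure ids that are
     * still considered live in that WAL. Assumption: archiving proceeds
     * oldest-first and stops at the first WAL that still holds a live procedure.
     */
    static int countArchivable(List<Set<Long>> walsOldestFirst) {
        int archivable = 0;
        for (Set<Long> livePids : walsOldestFirst) {
            if (!livePids.isEmpty()) {
                break; // this WAL is still needed, so all newer WALs are kept too
            }
            archivable++;
        }
        return archivable;
    }

    public static void main(String[] args) {
        List<Set<Long>> wals = new ArrayList<>();
        Set<Long> oldest = new HashSet<>();
        oldest.add(43645L); // the pid from the log above, never cleared
        wals.add(oldest);
        for (int i = 0; i < 300; i++) {
            wals.add(new HashSet<>()); // newer WALs with nothing live in them
        }
        System.out.println(countArchivable(wals)); // prints 0

        oldest.clear(); // once the stale procedure is accounted for
        System.out.println(countArchivable(wals)); // prints 301
    }
}
```

In this model the fix is whatever clears the stale entry from the oldest file; how the real tracker should do that is not something the report establishes.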
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)