[
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035555#comment-16035555
]
stack commented on HBASE-18152:
-------------------------------
This is what corrupt procedures show up as:
{code}
2017-04-17 22:34:50,044 ERROR [ve0524:16000.masterManager] procedure2.ProcedureExecutor: Corrupt Procedure=org.apache.hadoop.hbase.master.assignment.AssignProcedure (id=16, parent=7, owner=stack, state=RUNNABLE, submittedTime=14sec ago, lastUpdate=14sec ago)
2017-04-17 22:34:50,044 ERROR [ve0524:16000.masterManager] procedure2.ProcedureExecutor: Corrupt Procedure=org.apache.hadoop.hbase.master.assignment.AssignProcedure (id=13, parent=7, owner=stack, state=RUNNABLE, submittedTime=14sec ago, lastUpdate=14sec ago)
2017-04-17 22:34:50,044 ERROR [ve0524:16000.masterManager] procedure2.ProcedureExecutor: Corrupt Procedure=org.apache.hadoop.hbase.master.assignment.AssignProcedure (id=11, parent=7, owner=stack, state=RUNNABLE, submittedTime=14sec ago, lastUpdate=14sec ago)
{code}
The attached patch is a workaround: at read time it checks that each newly
read entry is definitely 'increasing' compared to the entry we currently hold
for that Procedure. If it is not, we WARN and drop it. This workaround handles
the corruption shown here. I will run with it to see if I can find other
corruption types and to figure out how we are writing out of order.
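To make the idea concrete, here is a rough sketch of that kind of read-time
check. It is not the code in the attached patch. The class OrderedEntryFilter,
the Entry type, and the updateSeq field are all hypothetical names, and the
sketch assumes each entry carries some per-Procedure counter that should only
ever grow. Treating an equal counter as "not increasing" is also an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; NOT the HBase procedure WAL reader. */
class OrderedEntryFilter {

    /** Stand-in for a WAL entry. updateSeq is an assumed per-Procedure counter. */
    static final class Entry {
        final long procId;
        final long updateSeq;

        Entry(long procId, long updateSeq) {
            this.procId = procId;
            this.updateSeq = updateSeq;
        }
    }

    // Latest accepted entry per Procedure id.
    private final Map<Long, Entry> current = new HashMap<>();
    // Entries rejected because they were not 'increasing'.
    private final List<Entry> dropped = new ArrayList<>();

    /**
     * Accepts the entry only if it is strictly newer than the one we already
     * hold for the same Procedure. Otherwise warns and drops it.
     */
    boolean accept(Entry e) {
        Entry existing = current.get(e.procId);
        if (existing != null && e.updateSeq <= existing.updateSeq) {
            System.err.println("WARN dropping out-of-order entry: procId=" + e.procId
                + ", updateSeq=" + e.updateSeq + ", current=" + existing.updateSeq);
            dropped.add(e);
            return false;
        }
        current.put(e.procId, e);
        return true;
    }

    /** Counter of the entry currently held for procId, or -1 if none. */
    long currentSeq(long procId) {
        Entry e = current.get(procId);
        return e == null ? -1 : e.updateSeq;
    }

    int droppedCount() {
        return dropped.size();
    }
}
```

A filter like this only hides the symptom on the read side. It says nothing
about why the writer emitted entries out of order in the first place.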
> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> ---------------------------------------------------------------------
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 2.0.0
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch,
> pv2-00000000000000000047.log, reading_bad_wal.patch
>
>
> I've seen corruption from time to time while testing. It's rare enough.
> Often we can get over it but sometimes we can't. It took me a while to
> capture an instance of corruption. Turns out we are writing to the WAL out of
> order, which undoes a basic tenet: that WAL content is ordered in line w/
> execution. Below I'll post a corrupt WAL.
> Looking at the write side, there is a lot going on. I'm not clear on how we
> could write out of order. Will try to get more insight. In the meantime,
> parking this issue here to fill data into.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)