[ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555146#comment-16555146
 ] 

stack commented on HBASE-18152:
-------------------------------

A few notes:

 * I spent a good part of today trying to reproduce this but was unable to (see 
[^0001-TestWALProcedureExecutore-order-checking-test-that-d.patch] ). Even with 
high concurrency and artificial friction, 160 concurrent worker threads always 
wrote in order against the default 16 slots.
 * Interesting is that a Worker thread will usually try to run a Procedure 
through to the end UNLESS it suspends or returns true from its 
isYieldAfterExecutionStep implementation. In this case, the AssignProcedure 
suspends itself on the *Starting* step as part of normal operation; a thread in 
the AssignmentManager is in charge of waking the assign. This is how we make 
sure meta goes out first and how we batch up assignments. This technique makes 
the AssignProcedure in particular susceptible to a thread switch when we move 
to *Dispatch*.
 * Let me see what it would take to get rid of this WALProcedureStore. It is 
doing what we do elsewhere when writing the WAL, but using different primitives 
(Conditions) with recovery and WAL rolling.
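
To make the second note concrete, here is a rough sketch of the worker behaviour 
described above. It is a toy model only, not the HBase code: ToyProcedure, 
ToyAssign, runOnWorker, the Outcome enum and the START/DISPATCH/FINISH step 
names are all made up for illustration. It shows a worker stepping a procedure 
until it finishes, suspends, or asks to yield, and how a suspend after the first 
step lets a different worker run the dispatch step.

```java
import java.util.ArrayList;
import java.util.List;

public class WorkerLoopSketch {
  enum Outcome { MORE_STEPS, SUSPENDED, DONE }

  /** Hypothetical stand-in for a procedure; not the HBase Procedure class. */
  interface ToyProcedure {
    Outcome executeStep(String worker, List<String> log);
    boolean isYieldAfterExecutionStep();
  }

  /** Toy assign: suspends after its first step, then dispatches and finishes. */
  static class ToyAssign implements ToyProcedure {
    private int step = 0; // 0 = START, 1 = DISPATCH, 2 = FINISH (illustrative names)

    @Override
    public Outcome executeStep(String worker, List<String> log) {
      switch (step++) {
        case 0:
          log.add(worker + ":START");
          return Outcome.SUSPENDED; // waits for another thread to wake it
        case 1:
          log.add(worker + ":DISPATCH");
          return Outcome.MORE_STEPS;
        default:
          log.add(worker + ":FINISH");
          return Outcome.DONE;
      }
    }

    @Override
    public boolean isYieldAfterExecutionStep() {
      return false;
    }
  }

  /** A worker keeps stepping one procedure until it finishes, suspends, or yields. */
  static Outcome runOnWorker(String worker, ToyProcedure proc, List<String> log) {
    while (true) {
      Outcome outcome = proc.executeStep(worker, log);
      if (outcome != Outcome.MORE_STEPS) {
        return outcome;
      }
      if (proc.isYieldAfterExecutionStep()) {
        return Outcome.MORE_STEPS; // handed back with steps remaining
      }
    }
  }

  /** Drives the toy assign: w1 runs until the suspend, then w2 resumes it. */
  static List<String> demo() {
    List<String> log = new ArrayList<>();
    ToyProcedure assign = new ToyAssign();
    Outcome first = runOnWorker("w1", assign, log);
    if (first == Outcome.SUSPENDED) {
      // Assumption: once woken, the procedure may land on a different worker.
      runOnWorker("w2", assign, log);
    }
    return log;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [w1:START, w2:DISPATCH, w2:FINISH]
  }
}
```

The point of the toy is only that the suspend hands the procedure back, so the 
step after the wake-up need not run on the thread that ran the step before it.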
 



> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> ---------------------------------------------------------------------
>
>                 Key: HBASE-18152
>                 URL: https://issues.apache.org/jira/browse/HBASE-18152
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.0.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 3.0.0
>
>         Attachments: 
> 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch, 
> HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-000002.hwx.site.log.gz, 
> pv2-00000000000000000036.log, pv2-00000000000000000047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time to time while testing. It's rare enough. Often 
> we can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are writing to the WAL out of order, which 
> undoes a basic tenet: that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
