[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555146#comment-16555146 ]
stack commented on HBASE-18152:
-------------------------------

A few notes:

 * I spent a good part of today trying to reproduce the problem but was unable to (see [^0001-TestWALProcedureExecutore-order-checking-test-that-d.patch]). Even with high concurrency and artificial friction, 160 concurrent worker threads always wrote in order against the default 16 slots.
 * Interestingly, a Worker thread will usually try to run a Procedure through to the end UNLESS the Procedure suspends or returns true from its isYieldAfterExecutionStep implementation. In this case, the AssignProcedure suspends itself on the *Starting* step as part of normal operation; a thread in the AssignmentManager is in charge of waking the assign. This is how we make sure meta goes out first and how we batch up assignments. This technique makes the AssignProcedure in particular susceptible to a thread switch when we move to *Dispatch*.
 * Let me see what it would take to get rid of this WALProcedureStore. It does what we do elsewhere when writing a WAL, but uses different primitives (Conditions) with recovery and WAL rolling.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> ---------------------------------------------------------------------
>
>                 Key: HBASE-18152
>                 URL: https://issues.apache.org/jira/browse/HBASE-18152
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.0.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 3.0.0
>
>         Attachments: 0001-TestWALProcedureExecutore-order-checking-test-that-d.patch,
>                      HBASE-18152.master.001.patch,
>                      hbase-hbase-master-ctr-e138-1518143905142-221855-01-000002.hwx.site.log.gz,
>                      pv2-00000000000000000036.log, pv2-00000000000000000047.log,
>                      reading_bad_wal.patch
>
>
> I've seen corruption from time to time in testing. It's rare enough. Often we
> can get over it but sometimes we can't. It took me a while to capture an
> instance of corruption.
> Turns out we are writing to the WAL out of order, which undoes a basic
> tenet: that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write side, there is a lot going on. I'm not clear on how we
> could write out of order. Will try and get more insight. Meantime, parking
> this issue here to fill data into.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)