[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

Josh Elser (JIRA) Fri, 13 Apr 2018 13:04:25 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437835#comment-16437835
 ]


Josh Elser commented on HBASE-18152:
------------------------------------

I think I saw something similar to this, but it may be less of the bug Stack 
outlines and more self-inflicted. I was trying out 2.0.0-rc0 on Hadoop 3.

The version of Hadoop3 I had installed doesn't work with the asyncwal, which I 
noticed after tryign to start up the first time. Then, I realized I needed to 
set the meta-wal provider configuration option in addition to the normal wal 
provider option. IIRC, {{stop-hbase.sh}} hung so I ended up {{kill}}'ing the 
process. This caused a bunch of WAL split failures in the Master to get logged 
around {{2018-04-12 21:43:01}}.

Interestingly, the Master eventually started complaining around the region 
transition that the connection was null (because the server was stopping, I 
assume)
{code:java}
2018-04-12 21:43:05,781 WARN  [PEWorker-8] 
assignment.RegionTransitionProcedure: Retryable error trying to transition: 
pid=3586, ppid=3534, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=TestTable, region=0e5c6da974e1d3753ba0cb6a8f0db5b1; rit=OPENING, 
location=ctr-e138-1518143905142-221855-01-000009.hwx.site,16020,1523564755147
java.io.InterruptedIOException: Interrupted after 7 tries while maxAttempts=31
    at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:173)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:386)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:360)
    at 
org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1076)
    at 
org.apache.hadoop.hbase.master.TableStateManager.readMetaState(TableStateManager.java:213)
    at 
org.apache.hadoop.hbase.master.TableStateManager.getTableState(TableStateManager.java:191)
    at 
org.apache.hadoop.hbase.master.assignment.AssignProcedure.startTransition(AssignProcedure.java:163)
    at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:295)
    at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:85)
    at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1453)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741)
{code}
After lots of these, I think I eventually kill'ed the Master and restarted it. 
When it came back, I got all of the corrupt pv2 wal errors:
{noformat}
2018-04-12 22:05:24,050 INFO  
[master/ctr-e138-1518143905142-221855-01-000002:16000] util.FSHDFSUtils: 
Recover lease on dfs file 
hdfs://ctr-e138-1518143905142-221855-01-000002.hwx.site:8020/user/hbase/data/MasterProcWALs/pv2-00000000000000000008.log
2018-04-12 22:05:24,054 INFO  
[master/ctr-e138-1518143905142-221855-01-000002:16000] util.FSHDFSUtils: 
Recovered lease, attempt=0 on 
file=hdfs://ctr-e138-1518143905142-221855-01-000002.hwx.site:8020/user/hbase/data/MasterProcWALs/pv2-00000000000000000008.log
 after 3ms
2018-04-12 22:05:24,103 INFO  
[master/ctr-e138-1518143905142-221855-01-000002:16000] 
procedure2.ProcedureExecutor: Recovered WALProcedureStore lease in 55msec
2018-04-12 22:05:24,431 ERROR 
[master/ctr-e138-1518143905142-221855-01-000002:16000] 
procedure2.ProcedureExecutor: Corrupt pid=3580, ppid=3534, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=TestTable, 
region=71fefffe6b5b3cf1cb6d3328a5a58690
2018-04-12 22:05:24,431 ERROR 
[master/ctr-e138-1518143905142-221855-01-000002:16000] 
procedure2.ProcedureExecutor: Corrupt pid=3586, ppid=3534, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=TestTable, 
region=0e5c6da974e1d3753ba0cb6a8f0db5b1
2018-04-12 22:05:24,431 ERROR 
[master/ctr-e138-1518143905142-221855-01-000002:16000] 
procedure2.ProcedureExecutor: Corrupt pid=3593, ppid=3534, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=TestTable, 
region=f19fbb2563464bcafd528eeba863c10f
2018-04-12 22:05:24,431 ERROR 
[master/ctr-e138-1518143905142-221855-01-000002:16000] 
procedure2.ProcedureExecutor: Corrupt pid=3592, ppid=3534, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=TestTable, 
region=7df42ee12cc6ff354144e7915e6f0c5a
{noformat}
Will attach the relevant master log. I have ITBLL running now, but will try to 
reproduce these steps and see if I can get something reliably happening.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> ---------------------------------------------------------------------
>
>                 Key: HBASE-18152
>                 URL: https://issues.apache.org/jira/browse/HBASE-18152
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.0.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-18152.master.001.patch, 
> pv2-00000000000000000036.log, pv2-00000000000000000047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

Reply via email to