[
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437835#comment-16437835
]
Josh Elser commented on HBASE-18152:
------------------------------------
I think I saw something similar to this, but it may be less of the bug Stack
outlines and more self-inflicted. I was trying out 2.0.0-rc0 on Hadoop 3.
The version of Hadoop3 I had installed doesn't work with the asyncwal, which I
noticed after tryign to start up the first time. Then, I realized I needed to
set the meta-wal provider configuration option in addition to the normal wal
provider option. IIRC, {{stop-hbase.sh}} hung so I ended up {{kill}}'ing the
process. This caused a bunch of WAL split failures in the Master to get logged
around {{2018-04-12 21:43:01}}.
Interestingly, the Master eventually started complaining around the region
transition that the connection was null (because the server was stopping, I
assume)
{code:java}
2018-04-12 21:43:05,781 WARN [PEWorker-8]
assignment.RegionTransitionProcedure: Retryable error trying to transition:
pid=3586, ppid=3534, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=TestTable, region=0e5c6da974e1d3753ba0cb6a8f0db5b1; rit=OPENING,
location=ctr-e138-1518143905142-221855-01-000009.hwx.site,16020,1523564755147
java.io.InterruptedIOException: Interrupted after 7 tries while maxAttempts=31
at
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:173)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:386)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:360)
at
org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1076)
at
org.apache.hadoop.hbase.master.TableStateManager.readMetaState(TableStateManager.java:213)
at
org.apache.hadoop.hbase.master.TableStateManager.getTableState(TableStateManager.java:191)
at
org.apache.hadoop.hbase.master.assignment.AssignProcedure.startTransition(AssignProcedure.java:163)
at
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:295)
at
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:85)
at
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1453)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741)
{code}
After lots of these, I think I eventually kill'ed the Master and restarted it.
When it came back, I got all of the corrupt pv2 wal errors:
{noformat}
2018-04-12 22:05:24,050 INFO
[master/ctr-e138-1518143905142-221855-01-000002:16000] util.FSHDFSUtils:
Recover lease on dfs file
hdfs://ctr-e138-1518143905142-221855-01-000002.hwx.site:8020/user/hbase/data/MasterProcWALs/pv2-00000000000000000008.log
2018-04-12 22:05:24,054 INFO
[master/ctr-e138-1518143905142-221855-01-000002:16000] util.FSHDFSUtils:
Recovered lease, attempt=0 on
file=hdfs://ctr-e138-1518143905142-221855-01-000002.hwx.site:8020/user/hbase/data/MasterProcWALs/pv2-00000000000000000008.log
after 3ms
2018-04-12 22:05:24,103 INFO
[master/ctr-e138-1518143905142-221855-01-000002:16000]
procedure2.ProcedureExecutor: Recovered WALProcedureStore lease in 55msec
2018-04-12 22:05:24,431 ERROR
[master/ctr-e138-1518143905142-221855-01-000002:16000]
procedure2.ProcedureExecutor: Corrupt pid=3580, ppid=3534,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=TestTable,
region=71fefffe6b5b3cf1cb6d3328a5a58690
2018-04-12 22:05:24,431 ERROR
[master/ctr-e138-1518143905142-221855-01-000002:16000]
procedure2.ProcedureExecutor: Corrupt pid=3586, ppid=3534,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=TestTable,
region=0e5c6da974e1d3753ba0cb6a8f0db5b1
2018-04-12 22:05:24,431 ERROR
[master/ctr-e138-1518143905142-221855-01-000002:16000]
procedure2.ProcedureExecutor: Corrupt pid=3593, ppid=3534,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=TestTable,
region=f19fbb2563464bcafd528eeba863c10f
2018-04-12 22:05:24,431 ERROR
[master/ctr-e138-1518143905142-221855-01-000002:16000]
procedure2.ProcedureExecutor: Corrupt pid=3592, ppid=3534,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=TestTable,
region=7df42ee12cc6ff354144e7915e6f0c5a
{noformat}
Will attach the relevant master log. I have ITBLL running now, but will try to
reproduce these steps and see if I can get something reliably happening.
> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> ---------------------------------------------------------------------
>
> Key: HBASE-18152
> URL: https://issues.apache.org/jira/browse/HBASE-18152
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 2.0.0
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-18152.master.001.patch,
> pv2-00000000000000000036.log, pv2-00000000000000000047.log,
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing. Its rare enough. Often we
> can get over it but sometimes we can't. It took me a while to capture an
> instance of corruption. Turns out we are write to the WAL out-of-order which
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we
> could write out of order. Will try and get more insight. Meantime parking
> this issue here to fill data into.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)