[
https://issues.apache.org/jira/browse/HBASE-22761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503599#comment-17503599
]
Xiaolin Ha commented on HBASE-22761:
------------------------------------
Though we can ensure WAL entries are not lost, we cannot ensure that every
WAL entry is flushed exactly once, right? If an entry is flushed twice,
AsyncFSWAL#unackedAppends will release the entry twice, and the second flush
may write dirty data because the entry's direct ByteBuffer has already been
released.
Here is a scenario:
# AsyncFSWAL#toWriteAppends=1,2,3,4,5,6,7,8,9,10...
# sync entries 1,2,3, whose total size reaches the configured batch size:
toWriteAppends=4,5,6,7,8,9, unackedAppends=1,2,3;
# sync entries 4,5,6: toWriteAppends=7,8,9, unackedAppends=1,2,3,4,5,6;
# sync entries 7,8,9: toWriteAppends=10..., unackedAppends=1,2,3,4,5,6,7,8,9;
# the sync of 1,2,3 fails, so all the entries in unackedAppends are added back
to toWriteAppends: toWriteAppends=1,2,3,4,5,6,7,8,9,10...,
unackedAppends=1,2,3,4,5,6,7,8,9;
# the WAL rolls successfully, triggered just by the broken writer;
# the last-epoch WAL writer fails to flush 4,5,6, but nothing happens at this
point: toWriteAppends=1,2,3,4,5,6,7,8,9,10..., unackedAppends=1,2,3,4,5,6,7,8,9;
# the last-epoch WAL writer flushes 7,8,9 successfully, so unackedAppends=[]
and entries 1,2,3,4,5,6,7,8,9 are released (although they are still in
toWriteAppends, waiting for the next new writer to flush them);
# the new-epoch WAL writer syncs entries 1,2,3: toWriteAppends=4,5,6,7,8,9,
unackedAppends=1,2,3;
# 1,2,3 are flushed successfully, but the direct ByteBuffers they refer to have
already been released and may have been overwritten with new data.
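The hazard in the last two steps can be sketched with a toy example (hypothetical code, not HBase's actual AsyncFSWAL or buffer-pool implementation; the class, pool, and method names are invented for illustration): a pooled buffer released by a stale ack is handed out again while a re-queued entry still references it, so the entry's second flush carries the new occupant's bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

// Hypothetical sketch of the release-then-reuse hazard described above.
public class DoubleReleaseSketch {

  // Stands in for the pool backing the WAL entries' (direct) ByteBuffers.
  private static final ArrayDeque<ByteBuffer> POOL = new ArrayDeque<>();

  static ByteBuffer acquire() {
    ByteBuffer b = POOL.poll();
    return b != null ? b : ByteBuffer.allocate(16);
  }

  static void release(ByteBuffer b) {
    b.clear();
    POOL.push(b); // no guard against releasing while a stale reference exists
  }

  /** Returns the payload the "second flush" of entry 1 would actually write. */
  static String demo() {
    ByteBuffer entry1 = acquire();
    entry1.put("edit-1".getBytes(StandardCharsets.UTF_8)).flip();

    ByteBuffer stale = entry1;   // entry 1 was re-queued in toWriteAppends...
    release(entry1);             // ...but the old writer's ack already released it

    ByteBuffer entry2 = acquire();                 // pool hands the buffer out again
    entry2.put("edit-2".getBytes(StandardCharsets.UTF_8)).flip();

    byte[] payload = new byte[stale.remaining()];  // second flush of entry 1
    stale.duplicate().get(payload);
    return new String(payload, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints "edit-2": entry 1's flush carries entry 2's data
  }
}
```

In the real code the buffers are pooled direct buffers managed by the WAL, but the pattern is the same: a reference held by the re-queued entry outlives the release, so the second flush observes whatever was written into the recycled buffer.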
[~zhangduo] [~comnetwork] what do you think?
> Caught ArrayIndexOutOfBoundsException while processing event RS_LOG_REPLAY
> --------------------------------------------------------------------------
>
> Key: HBASE-22761
> URL: https://issues.apache.org/jira/browse/HBASE-22761
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.1.1
> Reporter: casuallc
> Priority: Major
> Attachments: tmp
>
>
> RegionServer exits when the error happens
> {code:java}
> 2019-07-29 20:51:09,726 INFO [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] wal.WALSplitter: Processed 0 edits across 0 regions; edits skipped=0; log file=hdfs://cluster1/hbase/WALs/h2,16020,1564216856546-splitting/h2%2C16020%2C1564216856546.1564398538121, length=615233, corrupted=false, progress failed=false
> 2019-07-29 20:51:09,726 INFO [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] handler.WALSplitterHandler: Worker h1,16020,1564404572589 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@577da0d3 in 84892ms. Status = null
> 2019-07-29 20:51:09,726 ERROR [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
> java.lang.ArrayIndexOutOfBoundsException: 16403
>   at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
>   at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
>   at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
>   at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
>   at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
>   at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
>   at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
>   at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
>   at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)
>   at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 2019-07-29 20:51:09,730 ERROR [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] regionserver.HRegionServer: ***** ABORTING region server h1,16020,1564404572589: Caught throwable while processing event RS_LOG_REPLAY *****
> java.lang.ArrayIndexOutOfBoundsException: 16403
>   at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
>   at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
>   at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
>   at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
>   at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
>   at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
>   at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
>   at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
>   at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)
>   at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)