[ https://issues.apache.org/jira/browse/HBASE-22761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503599#comment-17503599 ]

Xiaolin Ha edited comment on HBASE-22761 at 3/9/22, 1:55 PM:
-------------------------------------------------------------

Even though we can keep WAL entries from being lost, we cannot ensure that every 
WAL entry is flushed exactly once, right? If an entry is flushed twice, 
AsyncFSWAL#unackedAppends will release the entry twice, and the second flushed 
entry may contain dirty data because its direct ByteBuffer has already been 
released (see the sketch after the scenario below).

Here is a scenario:
 # AsyncFSWAL#toWriteAppends=1,2,3,4,5,6,7,8,9,10...
 # Sync entries 1,2,3, whose total size reaches the configured batch size. 
toWriteAppends=4,5,6,7,8,9, unackedAppends=1,2,3;
 # Sync entries 4,5,6: toWriteAppends=7,8,9, unackedAppends=1,2,3,4,5,6;
 # Sync entries 7,8,9: toWriteAppends=10..., unackedAppends=1,2,3,4,5,6,7,8,9;
 # The sync of 1,2,3 fails, so all entries in unackedAppends are added back 
to toWriteAppends: toWriteAppends=1,2,3,4,5,6,7,8,9,10..., 
unackedAppends=1,2,3,4,5,6,7,8,9;
 # The WAL is rolled successfully, triggered only by the broken writer;
 # The last-epoch WAL writer's flush of 4,5,6 fails, but nothing happens at this point: 
toWriteAppends=1,2,3,4,5,6,7,8,9,10..., unackedAppends=1,2,3,4,5,6,7,8,9;
 # The last-epoch WAL writer's flush of 7,8,9 succeeds, so unackedAppends=[] and 
entries 1,2,3,4,5,6,7,8,9 are released (even though they are still in toWriteAppends, 
waiting for the new writer to flush them);
 # The new-epoch WAL writer syncs entries 1,2,3: toWriteAppends=4,5,6,7,8,9, 
unackedAppends=1,2,3;
 # 1,2,3 are flushed successfully, but the direct ByteBuffers they refer to have 
already been released and may have been overwritten with new data.
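
To make the hazard concrete, here is a minimal, runnable sketch. This is NOT the actual AsyncFSWAL code: the Entry class, the two deques, and release() are simplified stand-ins for FSWALEntry, toWriteAppends/unackedAppends, and returning the entry's direct ByteBuffer to the pool.
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of the double-release hazard described in the scenario above.
public class DoubleReleaseSketch {

  static class Entry {
    final int id;
    boolean released;

    Entry(int id) { this.id = id; }

    void release() {
      if (released) {
        System.out.println("entry " + id + " released twice -> risk of dirty data");
      }
      released = true;
    }
  }

  public static void main(String[] args) {
    Deque<Entry> toWriteAppends = new ArrayDeque<>();
    Deque<Entry> unackedAppends = new ArrayDeque<>();
    for (int i = 1; i <= 9; i++) {
      toWriteAppends.add(new Entry(i));
    }

    // Steps 2-4: three batches go to the old writer and are tracked as unacked.
    for (int i = 0; i < 9; i++) {
      unackedAppends.add(toWriteAppends.poll());
    }

    // Step 5: the sync of 1,2,3 fails; everything unacked is re-queued for the
    // next writer, but unackedAppends itself is NOT cleared.
    toWriteAppends.addAll(unackedAppends);

    // Step 8: a late successful ack from the old-epoch writer drains
    // unackedAppends and releases 1..9, although they still sit in toWriteAppends.
    while (!unackedAppends.isEmpty()) {
      unackedAppends.poll().release();
    }

    // Steps 9-10: the new-epoch writer syncs 1,2,3 and, on success, releases
    // them a second time; their buffers may already hold new data.
    for (int i = 0; i < 3; i++) {
      unackedAppends.add(toWriteAppends.poll());
    }
    while (!unackedAppends.isEmpty()) {
      unackedAppends.poll().release(); // prints the double-release warning
    }
  }
}
{code}
The sketch only models the queue bookkeeping; the key point is that the late ack from the old-epoch writer drains unackedAppends while the same entries are still queued in toWriteAppends, so the new writer's success releases them a second time.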

[~zhangduo]  [~comnetwork] what do you think?



> Caught ArrayIndexOutOfBoundsException while processing event RS_LOG_REPLAY
> --------------------------------------------------------------------------
>
>                 Key: HBASE-22761
>                 URL: https://issues.apache.org/jira/browse/HBASE-22761
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.1.1
>            Reporter: casuallc
>            Priority: Major
>         Attachments: tmp
>
>
> RegionServer exits when the error happens
> {code:java}
> 2019-07-29 20:51:09,726 INFO [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] wal.WALSplitter: Processed 0 edits across 0 regions; edits skipped=0; log file=hdfs://cluster1/hbase/WALs/h2,16020,1564216856546-splitting/h2%2C16020%2C1564216856546.1564398538121, length=615233, corrupted=false, progress failed=false
> 2019-07-29 20:51:09,726 INFO [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] handler.WALSplitterHandler: Worker h1,16020,1564404572589 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@577da0d3 in 84892ms. Status = null
> 2019-07-29 20:51:09,726 ERROR [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
> java.lang.ArrayIndexOutOfBoundsException: 16403
> at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
> at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
> at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
> at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
> at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
> at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
> at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)
> at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2019-07-29 20:51:09,730 ERROR [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] regionserver.HRegionServer: ***** ABORTING region server h1,16020,1564404572589: Caught throwable while processing event RS_LOG_REPLAY *****
> java.lang.ArrayIndexOutOfBoundsException: 16403
> at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
> at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
> at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
> at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
> at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
> at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
> at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)
> at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  


