[
https://issues.apache.org/jira/browse/HBASE-22761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506070#comment-17506070
]
Xiaolin Ha edited comment on HBASE-22761 at 3/14/22, 8:53 AM:
--------------------------------------------------------------
Thanks for your reply, [~zhangduo].
> Actually the syncFailed has a greater sequence id but we do not pass it to
> the upper layer, and it is not practical to wait for the later syncCompleted,
> we should try to recover ASAP.
I agree with you; we should recover those waiting futures ASAP. But currently, when a sync fails with a higher seqid, failWaitingAckQueue only fails the not-yet-acked callbacks on the failing channel, while futures with lower seqids that are still waiting on other channels should also be completed.
{code:java}
private void failWaitingAckQueue(Channel channel,
    java.util.function.Supplier<Throwable> errorSupplier) {
  Throwable error = errorSupplier.get();
  for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
    Callback c = iter.next();
    // find the first sync request which we have not acked yet and fail all the request after it.
    if (!c.unfinishedReplicas.contains(channel.id())) {
      continue;
    }
    for (;;) {
      c.future.completeExceptionally(error);
      ...
{code}
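To make the point concrete, a fix along the lines I described would fail every pending callback from the first unacked one onward, regardless of which channel reported the error. This is a minimal sketch only; it reuses the field and type names from the excerpt above (waitingAckQueue, Callback) and is not the actual patch:
{code:java}
// Sketch: once any channel fails, complete every remaining future exceptionally,
// including futures with lower seqids that are still waiting on other channels.
private void failAllWaitingAcks(java.util.function.Supplier<Throwable> errorSupplier) {
  Throwable error = errorSupplier.get();
  for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
    Callback c = iter.next();
    c.future.completeExceptionally(error);
    iter.remove();
  }
}
{code}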
> Maybe we could just check whether the writer instance is still the same in
> syncCompleted? For normal case, it is impossible that we still want to
> complete a request for the previous writer? We need to make sure all the
> outcoming requests are finished before rolling?
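To make that suggestion concrete, the guard could be a simple identity check at the top of syncCompleted. A minimal sketch, assuming syncCompleted is handed the writer instance it was invoked for (the signature and field names here are illustrative, not the real method):
{code:java}
// Sketch of the suggested guard: drop completions that belong to a writer
// we have already rolled away from; the roll path would then be responsible
// for finishing or failing that writer's outstanding requests.
private void syncCompleted(AsyncWriter writer, long processedTxid, long startTimeNs) {
  if (writer != this.writer) {
    return;
  }
  // ... normal completion handling ...
}
{code}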
I added a simple fix in HBASE-26832 before your reply, but it is not enough.
You can create a new issue, or I can assign this one to you.
Thanks.
> Caught ArrayIndexOutOfBoundsException while processing event RS_LOG_REPLAY
> --------------------------------------------------------------------------
>
> Key: HBASE-22761
> URL: https://issues.apache.org/jira/browse/HBASE-22761
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.1.1
> Reporter: casuallc
> Priority: Major
> Attachments: tmp
>
>
> The RegionServer exits when the error happens:
> {code:java}
> 2019-07-29 20:51:09,726 INFO [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] wal.WALSplitter: Processed 0 edits across 0 regions; edits skipped=0; log file=hdfs://cluster1/hbase/WALs/h2,16020,1564216856546-splitting/h2%2C16020%2C1564216856546.1564398538121, length=615233, corrupted=false, progress failed=false
> 2019-07-29 20:51:09,726 INFO [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] handler.WALSplitterHandler: Worker h1,16020,1564404572589 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@577da0d3 in 84892ms. Status = null
> 2019-07-29 20:51:09,726 ERROR [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
> java.lang.ArrayIndexOutOfBoundsException: 16403
>   at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
>   at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
>   at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
>   at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
>   at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
>   at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
>   at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
>   at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
>   at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)
>   at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 2019-07-29 20:51:09,730 ERROR [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] regionserver.HRegionServer: ***** ABORTING region server h1,16020,1564404572589: Caught throwable while processing event RS_LOG_REPLAY *****
> java.lang.ArrayIndexOutOfBoundsException: 16403
>   at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
>   at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
>   at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
>   at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
>   at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
>   at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
>   at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
>   at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
>   at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)
>   at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
>