[ https://issues.apache.org/jira/browse/HBASE-22761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506070#comment-17506070 ]

Xiaolin Ha edited comment on HBASE-22761 at 3/14/22, 8:53 AM:
--------------------------------------------------------------

Thanks for your reply, [~zhangduo] .

> Actually the syncFailed has a greater sequence id but we do not pass it to 
> the upper layer, and it is not practical to wait for the later syncCompleted, 
> we should try to recover ASAP.

I agree with you; we should fail those waiting futures and recover ASAP. But currently, when a sync with a higher seqid fails, only the not-yet-acked callbacks waiting on the same channel are failed; the lower-seqid futures waiting on other channels should be failed as well.
{code:java}
private void failWaitingAckQueue(Channel channel,
    java.util.function.Supplier<Throwable> errorSupplier) {
  Throwable error = errorSupplier.get();
  for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
    Callback c = iter.next();
    // find the first sync request which we have not acked yet and fail all the
    // requests after it.
    if (!c.unfinishedReplicas.contains(channel.id())) {
      continue;
    }
    for (;;) {
      c.future.completeExceptionally(error);
      .....
{code}
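To illustrate the behavior I mean, here is a minimal standalone sketch (simplified stand-in types, not the actual FanOutOneBlockAsyncDFSOutput code) of failing every callback still in the queue once any channel has failed, so lower-seqid futures waiting on other channels are woken up too:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

public class FailAllWaiting {
  // Simplified stand-in for the per-sync callback in the ack queue.
  static final class Callback {
    final CompletableFuture<Long> future = new CompletableFuture<>();
    final long ackedLength;               // seqid this sync is waiting for
    final Set<String> unfinishedReplicas; // channel ids that have not acked yet
    Callback(long ackedLength, String... replicas) {
      this.ackedLength = ackedLength;
      this.unfinishedReplicas = new HashSet<>(Arrays.asList(replicas));
    }
  }

  // Sketch of the proposed behavior: once a channel has failed, fail every
  // callback still in the queue, not only the ones whose unfinishedReplicas
  // contain the broken channel. This wakes lower-seqid futures that are
  // waiting on other channels instead of leaving them hanging.
  static int failAll(Deque<Callback> waitingAckQueue, Throwable error) {
    int failed = 0;
    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
      Callback c = iter.next();
      c.future.completeExceptionally(error);
      iter.remove();
      failed++;
    }
    return failed;
  }

  public static void main(String[] args) {
    Deque<Callback> queue = new ArrayDeque<>();
    queue.add(new Callback(10, "ch-b"));         // lower seqid, waits on another channel
    queue.add(new Callback(20, "ch-a", "ch-b")); // higher seqid, includes the broken channel
    int n = failAll(queue, new java.io.IOException("ch-a broken"));
    System.out.println(n + " " + queue.isEmpty()); // 2 true
  }
}
```

With the current channel-scoped loop above, the first callback (seqid 10, waiting only on ch-b) would be skipped when ch-a fails; in this sketch it is failed as well, so its waiter can start recovery immediately.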
> Maybe we could just check whether the writer instance is still the same in 
> syncCompleted? For normal case, it is impossible that we still want to 
> complete a request for the previous writer? We need to make sure all the 
> outcoming requests are finished before rolling?

I added a simple fix in HBASE-26832 before your reply, but it is not enough. You can create a new issue, or I can assign this one to you.
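Your suggestion to check the writer instance in syncCompleted could look roughly like the following. This is a minimal standalone sketch under my reading of the idea, not the real AsyncFSWAL/AsyncWriter code; all names here are simplified stand-ins:

```java
import java.util.concurrent.atomic.AtomicLong;

public class WriterIdentityCheck {
  static final class Writer {} // stand-in for an AsyncWriter instance

  volatile Writer writer = new Writer();
  final AtomicLong highestSyncedTxid = new AtomicLong();

  // Guard: only act on the completion if it came from the current writer.
  // A callback issued against a pre-roll writer is treated as stale and
  // dropped instead of advancing the synced txid.
  boolean syncCompleted(Writer from, long txid) {
    if (from != this.writer) {
      return false; // the WAL rolled since this sync was issued
    }
    highestSyncedTxid.accumulateAndGet(txid, Math::max);
    return true;
  }

  // Log rolling swaps in a new writer instance; returns the old one.
  Writer roll() {
    Writer old = this.writer;
    this.writer = new Writer();
    return old;
  }

  public static void main(String[] args) {
    WriterIdentityCheck wal = new WriterIdentityCheck();
    Writer stale = wal.roll();
    System.out.println(wal.syncCompleted(stale, 5));      // false: stale writer
    System.out.println(wal.syncCompleted(wal.writer, 7)); // true
    System.out.println(wal.highestSyncedTxid.get());      // 7
  }
}
```

The identity comparison (`from != this.writer`) is the whole check; as you said, it only holds if we make sure all outstanding requests against the old writer are finished (or failed) before rolling.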

Thanks.



> Caught ArrayIndexOutOfBoundsException while processing event RS_LOG_REPLAY
> --------------------------------------------------------------------------
>
>                 Key: HBASE-22761
>                 URL: https://issues.apache.org/jira/browse/HBASE-22761
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.1.1
>            Reporter: casuallc
>            Priority: Major
>         Attachments: tmp
>
>
> RegionServer exits when the error happens
> {code:java}
> 2019-07-29 20:51:09,726 INFO [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] wal.WALSplitter: Processed 0 edits across 0 regions; edits skipped=0; log file=hdfs://cluster1/hbase/WALs/h2,16020,1564216856546-splitting/h2%2C16020%2C1564216856546.1564398538121, length=615233, corrupted=false, progress failed=false
> 2019-07-29 20:51:09,726 INFO [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] handler.WALSplitterHandler: Worker h1,16020,1564404572589 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@577da0d3 in 84892ms. Status = null
> 2019-07-29 20:51:09,726 ERROR [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
> java.lang.ArrayIndexOutOfBoundsException: 16403
> at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
> at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
> at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
> at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
> at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
> at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
> at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)
> at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2019-07-29 20:51:09,730 ERROR [RS_LOG_REPLAY_OPS-regionserver/h1:16020-0] regionserver.HRegionServer: ***** ABORTING region server h1,16020,1564404572589: Caught throwable while processing event RS_LOG_REPLAY *****
> java.lang.ArrayIndexOutOfBoundsException: 16403
> at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
> at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
> at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
> at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
> at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
> at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
> at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
> at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)
> at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
