[
https://issues.apache.org/jira/browse/HBASE-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17374486#comment-17374486
]
Xiaolin Ha commented on HBASE-26036:
------------------------------------
Hi, [~zhangduo], we can see that inside HRegion.getInternal(),
HRegion.get() itself closes the scanner right after scanning, which is
normal behavior:
{code:java}
Scan scan = new Scan(get);
if (scan.getLoadColumnFamiliesOnDemandValue() == null) {
  scan.setLoadColumnFamiliesOnDemand(isLoadingCfsOnDemandDefault());
}
try (RegionScanner scanner = getScanner(scan, null, nonceGroup, nonce)) {
  scanner.next(results);
}
{code}
but it does not account for the outer methods, which have no way of knowing
that the DBBs backing the results have already been released.
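To make the failure mode concrete, here is a minimal sketch using only java.nio, not HBase classes. POOLED, fill() and read() are hypothetical stand-ins for a pooled off-heap block and the cells that point into it; the second fill() plays the role of the pool handing the buffer to another reader once the scanner is closed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class EarlyReleaseDemo {
    /** Hypothetical stand-in for one pooled off-heap buffer; not an HBase class. */
    static final ByteBuffer POOLED = ByteBuffer.allocateDirect(16);

    /** Writes a value into the pooled buffer, as a block read would. */
    static void fill(String value) {
        POOLED.clear();
        POOLED.put(value.getBytes(StandardCharsets.US_ASCII));
        POOLED.flip();
    }

    /** Returns the bytes a view currently sees, without moving its position. */
    static String read(ByteBuffer view) {
        byte[] out = new byte[view.remaining()];
        view.duplicate().get(out);
        return new String(out, StandardCharsets.US_ASCII);
    }

    /** Returns what the caller reads before and after the buffer is reused. */
    static String[] demo() {
        fill("expected");
        // The "result" shares memory with the pooled buffer, like a cell backed by a DBB.
        ByteBuffer result = POOLED.duplicate();
        String before = read(result);
        // Assumption: the scanner has been closed, so the pool reuses the buffer.
        fill("overwrit");
        String after = read(result);
        return new String[] { before, after };
    }

    public static void main(String[] args) {
        String[] r = demo();
        System.out.println(r[0] + " -> " + r[1]);
    }
}
```

The caller still holds a valid-looking reference, but its contents have changed underneath it. With real off-heap memory that has been freed rather than reused, the same read can crash the JVM instead of returning dirty data.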
To fix HRegion.get(), I can think of two ways:
# Add the scanner above to the close callback of the RPC call. But it is a
little strange for the region to get hold of the caller, and there is already
get(Get get, HRegion region, RegionScannersCloseCallBack closeCallBack,
RpcCallContext context) in RSRpcServices. What's more, for operations like
checkAndPut, the DBBs used by the get results should be released as early as
possible; they need not be kept until the end of the RPC call.
# Before returning the results, copy them onto the heap. But this may bring
more heap pressure.
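The two options above can be sketched roughly as follows. This is only an illustration with plain java.nio; PooledScanner, getDeferred() and getCopied() are hypothetical names, not the HBase API, and close() zeroes the buffer to simulate the pool reusing it.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class ReleaseOptionsSketch {
    /** Hypothetical scanner over one pooled off-heap buffer; not an HBase class. */
    static final class PooledScanner implements AutoCloseable {
        private final ByteBuffer buf = ByteBuffer.allocateDirect(8);

        PooledScanner(long value) {
            buf.putLong(0, value);
        }

        /** Returns a view that shares memory with the pooled buffer. */
        ByteBuffer next() {
            return buf.duplicate();
        }

        /** Simulates returning the buffer to the pool by overwriting it. */
        @Override
        public void close() {
            buf.putLong(0, 0L);
        }
    }

    /** Option 1: defer the close to a callback that the caller runs when the call ends. */
    static ByteBuffer getDeferred(long value, List<Runnable> closeCallbacks) {
        PooledScanner scanner = new PooledScanner(value);
        closeCallbacks.add(scanner::close);
        return scanner.next();
    }

    /** Option 2: copy the result onto the heap before the scanner is closed. */
    static ByteBuffer getCopied(long value) {
        try (PooledScanner scanner = new PooledScanner(value)) {
            ByteBuffer src = scanner.next();
            ByteBuffer heap = ByteBuffer.allocate(src.remaining());
            heap.put(src);
            heap.flip();
            return heap;
        }
    }

    public static void main(String[] args) {
        List<Runnable> callbacks = new ArrayList<>();
        ByteBuffer deferred = getDeferred(42L, callbacks);
        System.out.println("deferred, before call end: " + deferred.getLong(0));
        callbacks.forEach(Runnable::run); // end of the simulated RPC call

        ByteBuffer copied = getCopied(42L);
        System.out.println("copied, after scanner close: " + copied.getLong(0));
    }
}
```

Option 1 keeps the off-heap memory pinned until the callbacks run, which is the "held until the end of the RPC call" cost. Option 2 releases it immediately but pays for an extra heap allocation per result.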
Looking through the places where HRegion.get() is used, most of them are for
test purposes, except the ones I changed in the PR.
Do you have any advice for fixing this issue?
Thanks.
> DBB released too early and dirty data for some operations
> ---------------------------------------------------------
>
> Key: HBASE-26036
> URL: https://issues.apache.org/jira/browse/HBASE-26036
> Project: HBase
> Issue Type: Bug
> Components: rpc
> Affects Versions: 3.0.0-alpha-1, 2.0.0
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Critical
>
> Before HBASE-25187, we found regionserver JVM crashes on our production
> clusters. The core dump info is as follows:
> {code:java}
> Stack: [0x00007f621ba8d000,0x00007f621bb8e000], sp=0x00007f621bb8c0e0, free
> space=1020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
> code)
> J 10829 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.getTimestamp()J (9
> bytes) @ 0x00007f6a5ee11b2d [0x00007f6a5ee11ae0+0x4d]
> J 22844 C2
> org.apache.hadoop.hbase.regionserver.HRegion.doCheckAndRowMutate([B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/client/RowMutations;Lorg/apache/hadoop/hbase/client/Mutation;Z)Z
> (540 bytes) @ 0x00007f6a60bed144 [0x00007f6a60beb320+0x1e24]
> J 17972 C2
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkAndRowMutate(Lorg/apache/hadoop/hbase/regionserver/Region;Ljava/util/List;Lorg/apache/hadoop/hbase/CellScanner;[B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionActionResult$Builder;)Z
> (312 bytes) @ 0x00007f6a5f4a7ed0 [0x00007f6a5f4a6f40+0xf90]
> J 26197 C2
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(Lorg/apache/hbase/thirdparty/com/google/protobuf/RpcController;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiRequest;)Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiResponse;
> (644 bytes) @ 0x00007f6a61538b0c [0x00007f6a61537940+0x11cc]
> J 26332 C2
> org.apache.hadoop.hbase.ipc.RpcServer.call(Lorg/apache/hadoop/hbase/ipc/RpcCall;Lorg/apache/hadoop/hbase/monitoring/MonitoredRPCHandler;)Lorg/apache/hadoop/hbase/util/Pair;
> (566 bytes) @ 0x00007f6a615e8228 [0x00007f6a615e79c0+0x868]
> J 20563 C2 org.apache.hadoop.hbase.ipc.CallRunner.run()V (1196 bytes) @
> 0x00007f6a60711a4c [0x00007f6a60711000+0xa4c]
> J 19656% C2
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/atomic/AtomicInteger;)V
> (338 bytes) @ 0x00007f6a6039a414 [0x00007f6a6039a320+0xf4]
> j org.apache.hadoop.hbase.ipc.RpcExecutor$1.run()V+24
> j java.lang.Thread.run()V+11
> v ~StubRoutines::call_stub
> {code}
> I have written a UT that reproduces this error 100% of the time.
> After HBASE-25187, the check result of checkAndMutate will be false,
> because it reads wrong/dirty data from the released ByteBuff.