[jira] [Commented] (HBASE-26036) DBB released too early and dirty data for some operations

Xiaolin Ha (Jira) Sun, 04 Jul 2021 20:54:09 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17374486#comment-17374486
 ]


Xiaolin Ha commented on HBASE-26036:
------------------------------------

Hi, [~zhangduo], we can see that inside HRegion.getInternal(), 
HRegion.get() itself closes the scanner right after scanning, which is normal 
behavior,
{code:java}
Scan scan = new Scan(get);
if (scan.getLoadColumnFamiliesOnDemandValue() == null) {
  scan.setLoadColumnFamiliesOnDemand(isLoadingCfsOnDemandDefault());
}
try (RegionScanner scanner = getScanner(scan, null, nonceGroup, nonce)) {
  scanner.next(results);
}
{code}
but it does not take the outer methods into account: they have no way of 
knowing that the DBBs backing the results have already been released.
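
A minimal sketch of the hazard, using only the JDK. The class, the single shared direct buffer, and the helper names are hypothetical stand-ins for illustration, not HBase code: a "result" that is only a view over a pooled buffer sees whatever the next user of that buffer writes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class EarlyReleaseDemo {
  /** Hypothetical stand-in for one pooled direct buffer (a DBB); not an HBase class. */
  static final ByteBuffer POOLED = ByteBuffer.allocateDirect(16);

  /** Writes a value into the pooled buffer and returns a view that shares its memory. */
  static ByteBuffer readIntoPool(String value) {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    POOLED.clear();
    POOLED.put(bytes);
    POOLED.flip();
    return POOLED.slice(); // no copy: backed by the same off-heap memory
  }

  /** Decodes the bytes currently visible through the view. */
  static String decode(ByteBuffer view) {
    byte[] out = new byte[view.remaining()];
    view.duplicate().get(out);
    return new String(out, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    ByteBuffer result = readIntoPool("v1"); // the "get" result handed to the caller
    System.out.println(decode(result));     // prints "v1"
    // The scanner is closed here, so the buffer returns to the pool and is reused.
    readIntoPool("XX");
    System.out.println(decode(result));     // prints "XX": the caller reads dirty data
  }
}
```

In this sketch the stale read only yields wrong bytes; with real off-heap memory that has been freed or recycled, the same pattern could plausibly lead to the crash shown in the issue description.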

To fix HRegion.get(), I can think of two ways:
 # Add the scanner above to the close callback of the RPC call. But it is a 
little strange for the region to look up the caller, and there is already 
get(Get get, HRegion region, RegionScannersCloseCallBack closeCallBack, 
RpcCallContext context) in RSRpcServices. What's more, for operations like 
checkAndPut, the DBBs used by the get results should be released as early as 
possible; they need not be kept until the end of the RPC call.
 # Copy the results onto the heap before returning them. But this may bring 
more heap pressure.
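
The two options can be sketched with plain JDK types. Everything here (PooledBuf, CallContext, copyToHeap and the method names) is a hypothetical stand-in written for this comment, not the HBase API; it only shows where the release happens relative to the read.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class ReleaseOptionsSketch {
  /** Hypothetical pooled off-heap buffer; release() simulates reuse by another request. */
  static final class PooledBuf {
    final ByteBuffer buf = ByteBuffer.allocateDirect(Long.BYTES);
    boolean released = false;

    void release() {
      released = true;
      for (int i = 0; i < buf.capacity(); i++) {
        buf.put(i, (byte) 0x7f); // overwrite, as a later request would
      }
    }
  }

  /** Hypothetical stand-in for an RPC call context that runs callbacks on close. */
  static final class CallContext {
    private final List<Runnable> callbacks = new ArrayList<>();

    void onClose(Runnable callback) { callbacks.add(callback); }

    void close() {
      for (Runnable r : callbacks) r.run();
      callbacks.clear();
    }
  }

  /** Copies length bytes starting at offset from src onto the heap. */
  static byte[] copyToHeap(ByteBuffer src, int offset, int length) {
    byte[] out = new byte[length];
    for (int i = 0; i < length; i++) out[i] = src.get(offset + i);
    return out;
  }

  /** Option 1: defer the release to the end of the call; reads in between are safe. */
  static long readWithDeferredRelease(long value) {
    PooledBuf pooled = new PooledBuf();
    pooled.buf.putLong(0, value);
    CallContext ctx = new CallContext();
    ctx.onClose(pooled::release);
    long seen = pooled.buf.getLong(0); // still valid: release has not run yet
    ctx.close();                       // buffer is held until here
    return seen;
  }

  /** Option 2: copy first, release immediately, then read from the heap copy. */
  static long readFromHeapCopy(long value) {
    PooledBuf pooled = new PooledBuf();
    pooled.buf.putLong(0, value);
    byte[] heap = copyToHeap(pooled.buf, 0, Long.BYTES);
    pooled.release();                  // early release is now harmless
    return ByteBuffer.wrap(heap).getLong();
  }

  public static void main(String[] args) {
    System.out.println(readWithDeferredRelease(42L)); // prints 42
    System.out.println(readFromHeapCopy(42L));        // prints 42
  }
}
```

The sketch makes the trade-off visible: option 1 keeps the off-heap buffer pinned until the call context closes, while option 2 frees it right away at the cost of an extra heap allocation per result.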

Looking through the places where HRegion.get() is used, most of them are for 
test purposes, except the ones I changed in the PR.

Do you have any advice on fixing this issue?

Thanks.

 

> DBB released too early and dirty data for some operations
> ---------------------------------------------------------
>
>                 Key: HBASE-26036
>                 URL: https://issues.apache.org/jira/browse/HBASE-26036
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 3.0.0-alpha-1, 2.0.0
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Critical
>
> Before HBASE-25187, we saw regionserver JVM crashes on our production 
> clusters. The core dump information is as follows:
> {code:java}
> Stack: [0x00007f621ba8d000,0x00007f621bb8e000],  sp=0x00007f621bb8c0e0,  free 
> space=1020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> J 10829 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.getTimestamp()J (9 
> bytes) @ 0x00007f6a5ee11b2d [0x00007f6a5ee11ae0+0x4d]
> J 22844 C2 
> org.apache.hadoop.hbase.regionserver.HRegion.doCheckAndRowMutate([B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/client/RowMutations;Lorg/apache/hadoop/hbase/client/Mutation;Z)Z
>  (540 bytes) @ 0x00007f6a60bed144 [0x00007f6a60beb320+0x1e24]
> J 17972 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkAndRowMutate(Lorg/apache/hadoop/hbase/regionserver/Region;Ljava/util/List;Lorg/apache/hadoop/hbase/CellScanner;[B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionActionResult$Builder;)Z
>  (312 bytes) @ 0x00007f6a5f4a7ed0 [0x00007f6a5f4a6f40+0xf90]
> J 26197 C2 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(Lorg/apache/hbase/thirdparty/com/google/protobuf/RpcController;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiRequest;)Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiResponse;
>  (644 bytes) @ 0x00007f6a61538b0c [0x00007f6a61537940+0x11cc]
> J 26332 C2 
> org.apache.hadoop.hbase.ipc.RpcServer.call(Lorg/apache/hadoop/hbase/ipc/RpcCall;Lorg/apache/hadoop/hbase/monitoring/MonitoredRPCHandler;)Lorg/apache/hadoop/hbase/util/Pair;
>  (566 bytes) @ 0x00007f6a615e8228 [0x00007f6a615e79c0+0x868]
> J 20563 C2 org.apache.hadoop.hbase.ipc.CallRunner.run()V (1196 bytes) @ 
> 0x00007f6a60711a4c [0x00007f6a60711000+0xa4c]
> J 19656% C2 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/atomic/AtomicInteger;)V
>  (338 bytes) @ 0x00007f6a6039a414 [0x00007f6a6039a320+0xf4]
> j  org.apache.hadoop.hbase.ipc.RpcExecutor$1.run()V+24
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> {code}
> I have written a UT that reproduces this error 100% of the time.
> After HBASE-25187, the check result of checkAndMutate will be false, 
> because it reads wrong/dirty data from the released ByteBuff.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
