[ https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847185#comment-16847185 ]
Zheng Hu commented on HBASE-22422:
----------------------------------

I understand now: it's a concurrency bug in RAMCache. Say thread1 tries to getBlock as follows:

Step.1: get block1 from RAMCache#delegate;
Step.2: call block1#retain to increase its refCnt.

Meanwhile, another thread2 has flushed the block into the IOEngine and starts to clear the block from RAMCache:

Step.a: get block1 via RAMCache#delegate.remove;
Step.b: call block1#release to decrease its refCnt.

If the steps above are interleaved as follows:

Step.1: get block1 from RAMCache#delegate;
Step.a: get block1 via RAMCache#delegate.remove;
Step.b: call block1#release to decrease its refCnt; here the refCnt decreases from 1 to 0;
Step.2: call block1#retain to increase its refCnt;

then the bug occurs: thread1 retains a block whose refCnt has already dropped to 0. One way to fix this is to make getAndRetain and removeAndRelease atomic.

> Retain an ByteBuff with refCnt=0 when getBlock from LRUCache
> ------------------------------------------------------------
>
>                 Key: HBASE-22422
>                 URL: https://issues.apache.org/jira/browse/HBASE-22422
>             Project: HBase
>          Issue Type: Sub-task
>          Components: BlockCache
>            Reporter: Zheng Hu
>            Assignee: Zheng Hu
>            Priority: Major
>         Attachments: 0001-debug2.patch, 0001-debug2.patch, 0001-debug2.patch,
> 0001-debug3.patch, 0001-debug4.patch, HBASE-22422.HBASE-21879.v01.patch,
> LRUBlockCache-getBlock.png, debug.patch,
> failed-to-check-positive-on-web-ui.png, image-2019-05-15-12-00-03-641.png
>
>
> After running the YCSB scan/get benchmark in our XiaoMi cluster, we found the get
> QPS dropped from 25000/s to hundreds per second in a cluster with five
> nodes.
> After enabling the debug log at the YCSB client side, I found the following
> stack trace, see
> https://issues.apache.org/jira/secure/attachment/12968745/image-2019-05-15-12-00-03-641.png.
>
> After looking into the stack trace, I can confirm that the zero-refCnt block is
> an intermediate index block, see [2] http://hbase.apache.org/images/hfilev2.png
> Need a patch to fix this.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
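
For illustration, the fix suggested in the comment (making getAndRetain and removeAndRelease atomic) could look roughly like the sketch below. This is not the HBase code: Block, RamCacheSketch, the String key and all method names are hypothetical stand-ins, and the sketch assumes the cache is backed by a ConcurrentHashMap, whose computeIfPresent runs its function atomically for a given key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a reference-counted cache block.
final class Block {
  // A new block starts with one reference, held by the cache.
  private final AtomicInteger refCnt = new AtomicInteger(1);

  // Fails if the block was already fully released (the symptom in this issue).
  void retain() {
    refCnt.getAndUpdate(c -> {
      if (c <= 0) throw new IllegalStateException("retain with refCnt=" + c);
      return c + 1;
    });
  }

  void release() {
    refCnt.getAndUpdate(c -> {
      if (c <= 0) throw new IllegalStateException("release with refCnt=" + c);
      return c - 1;
    });
  }

  int refCnt() {
    return refCnt.get();
  }
}

// Hypothetical stand-in for RAMCache; only the two racing operations are shown.
final class RamCacheSketch {
  private final ConcurrentHashMap<String, Block> delegate = new ConcurrentHashMap<>();

  void put(String key, Block block) {
    delegate.put(key, block);
  }

  // Step.1 + Step.2 as one atomic unit: the lookup and the retain cannot be
  // separated by a concurrent removal of the same key.
  Block getAndRetain(String key) {
    return delegate.computeIfPresent(key, (k, block) -> {
      block.retain();
      return block; // keep the mapping
    });
  }

  // Step.a + Step.b as one atomic unit: returning null drops the mapping,
  // and the release happens inside the same atomic computation.
  void removeAndRelease(String key) {
    delegate.computeIfPresent(key, (k, block) -> {
      block.release();
      return null; // remove the mapping
    });
  }
}
```

With this shape, a reader either retains the block before the removal runs (so the later release only takes refCnt from 2 to 1), or it finds no mapping and gets null. The interleaving Step.1, Step.a, Step.b, Step.2 is no longer possible for the same key.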