[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846109#comment-16846109
 ] 

Andrew Purtell commented on HBASE-22457:
----------------------------------------

[~lhofhansl] 

bq. Are there any "safepoints"? I.e. points where we know what the reference 
count should be, so we can reset it? An obvious one is when there are no 
scanners running at all; in that case we could reset all refCounts to 0.

Interesting idea, though I'm debating whether it's a bit of a hack. In a 
refcounting system the number of resource acquisitions should equal the number 
of releases, or else it's a bug. Would hiding this class of bugs by twiddling 
the counter behind the scenes encourage sloppy code? 
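To make the invariant concrete, here is a minimal sketch of balanced 
acquire/release. RefCountedReader and ScanExample are hypothetical stand-ins 
for illustration, not the actual HBase reader classes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a ref-counted HFile reader; not an HBase class.
final class RefCountedReader {
    private final AtomicInteger refCount = new AtomicInteger();

    void acquire() {
        refCount.incrementAndGet();
    }

    void release() {
        // A release without a matching acquire is a bug, so fail loudly
        // instead of silently letting the counter go negative.
        if (refCount.decrementAndGet() < 0) {
            refCount.incrementAndGet(); // restore the counter before failing
            throw new IllegalStateException("release without matching acquire");
        }
    }

    int getRefCount() {
        return refCount.get();
    }
}

final class ScanExample {
    // Holds the reader for the duration of the scan. The finally block
    // guarantees the release runs even when the scan throws.
    static int scan(RefCountedReader reader, boolean fail) {
        reader.acquire();
        try {
            if (fail) {
                throw new RuntimeException("simulated scan failure");
            }
            return reader.getRefCount();
        } finally {
            reader.release();
        }
    }
}
```

With this shape every code path, including the exceptional one, leaves the 
count where it started, so a non-zero count at rest always points at a bug.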

If we are considering hacks, a decent mitigation for leaks would be a fast 
close and reopen of the region on the same server, initiated by the RS. This 
would release all resources: the refcount, leases, etc. Clients should ride 
over this gracefully, like any other region transition. If the refcount 
exceeds some ridiculous threshold, this mitigation could be triggered along 
with a fat WARN in the logs. This is orthogonal to hardening, though, so if we 
want to pursue the idea I could open a subtask.
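A rough sketch of the threshold check, just to show the shape of the idea. 
RegionReopener and RefCountLeakGuard are hypothetical names, the threshold is 
an arbitrary parameter, and nothing here is an existing HBase API.

```java
import java.util.logging.Logger;

// Hypothetical callback for the close-and-reopen action; not an HBase interface.
interface RegionReopener {
    void closeAndReopen(String regionName);
}

// Hypothetical guard that a periodic RS chore could call for each region.
final class RefCountLeakGuard {
    private static final Logger LOG =
        Logger.getLogger(RefCountLeakGuard.class.getName());

    private final int threshold;
    private final RegionReopener reopener;

    RefCountLeakGuard(int threshold, RegionReopener reopener) {
        this.threshold = threshold;
        this.reopener = reopener;
    }

    // Returns true if the mitigation was triggered for this region.
    boolean check(String regionName, int maxReaderRefCount) {
        if (maxReaderRefCount <= threshold) {
            return false;
        }
        LOG.warning("Reader refCount " + maxReaderRefCount + " exceeds threshold "
            + threshold + " for region " + regionName
            + "; likely a leak, closing and reopening the region");
        reopener.closeAndReopen(regionName);
        return true;
    }
}
```

The guard only decides and logs. How the reopen is actually performed is left 
behind the callback, since that is the part that would need real design.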

> Harden the HBase HFile reader reference counting
> ------------------------------------------------
>
>                 Key: HBASE-22457
>                 URL: https://issues.apache.org/jira/browse/HBASE-22457
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Lars Hofhansl
>            Priority: Major
>
> The problem is that any coprocessor hook that replaces a passed scanner 
> without closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference count will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issues: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, such as a path where not all exceptions are 
> caught and a scanner is left open that way. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]
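
For illustration, the hook problem described in the quoted issue can be 
reduced to the sketch below. All types here (SimpleScanner, TrackedScanner, 
WrappingScanner, Hooks) are hypothetical and much simpler than the real 
coprocessor hooks and scanners. The shared counter stands in for the reader 
refCount.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical minimal scanner; close() is narrowed to throw no checked exception.
interface SimpleScanner extends Closeable {
    @Override
    void close();
}

// Holds one reference on the shared counter from construction until close().
final class TrackedScanner implements SimpleScanner {
    private final AtomicInteger openCount;
    private boolean closed;

    TrackedScanner(AtomicInteger openCount) {
        this.openCount = openCount;
        openCount.incrementAndGet();
    }

    @Override
    public void close() {
        if (!closed) { // idempotent: a second close must not decrement again
            closed = true;
            openCount.decrementAndGet();
        }
    }
}

// Wrapper that forwards close() to the scanner it was given.
final class WrappingScanner implements SimpleScanner {
    private final SimpleScanner delegate;

    WrappingScanner(SimpleScanner delegate) {
        this.delegate = delegate;
    }

    @Override
    public void close() {
        delegate.close();
    }
}

final class Hooks {
    // Leaky: returns a replacement and drops the passed scanner unclosed.
    static SimpleScanner leakyHook(SimpleScanner passed, AtomicInteger openCount) {
        return new TrackedScanner(openCount);
    }

    // Safe: the returned scanner closes the passed one when it is closed.
    static SimpleScanner safeHook(SimpleScanner passed) {
        return new WrappingScanner(passed);
    }
}
```

After the caller closes the scanner returned by leakyHook, the counter is 
still 1 because the passed scanner was never closed. With safeHook it returns 
to 0.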



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
