[ https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990932#comment-16990932 ]
Viraj Jasani commented on HBASE-22457:
--------------------------------------

{quote}
# detect run-away refCount by comparing any reader's refCount with the actual number of open scanners (which we track for each HRegion); if the refCount is larger we know we have a problem.
# (a variation) when we attempt to archive an HFile that has refCount, check if there are any open scanners; if not, archive anyway.

For #1 at least we could enhance the logging and include the number of currently open scanners in the log (where we say that we cannot archive an HFile)
{quote}

For #1, since open scanners are tracked at the HRegion level (RegionScannerImpl) and not at the Store level, we might not be able to compare refCount with the number of open scanners?
Also, #2 might not hold either, since open scanners are counted at the region level (and may be on non-compacted store files) and not at the store level?

I was wondering whether we could also track the number of open scanners at the store level.
I was just going through the comments here while looking into HBASE-23349 (refCount 1 preventing archival of compacted store files).

> Harden the HBase HFile reader reference counting
> ------------------------------------------------
>
>                 Key: HBASE-22457
>                 URL: https://issues.apache.org/jira/browse/HBASE-22457
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Lars Hofhansl
>            Priority: Major
>         Attachments: 22457-random-1.5.txt
>
>
> The problem is that any coprocessor hook that replaces a passed scanner without
> closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad
> consequences, since an incorrect reference count will prevent HFiles from
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where
> this is happening.
> And for those I filed issues: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with
> refCount > 1000.
> Perhaps there are other issues, perhaps a path where not all exceptions are
> caught and a scanner is left open that way. (Generally I am not a fan of
> reference counting in complex systems - it's too easy to miss something. But
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
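
To make the failure mode discussed above concrete, here is a minimal, self-contained sketch. It does not use any HBase classes: `Reader`, `Scanner`, `wrappingHook`, `leakyHook` and `leakedRefs` are all hypothetical names invented for illustration, and the "open scanner" count is simply passed in as a number rather than tracked per region or per store. It only models the idea that a hook which replaces the passed scanner without closing it leaves the reader's refCount stuck above zero, and that comparing refCount with the number of open scanners would expose the discrepancy.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountSketch {

  /** Hypothetical stand-in for a reference-counted HFile reader (not an HBase class). */
  public static final class Reader {
    private final AtomicInteger refCount = new AtomicInteger();

    public int refCount() { return refCount.get(); }

    /** Opening a scanner takes one reference on the reader. */
    public Scanner openScanner() {
      refCount.incrementAndGet();
      AtomicInteger released = new AtomicInteger();
      // Release the reference exactly once, even if close() is called twice.
      return () -> {
        if (released.compareAndSet(0, 1)) {
          refCount.decrementAndGet();
        }
      };
    }

    /** In this model a file may only be archived when nothing references it. */
    public boolean canArchive() { return refCount.get() == 0; }
  }

  /** Hypothetical scanner: only close() matters for this sketch. */
  public interface Scanner {
    void close();
  }

  /** Well-behaved hook: the returned scanner closes the scanner it was passed. */
  public static Scanner wrappingHook(Scanner passed) {
    return passed::close;
  }

  /** Misbehaving hook: replaces the passed scanner and never closes it. */
  public static Scanner leakyHook(Scanner passed) {
    return () -> { /* the passed scanner is dropped here, its reference is never released */ };
  }

  /** Proposal #1 in miniature: references that no open scanner accounts for. */
  public static int leakedRefs(Reader reader, int openScanners) {
    return Math.max(0, reader.refCount() - openScanners);
  }

  public static void main(String[] args) {
    Reader reader = new Reader();

    wrappingHook(reader.openScanner()).close();
    System.out.println("after wrapping hook: refCount=" + reader.refCount());

    leakyHook(reader.openScanner()).close();
    System.out.println("after leaky hook:    refCount=" + reader.refCount()
        + ", leaked=" + leakedRefs(reader, 0)
        + ", canArchive=" + reader.canArchive());
  }
}
```

In this toy model the count of open scanners is known exactly. The open question raised in the comment is whether a real store has a comparable number available, given that scanners are tracked per region.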