[
https://issues.apache.org/jira/browse/HBASE-15398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182990#comment-15182990
]
Phil Yang commented on HBASE-15398:
-----------------------------------
I checked the Javadoc comment on isFamilyEssential:
{code}
/**
 * Check that given column family is essential for filter to check row. Most filters always return
 * true here. But some could have more sophisticated logic which could significantly reduce
 * scanning process by not even touching columns until we are 100% sure that it's data is needed
 * in result.
 *
 * Concrete implementers can signal a failure condition in their code by throwing an
 * {@link IOException}.
 *
 * @throws IOException in case an I/O or an filter specific failure needs to be signaled.
 */
abstract public boolean isFamilyEssential(byte[] name) throws IOException;
{code}
and in HRegion
{code}
if (this.filter == null || !scan.doLoadColumnFamiliesOnDemand()
    || this.filter.isFamilyEssential(entry.getKey())) {
  scanners.add(scanner);
} else {
  joinedScanners.add(scanner);
}
{code}
So if we want the family to be filtered, isFamilyEssential returns true, right?
And if there is a cf for which it returns false, we will have two heaps in
RegionScanner, so we should make checkEssentialFilter return true?
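To make the question concrete, here is a minimal sketch of how the quoted HRegion condition splits families between the two heaps. It is not the HBase code: the class EssentialFamilySketch, the Heaps holder and the use of plain strings and a Predicate in place of scanners and a Filter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class EssentialFamilySketch {
    /** Hypothetical holder for the two groups of families. */
    static final class Heaps {
        final List<String> storeHeap = new ArrayList<>();
        final List<String> joinedHeap = new ArrayList<>();
    }

    /**
     * Splits families the way the quoted condition does: a family goes to the
     * main heap when there is no filter, when on-demand loading is off, or
     * when the filter reports the family as essential. Otherwise it goes to
     * the joined heap.
     */
    static Heaps partition(List<String> families,
                           Predicate<String> isFamilyEssential,
                           boolean loadOnDemand) {
        Heaps heaps = new Heaps();
        for (String family : families) {
            if (isFamilyEssential == null || !loadOnDemand
                    || isFamilyEssential.test(family)) {
                heaps.storeHeap.add(family);
            } else {
                heaps.joinedHeap.add(family);
            }
        }
        return heaps;
    }

    public static void main(String[] args) {
        // Assumed example: the filter only needs cf2 to decide on a row.
        Heaps heaps = partition(Arrays.asList("cf1", "cf2"), "cf2"::equals, true);
        System.out.println("storeHeap=" + heaps.storeHeap
                + " joinedHeap=" + heaps.joinedHeap);
    }
}
```

In this sketch a family for which the predicate returns false ends up in joinedHeap, which is the two-heap situation the question refers to.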
> Cells loss or disorder when using family essential filter and partial
> scanning protocol
> ---------------------------------------------------------------------------------------
>
> Key: HBASE-15398
> URL: https://issues.apache.org/jira/browse/HBASE-15398
> Project: HBase
> Issue Type: Bug
> Components: dataloss, Scanners
> Affects Versions: 1.2.0, 1.1.3
> Reporter: Phil Yang
> Assignee: Phil Yang
> Priority: Critical
> Attachments: 15398-test.txt, HBASE-15398.v1.txt
>
>
> In RegionScannerImpl, we have two heaps, storeHeap and joinedHeap. If we have
> a filter and it doesn't apply to all cfs, the stores whose families needn't be
> filtered will be in joinedHeap. We scan storeHeap first, then joinedHeap,
> merge and sort the results, and return them to the client. We need to sort
> because the order of Cells is rowkey/cf/cq/ts and a smaller cf may be in the
> joinedHeap.
> However, after HBASE-11544 we may transfer partial results when we get
> SIZE_LIMIT_REACHED_MID_ROW or other similar states. We may return a larger cf
> first because it is in storeHeap and then a smaller cf because it is in
> joinedHeap. The server won't hold all cells of a row and the client doesn't
> have any sorting logic, so the order of cfs in the Result the user sees is
> wrong.
> A more critical bug: if we get a LIMIT_REACHED_MID_ROW on the last cell of a
> row in storeHeap, we break out of scanning in RegionScannerImpl, and in
> populateResult we change the state to SIZE_LIMIT_REACHED because the next
> peeked cell belongs to the next row. But this is only the last cell of one
> heap, and we have two. SIZE_LIMIT_REACHED means this Result is not partial
> (per ScannerContext.partialResultFormed), so the client will merge what it
> has and return it to the user without the data from joinedHeap. On the next
> scan we will read the next row of storeHeap; joinedHeap is forgotten and
> never read.
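The ordering problem described above can be reproduced with a small model. This is a sketch under stated assumptions, not the RegionScannerImpl logic: cells are plain strings of the form row/cf/cq, the size limit is a cell count, and the class PartialOrderSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartialOrderSketch {
    /**
     * Emits the cells of one row in chunks of at most `limit` cells, reading
     * storeHeap first and joinedHeap second, with no sort over the whole row.
     * The returned list is what a client sees if it simply appends each
     * partial result as it arrives.
     */
    static List<String> scanRowInPartials(List<String> storeHeap,
                                          List<String> joinedHeap,
                                          int limit) {
        List<String> serverOrder = new ArrayList<>(storeHeap);
        serverOrder.addAll(joinedHeap);
        List<String> seenByClient = new ArrayList<>();
        for (int i = 0; i < serverOrder.size(); i += limit) {
            int end = Math.min(i + limit, serverOrder.size());
            seenByClient.addAll(serverOrder.subList(i, end)); // one partial result
        }
        return seenByClient;
    }

    /** True if the cells are in non-decreasing lexicographic order. */
    static boolean isSorted(List<String> cells) {
        for (int i = 1; i < cells.size(); i++) {
            if (cells.get(i - 1).compareTo(cells.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed layout: the larger cf2 is essential (storeHeap), the
        // smaller cf1 is not (joinedHeap).
        List<String> result = scanRowInPartials(
                Arrays.asList("row1/cf2/q1"), Arrays.asList("row1/cf1/q1"), 1);
        System.out.println(result + " sorted=" + isSorted(result));
    }
}
```

With a limit of one cell, the cf2 cell reaches the client before the cf1 cell, so the assembled row is out of order unless the client sorts it.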