[
https://issues.apache.org/jira/browse/HBASE-14269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706835#comment-14706835
]
hongbin ma commented on HBASE-14269:
------------------------------------
Hi,
When I compared the performance of different versions of FuzzyRowFilter, I was
surprised to find that they exhibit no significant performance difference at
all. The three versions that I compared are:
1. HBASE-13641: this is the last change before HBASE-13761's optimization
2. HBASE-13761: this is where the RowTracker optimization is introduced
3. HBASE-14269: this is where we fixed the bug in HBASE-13761
I modified the synthetic data settings in TestFuzzyRowFilterEndToEnd to make
the data more evenly distributed (see my latest patch). Here are the results
of running testEndToEnd on the three versions of FuzzyRowFilter:
|| ||HBASE-13641||HBASE-13761||HBASE-14269||
| runTest1's first run|138ms|130ms|133ms|
| runTest1's second run|62ms|65ms|63ms|
| runTest2's first run|183ms|94ms before assertion error|194ms|
| runTest2's second run|53ms|N/A skipped because error|53ms|
All three versions succeed in runTest1; however, HBASE-13761 fails in runTest2
because of the bug reported in this issue (it returns an incomplete result
set). As we can see, there is no significant difference between them.
HBASE-13761 reports that its optimization boosted performance a lot; however,
we suspect it may only have appeared fast because of the incomplete result set.
There is nothing magic about the fuzzy filter. When it works, it follows the
pattern:
(get row)(get hint)...(get row)(get hint)
The RowTracker optimization in HBASE-13761 merely optimizes the "get hint"
part, and it remains doubtful whether the "get hint" part is the real
bottleneck.
If my minicluster-based benchmark is not convincing enough, please point out
why, and show us another reproducible benchmark. For now my conclusion is:
HBASE-13761 did not optimize FuzzyRowFilter that much, so we should consider
either reverting FuzzyRowFilter to the original version or simply merging the
patch in this issue. (There is no significant difference in terms of
performance.)
> FuzzyRowFilter omits certain rows when multiple fuzzy key exist
> ---------------------------------------------------------------
>
> Key: HBASE-14269
> URL: https://issues.apache.org/jira/browse/HBASE-14269
> Project: HBase
> Issue Type: Bug
> Components: Filters
> Reporter: hongbin ma
> Assignee: hongbin ma
> Attachments: HBASE-14269-v1.patch, HBASE-14269.patch
>
>
> https://issues.apache.org/jira/browse/HBASE-13761 introduced a RowTracker in
> FuzzyRowFilter to avoid performing getNextForFuzzyRule() for each fuzzy key
> on each getNextCellHint() by maintaining a list of possible row matches for
> each fuzzy key. The implementation assumes that the prepared rows will be
> matched one by one, so it removes the first row in the list as soon as it is
> used. However, this approach may lead to omitting rows in some cases:
> Consider a case where we have two fuzzy keys:
> 1?1
> 2?2
> and the data is like:
> 000
> 111
> 112
> 121
> 122
> 211
> 212
> When the first row 000 fails to match, RowTracker will update the possible
> row matches with cell 000 and fuzzy keys 1?1,2?2. This will populate
> RowTracker with 101 and 202. Then 101 is popped out of RowTracker, hinting the
> scanner to go to row 101. The scanner will get 111 and find that it is a
> match, then continue and find that 112 is not a match, so getNextCellHint
> will be called again. Then comes the bug: row 101 has already been removed
> from RowTracker, so RowTracker will jump to 202. As you can see, row 121 will
> be omitted, but it is actually a match for fuzzy key 1?1.
> I will illustrate the bug by adding a new test case in
> TestFuzzyRowFilterEndToEnd. I will also provide the bug fix in my patch. The
> idea of the new solution is to maintain a priority queue of all the possible
> matching rows for each fuzzy key. Whenever getNextCellHint is called, the
> elements in the queue that are smaller than the parameter currentCell are
> updated (and re-inserted into the queue). The head of the queue is always the
> "next cell hint".
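
The example in the description can be replayed with a small standalone Java
sketch. This is not the HBase code or the attached patch: the class and method
names (FuzzyHintSketch, popOnceHints, priorityQueueHints, nextMatch) are
hypothetical, rows are restricted to three-digit strings, nextMatch is a
brute-force stand-in for getNextForFuzzyRule(), and the pop-once tracker is
assumed to refill itself only when it runs empty. Under those assumptions it
contrasts the two hint strategies on the data listed above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Function;

class FuzzyHintSketch {
    // Fuzzy keys from the issue description; '?' matches any digit.
    static final List<String> KEYS = Arrays.asList("1?1", "2?2");

    static boolean matches(String row, String key) {
        for (int i = 0; i < key.length(); i++) {
            if (key.charAt(i) != '?' && key.charAt(i) != row.charAt(i)) return false;
        }
        return true;
    }

    static boolean matchesAny(String row) {
        for (String key : KEYS) if (matches(row, key)) return true;
        return false;
    }

    // Brute-force stand-in for getNextForFuzzyRule(): smallest three-digit
    // row >= from that matches key, or null if there is none.
    static String nextMatch(String from, String key) {
        for (int v = Integer.parseInt(from); v <= 999; v++) {
            String candidate = String.format("%03d", v);
            if (matches(candidate, key)) return candidate;
        }
        return null;
    }

    // Simplified model of the pop-once behaviour: prepared rows are removed
    // as soon as they are used. ASSUMPTION: the list is refilled only when empty.
    static Function<String, String> popOnceHints() {
        Deque<String> prepared = new ArrayDeque<>();
        return row -> {
            if (prepared.isEmpty()) {
                List<String> candidates = new ArrayList<>();
                for (String key : KEYS) {
                    String next = nextMatch(row, key);
                    if (next != null) candidates.add(next);
                }
                Collections.sort(candidates);
                prepared.addAll(candidates);
            }
            return prepared.pollFirst(); // null when no candidate is left
        };
    }

    // Priority-queue idea: entries are {candidate row, fuzzy key}. Entries
    // smaller than the current row are recomputed and re-inserted, so the
    // head of the queue is always the next hint.
    static Function<String, String> priorityQueueHints() {
        PriorityQueue<String[]> queue = new PriorityQueue<>((a, b) -> a[0].compareTo(b[0]));
        for (String key : KEYS) {
            String next = nextMatch("000", key);
            if (next != null) queue.add(new String[] {next, key});
        }
        return row -> {
            while (!queue.isEmpty() && queue.peek()[0].compareTo(row) < 0) {
                String[] stale = queue.poll();
                String next = nextMatch(row, stale[1]);
                if (next != null) queue.add(new String[] {next, stale[1]});
            }
            return queue.isEmpty() ? null : queue.peek()[0];
        };
    }

    // The (get row)(get hint) loop: return matching rows, seek on a miss.
    static List<String> scan(List<String> sortedRows, Function<String, String> hints) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < sortedRows.size()) {
            String row = sortedRows.get(i);
            if (matchesAny(row)) { result.add(row); i++; continue; }
            String hint = hints.apply(row);
            if (hint == null) break; // no further match is possible
            i++; // always make progress, then skip rows below the hint
            while (i < sortedRows.size() && sortedRows.get(i).compareTo(hint) < 0) i++;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("000", "111", "112", "121", "122", "211", "212");
        System.out.println(scan(rows, popOnceHints()));       // [111, 212]
        System.out.println(scan(rows, priorityQueueHints())); // [111, 121, 212]
    }
}
```

In this model the pop-once variant skips row 121 exactly as described: after
101 has been consumed, the next hint is 202. The priority-queue variant
recomputes the stale 101 entry to 121 before answering.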