[ 
https://issues.apache.org/jira/browse/HBASE-14269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712582#comment-14712582
 ] 

hongbin ma commented on HBASE-14269:
------------------------------------


As [~vrodionov] suggested, I ran another test with 
TestFuzzyRowFilterEndToEnd:

50 fuzzy keys:

||  ||Pre HBASE-13761||HBASE-14269||
|runTest1's first run|180ms|204ms|
|runTest1's second run|82ms|62ms|
|runTest2's first run|209ms|230ms|
|runTest2's second run|92ms|109ms|

100 fuzzy keys:

||  ||Pre HBASE-13761||HBASE-14269||
|runTest1's first run|183ms|177ms|
|runTest1's second run|82ms|56ms|
|runTest2's first run|218ms|214ms|
|runTest2's second run|98ms|107ms|

500 fuzzy keys:

||  ||Pre HBASE-13761||HBASE-14269||
|runTest1's first run|184ms|192ms|
|runTest1's second run|72ms|61ms|
|runTest2's first run|260ms|226ms|
|runTest2's second run|127ms|101ms|

Unfortunately, I don't think the post HBASE-13761 optimizations have boosted 
performance very much.
I am not in a position to profile a very large dataset. [~vrodionov], will 
you please share your numbers?

Despite the bad news, the bug in HBASE-13761 still needs to be fixed.
The new approach does not appear to degrade performance compared with 
pre HBASE-13761, and as the number of fuzzy keys increases, HBASE-14269 tends 
to be faster than pre HBASE-13761.
So I think it is okay to commit this patch.

Suggestions for FuzzyRowFilter users:
FuzzyRowFilter works well when you have a handful of fuzzy filters. When the 
number of fuzzy filters grows out of control (in Apache Kylin we witnessed user 
queries that used more than 100000 fuzzy filters), it will normally bring more 
performance issues than benefits.

> FuzzyRowFilter omits certain rows when multiple fuzzy key exist
> ---------------------------------------------------------------
>
>                 Key: HBASE-14269
>                 URL: https://issues.apache.org/jira/browse/HBASE-14269
>             Project: HBase
>          Issue Type: Bug
>          Components: Filters
>            Reporter: hongbin ma
>            Assignee: hongbin ma
>             Fix For: 2.0.0, 1.2.0, 1.3.0, 0.98.15, 1.0.3, 1.1.3
>
>         Attachments: HBASE-14269-v1.patch, HBASE-14269-v2.patch, 
> HBASE-14269.patch
>
>
> https://issues.apache.org/jira/browse/HBASE-13761 introduced a RowTracker in 
> FuzzyRowFilter to avoid performing getNextForFuzzyRule() for each fuzzy key 
> on each getNextCellHint() by maintaining a list of possible row matches for 
> each fuzzy key. The implementation assumes that the prepared rows will be 
> matched one by one, so it removes the first row in the list as soon as it is 
> used. However, this approach may lead to omitting rows in some cases:
> Consider a case where we have two fuzzy keys:
> 1?1
> 2?2
> and the data is like:
> 000
> 111
> 112
> 121
> 122
> 211
> 212
> When the first row 000 fails to match, RowTracker will update the possible row 
> matches with cell 000 and fuzzy keys 1?1,2?2. This will populate RowTracker 
> with 101 and 202. Then 101 is popped out of RowTracker, hinting the scanner to 
> go to row 101. The scanner will get 111 and find it is a match, then continue 
> and find that 112 is not a match, so getNextCellHint will be called again. Then 
> comes the bug: row 101 has been removed from RowTracker, so RowTracker will 
> jump to 202. As you can see, row 121 will be omitted, but it is actually a match 
> for fuzzy key 1?1.
> I will illustrate the bug by adding a new test case in 
> TestFuzzyRowFilterEndToEnd. Also I will provide the bug fix in my patch. The 
> idea of the new solution is to maintain a priority queue for all the possible 
> match rows for each fuzzy key, and whenever getNextCellHint is called, the 
> elements in the queue that are smaller than the parameter currentCell will be 
> updated (and re-inserted into the queue). The head of the queue will always be the 
> "Next cell hint".
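
The priority-queue idea in the description above can be sketched in a toy model. This is not the code in the attached patch: rows are modelled as fixed-length digit strings, '?' in a fuzzy key matches any digit, and the names FuzzyHintSketch, Candidate, nextMatchAtOrAfter, nextHint and scan are hypothetical. nextMatchAtOrAfter is a brute-force stand-in for getNextForFuzzyRule, chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model: rows are fixed-length digit strings, '?' in a key matches any digit. */
class FuzzyHintSketch {

    /** A fuzzy key paired with the smallest row it can still match. */
    static final class Candidate implements Comparable<Candidate> {
        final String key;
        final String nextRow;
        Candidate(String key, String nextRow) { this.key = key; this.nextRow = nextRow; }
        @Override public int compareTo(Candidate o) { return nextRow.compareTo(o.nextRow); }
    }

    static boolean matches(String row, String key) {
        if (row.length() != key.length()) return false;
        for (int i = 0; i < key.length(); i++) {
            if (key.charAt(i) != '?' && key.charAt(i) != row.charAt(i)) return false;
        }
        return true;
    }

    /** Brute-force stand-in (assumption): smallest matching row >= from, or null. */
    static String nextMatchAtOrAfter(String from, String key) {
        int width = key.length();
        long limit = (long) Math.pow(10, width);
        for (long v = Long.parseLong(from); v < limit; v++) {
            String row = String.format("%0" + width + "d", v);
            if (matches(row, key)) return row;
        }
        return null;
    }

    private final PriorityQueue<Candidate> queue = new PriorityQueue<>();

    FuzzyHintSketch(List<String> keys) {
        for (String key : keys) {
            // Seed with the smallest possible match: every wildcard set to '0'.
            queue.add(new Candidate(key, key.replace('?', '0')));
        }
    }

    /** Refreshes candidates smaller than currentRow, then returns the queue head (or null). */
    String nextHint(String currentRow) {
        while (!queue.isEmpty() && queue.peek().nextRow.compareTo(currentRow) < 0) {
            Candidate stale = queue.poll();
            String refreshed = nextMatchAtOrAfter(currentRow, stale.key);
            // A key with no further possible match is dropped instead of re-inserted.
            if (refreshed != null) queue.add(new Candidate(stale.key, refreshed));
        }
        return queue.isEmpty() ? null : queue.peek().nextRow;
    }

    /** Simulates a scan over sorted rows, seeking forward to the hint after each miss. */
    static List<String> scan(List<String> sortedRows, List<String> keys) {
        FuzzyHintSketch tracker = new FuzzyHintSketch(keys);
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < sortedRows.size()) {
            String row = sortedRows.get(i);
            boolean hit = false;
            for (String key : keys) hit |= matches(row, key);
            if (hit) { hits.add(row); i++; continue; }
            String hint = tracker.nextHint(row);
            if (hint == null) break;
            while (i < sortedRows.size() && sortedRows.get(i).compareTo(hint) < 0) i++;
        }
        return hits;
    }
}
```

With the rows 000, 111, 112, 121, 122, 211, 212 and the fuzzy keys 1?1 and 2?2 from the description, FuzzyHintSketch.scan returns 111, 121 and 212 in this model. After the miss on 112, the stale candidate 101 is refreshed to 121 rather than discarded, so row 121 is not skipped.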



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
