[ 
https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887537#action_12887537
 ] 

Nicolas Spiegelberg commented on HBASE-2794:
--------------------------------------------

IRC conversation about this...

 krispyjala: nspiegelberg: but is the test you want related to HBASE-2794 or 
just using bloom filter in general (e.g. when to use it and when not to)?
 [1:41pm] nspiegelberg: it's related to 2794
 [1:42pm] nspiegelberg: an easy example of why you need good measurements is 
the case of calling bloom.contains() for 100 row+col in a 1% false positive 
bloom.  You are getting almost 100% false positives then, so the bloom is an 
obvious perf drop
 [1:43pm] krispyjala: nspiegelberg: ok i think i understand
 [1:44pm] krispyjala: nspiegelberg: but wait 100% false positive?
 [1:46pm] nspiegelberg: right, so io.hfile.bloom.error.rate == .01 by default.  
so 1%
 [1:46pm] krispyjala: ok
 [1:46pm] krispyjala: how does that add up to 100% for 100 lookups?
 [1:46pm] nspiegelberg: therefore, if you call bloom.contains() 5 times and OR 
the result, the false positive rate is 5%
 [1:49pm] nspiegelberg: krispyjala: so a simple example. call bloom.contains() 
10 times = 10% error rate = (10ms/seek * 10%) + time(bloom.contains)
 [1:50pm] krispyjala: nspiegelberg: but is it really OR'ing all of them? In the 
code if even one column lookup returns true we return true and don't look up 
any other columns
 [1:51pm] nspiegelberg: right, that's the same thing as ORing them
 [1:51pm] nspiegelberg: logical OR =>  ||
 [1:52pm] krispyjala: nspiegelberg: but the point is we're probably not looking 
up 100 columns every time for that operation, even theoretically yes we do a 
logical OR
 [1:52pm] krispyjala: if we hit true on the 5th column, we quit the loop and 
return right away
 [1:53pm] nspiegelberg: the only way you win with blooms is if all 
bloom.contains() return false and you don't have to do the lookup
 [1:53pm] krispyjala: yes
 [1:53pm] nspiegelberg: so, you're right, we do an average of 50 lookups per 
false positive in this case.
 [1:54pm] nspiegelberg: I'm just saying, what is the cost of those 50 lookups?  
If 1ms, then every HFile seek costs 11ms with blooms enabled versus 10 ms 
without using them
 [1:55pm] krispyjala: but wait i thought the code was to determine whether to 
add the StoreScanner to the list or not...or are you saying then that the point 
is in the case of 100 columns we should just not even bother doing bloom 
multicolumn check because perhaps it's better to just load it than wasting time 
with the 100 lookups (potentially)
 [1:55pm] nspiegelberg: exactly
 [1:55pm] krispyjala: nspiegelberg: lol ok got it
 [1:56pm] krispyjala: but realistically, who does gets on 100 columns? I don't 
know the HBase internals well yet (that's why i picked the noob ticket 
lol)...wouldn't it be better to just do a get on the row?
 [1:57pm] nspiegelberg: never under-estimate the naivete of users
 [1:57pm] krispyjala: nspiegelberg: sigh lol, i guess that's why the bloom is 
off by default?
 [1:58pm] nspiegelberg: yes
 [1:58pm] nspiegelberg: so, it's obvious that you never want to run bloom code 
with 101 columns + 1% error rate
 [1:58pm] krispyjala: correct
 [1:59pm] nspiegelberg: so, really it's just timing testBloomPerf with various 
lookup counts on various size blooms
 [2:00pm] krispyjala: nspiegelberg: this talk has helped me think about how to 
test like you said
 [2:00pm] • St^Ack hopes the above good-stuff(tm) 'lesson' makes it back into 
the issue....
 [2:00pm] nspiegelberg: looks like ryan didn't give you any concrete numbers, 
so you might have to just start with some assumptions (like, don't use blooms 
if avg key < 1KB) and run with that
 [2:01pm] krispyjala: nspiegelberg: and perhaps once we kind of know where the 
tradeoff is, would it be wrong to limit in the code saying if there are more 
than say 10 column lookups might as well just return true?
 [2:01pm] krispyjala: cuz it's not worth looking up in bloom at that point
 [2:01pm] nspiegelberg: I think that's exactly what we need to do
 [2:01pm] krispyjala: whatever the threshold is
 [2:02pm] nspiegelberg: if we pretend that the cost of bloom.contains() == 0, 
then maybe we want to say if (column.count * error.rate > 10%) return true;
 [2:02pm] dj_ryan: well it's hard to say where the tradeoff goes
 [2:02pm] krispyjala: pastebin? lol jk
 [2:02pm] dj_ryan: but the hard number is 9 bits/item
 [2:03pm] dj_ryan: you can then calculate how much ram you are spending on 
blooms
 [2:03pm] dj_ryan: and decide if its worth it
 [2:03pm] nspiegelberg: the hard # for 1% error rate blooms is 9 bits/item
 [2:03pm] dj_ryan: we never implemented blooms because it seemed 12gb of ram 
would be better off caching
 [2:03pm] krispyjala: dj_ryan: so your suggestion the onus is on the user and 
not hbase code
 [2:03pm] nspiegelberg: with .1% error rate, it's ~12 bits/item
 [2:04pm] krispyjala: or should we allow customizations of the limits? then we 
don't need to come up with the "recommended" threshold
 [2:05pm] dj_ryan: well
 [2:05pm] dj_ryan: it is up to the user
 [2:05pm] nspiegelberg: I think the onus for figuring out whether to use blooms 
or not is on the user, but we should still have a 'this is too stupid' early 
exit
 [2:05pm] dj_ryan: i mean maybe we could put a lot of metrics to detect when a 
bloom filter might be useful
 [2:05pm] dj_ryan: but im not sure that's worth it 
 [2:06pm] krispyjala: dj_ryan: yes, but right now they can either just turn it 
on or off, and with my patch they will be forced to look up all the columns if 
they have more than one
 [2:07pm] nspiegelberg: krispyjala: I think just running testBloomPerf on a 
couple different sizes will give you a goo timing measurement.  Unless my 
initial thoughts are off, you can probably just get away with saying: 
(column.count * error.rate > 10%) return true;
 [2:07pm] krispyjala: nspiegelberg: yeah I agree we should have the early exit 
strat
 [2:08pm] krispyjala: nspiegelberg: ok i will do some testing this evening on it
 [2:08pm] nspiegelberg: then, when somebody asks why you chose 10%, you can say 
that it obviously makes sense when below 10% and they should run some numbers 
for you if they want to pump it up 
 

> ROWCOL bloom filter not used if multiple columns within same family are 
> requested in a Get
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2794
>                 URL: https://issues.apache.org/jira/browse/HBASE-2794
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>
> Noticed the following snippet in StoreFile.java:Scanner:shouldSeek():
> {code}
>         switch(bloomFilterType) {
>           case ROW:
>             key = row;
>             break;
>           case ROWCOL:
>             if (columns.size() == 1) {
>               byte[] col = columns.first();
>               key = Bytes.add(row, col);
>               break;
>             }
>             //$FALL-THROUGH$
>           default:
>             return true;
>         }
> {code}
> If columns.size > 1, then we currently don't take advantage of the bloom 
> filter.  We should optimize this to check bloom for each of columns and if 
> none of the columns are present in the bloom avoid opening the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to