[ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887537#action_12887537 ]
Nicolas Spiegelberg commented on HBASE-2794: -------------------------------------------- IRC conversation about this... krispyjala: nspiegelberg: but is the test you want related to HBASE-2794 or just using bloom filter in general (e.g. when to use it and when not to)? [1:41pm] nspiegelberg: it's related to 2794 [1:42pm] nspiegelberg: an easy example of why you need good measurements is the case of calling bloom.contains() for 100 row+col in a 1% false positive bloom. You are getting almost 100% false positives then, so the bloom is an obvious perf drop [1:43pm] krispyjala: nspiegelberg: ok i think i understand [1:44pm] krispyjala: nspiegelberg: but wait 100% false positive? [1:46pm] nspiegelberg: right, so io.hfile.bloom.error.rate == .01 by default. so 1% [1:46pm] krispyjala: ok [1:46pm] krispyjala: how does that add up to 100% for 100 lookups? [1:46pm] nspiegelberg: therefore, if you call bloom.contains() 5 times and OR the result, the false positive rate is 5% [1:49pm] nspiegelberg: krispyjala: so a simple example. call bloom.contains() 10 times = 10% error rate = (10ms/seek * 10%) + time(bloom.contains) [1:50pm] krispyjala: nspiegelberg: but is it really OR'ing all of them? In the code if even one column lookup returns true we return true and don't look up any other columns [1:51pm] nspiegelberg: right, that's the same thing as ORing them [1:51pm] nspiegelberg: logical OR => || [1:52pm] krispyjala: nspiegelberg: but the point is we're probably not looking up 100 columns every time for that operation, even theoretically yes we do a logical OR [1:52pm] krispyjala: if we hit true on the 5th column, we quit the loop and return right away [1:53pm] nspiegelberg: the only way you win with blooms is if all bloom.contains() return false and you don't have to do the lookup [1:53pm] krispyjala: yes [1:53pm] nspiegelberg: so, you're right, we do an average of 50 lookups per false positive in this case. [1:54pm] nspiegelberg: I'm just saying, what is the cost of those 50 lookups? If 1ms, then every HFile seek costs 11ms with blooms enabled versus 10 ms without using them [1:55pm] krispyjala: but wait i thought the code was to determine whether to add the StoreScanner to the list or not...or are you saying then that the point is in the case of 100 columns we should just not even bother doing bloom multicolumn check because perhaps it's better to just load it than wasting time with the 100 lookups (potentially) [1:55pm] nspiegelberg: exactly [1:55pm] krispyjala: nspiegelberg: lol ok got it [1:56pm] krispyjala: but realistically, who does gets on 100 columns? I don't know the HBase internals well yet (that's why i picked the noob ticket lol)...wouldn't it be better to just do a get on the row? [1:57pm] nspiegelberg: never under-estimate the naivete of users [1:57pm] krispyjala: nspiegelberg: sigh lol, i guess that's why the bloom is off by default? [1:58pm] nspiegelberg: yes [1:58pm] nspiegelberg: so, it's obvious that you never want to run bloom code with 101 columns + 1% error rate [1:58pm] krispyjala: correct [1:59pm] nspiegelberg: so, really it's just timing testBloomPerf with various lookup counts on various size blooms [2:00pm] krispyjala: nspiegelberg: this talk has helped me think about how to test like you said [2:00pm] • St^Ack hopes the above good-stuff(tm) 'lesson' makes it back into the issue.... [2:00pm] nspiegelberg: looks like ryan didn't give you any concrete numbers, so you might have to just start with some assumptions (like, don't use blooms if avg key < 1KB) and run with that [2:01pm] krispyjala: nspiegelberg: and perhaps once we kind of know where the tradeoff is, would it be wrong to limit in the code saying if there are more than say 10 column lookups might as well just return true? [2:01pm] krispyjala: cuz it's not worth looking up in bloom at that point [2:01pm] nspiegelberg: I think that's exactly what we need to do [2:01pm] krispyjala: whatever the threshold is [2:02pm] nspiegelberg: if we pretend that the cost of bloom.contains() == 0, then maybe we want to say if (column.count * error.rate > 10%) return true; [2:02pm] dj_ryan: well it's hard to say where the tradeoff goes [2:02pm] krispyjala: pastebin? lol jk [2:02pm] dj_ryan: but the hard number is 9 bits/item [2:03pm] dj_ryan: you can then calculate how much ram you are spending on blooms [2:03pm] dj_ryan: and decide if its worth it [2:03pm] nspiegelberg: the hard # for 1% error rate blooms is 9 bits/item [2:03pm] dj_ryan: we never implemented blooms because it seemed 12gb of ram would be better off caching [2:03pm] krispyjala: dj_ryan: so your suggestion the onus is on the user and not hbase code [2:03pm] nspiegelberg: with .1% error rate, it's ~12 bits/item [2:04pm] krispyjala: or should we allow customizations of the limits? then we don't need to come up with the "recommended" threshold [2:05pm] dj_ryan: well [2:05pm] dj_ryan: it is up to the user [2:05pm] nspiegelberg: I think the onus for figuring out whether to use blooms or not is on the user, but we should still have a 'this is too stupid' early exit [2:05pm] dj_ryan: i mean maybe we could put a lot of metrics to detect when a bloom filter might be useful [2:05pm] dj_ryan: but im not sure that's worth it [2:06pm] krispyjala: dj_ryan: yes, but right now they can either just turn it on or off, and with my patch they will be forced to look up all the columns if they have more than one [2:07pm] nspiegelberg: krispyjala: I think just running testBloomPerf on a couple different sizes will give you a goo timing measurement. Unless my initial thoughts are off, you can probably just get away with saying: (column.count * error.rate > 10%) return true; [2:07pm] krispyjala: nspiegelberg: yeah I agree we should have the early exit strat [2:08pm] krispyjala: nspiegelberg: ok i will do some testing this evening on it [2:08pm] nspiegelberg: then, when somebody asks why you chose 10%, you can say that it obviously makes sense when below 10% and they should run some numbers for you if they want to pump it up > ROWCOL bloom filter not used if multiple columns within same family are > requested in a Get > ------------------------------------------------------------------------------------------ > > Key: HBASE-2794 > URL: https://issues.apache.org/jira/browse/HBASE-2794 > Project: HBase > Issue Type: Improvement > Reporter: Kannan Muthukkaruppan > > Noticed the following snippet in StoreFile.java:Scanner:shouldSeek(): > {code} > switch(bloomFilterType) { > case ROW: > key = row; > break; > case ROWCOL: > if (columns.size() == 1) { > byte[] col = columns.first(); > key = Bytes.add(row, col); > break; > } > //$FALL-THROUGH$ > default: > return true; > } > {code} > If columns.size > 1, then we currently don't take advantage of the bloom > filter. We should optimize this to check bloom for each of columns and if > none of the columns are present in the bloom avoid opening the file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.