[jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get

HBase Review Board (JIRA) Mon, 12 Jul 2010 12:08:18 -0700

    [ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887472#action_12887472 ]

HBase Review Board commented on HBASE-2794:
-------------------------------------------

Message from: "Nicolas" <[email protected]>

bq.  On 2010-07-12 10:17:25, Nicolas wrote:
bq.  > /trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java, line 857
bq.  > <http://review.hbase.org/r/296/diff/1/?file=2378#file2378line857>
bq.  >
bq.  >     have you done any tests to see when the number of bloom checks takes 
significant time compared to just getting the block?  For example, if you have 
100 columns to lookup, do bloom filters really buy you anything, or shouldn't 
you just switch to a Row-level bloom anyways?  Also, with a default 1% error 
rate, you're looking at ~100% false positive with 100 columns.  Maybe 
max.columns = sqrt(1/error.rate)
bq.  
bq.  Kris Jirapinyo wrote:
bq.      I have not, but would running on just the test data be sufficient to 
tell the true savings since the tests just run on mock data?  I don't really 
have a dev cluster with real data that I can test this on, so perhaps you or 
someone could help out in that regard.

BTW: Thanks for the work.  I don't think running on test data would be 
sufficient, because you want to compare the cost of probing a large bloom 
filter (random access, i.e. L1 cache misses) with the cost of getting an 
HFile block from disk (with an OS block cache miss).  If you can't set up a 
large cluster, one strategy might be to use 10ms as the disk seek baseline and 
use testBloomPerf() in TestByteBloomFilter.java to estimate BloomFilter 
latency.  Ryan Rawson did some tests on using blooms with small KV entries.  
He might be able to give you some numbers on when blooms do not take up too 
much memory (hopefully, some number like KV.length > 1KB).  You can then use 
the fact that HFiles are ~64MB to estimate a good entry sample size (I just 
picked 10M entries in the current testBloomPerf() out of thin air as a big 
number).  Sounds a little complicated at first, but this strategy would 
probably take less time [and be more interesting] than trying to set up a 
genuine cluster.
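
As a rough illustration of the estimate described above, here is a minimal 
Java sketch.  It assumes the bloom checks for different columns are 
independent, takes the 10ms seek baseline from this message, and uses a 
made-up per-check latency of 1 microsecond as a stand-in for whatever 
testBloomPerf() actually reports.  The class and method names are 
hypothetical and are not part of HBase.

```java
// Sketch only: BloomBreakEven and its methods are hypothetical helpers, not HBase code.
class BloomBreakEven {

    /** Probability that at least one of n independent bloom checks is a false positive. */
    static double anyFalsePositive(double errorRate, int columns) {
        return 1.0 - Math.pow(1.0 - errorRate, columns);
    }

    /** Number of bloom checks whose combined cost equals one disk seek. */
    static long breakEvenChecks(double seekNanos, double checkNanos) {
        return (long) Math.floor(seekNanos / checkNanos);
    }

    public static void main(String[] args) {
        double errorRate = 0.01;          // default 1% error rate mentioned in the review
        double seekNanos = 10_000_000.0;  // 10ms disk seek baseline from this message
        double checkNanos = 1_000.0;      // ASSUMPTION: placeholder, measure with testBloomPerf()

        for (int columns : new int[] {1, 10, 100}) {
            System.out.printf("columns=%d  P(any false positive)=%.3f%n",
                    columns, anyFalsePositive(errorRate, columns));
        }
        System.out.println("break-even checks per seek: "
                + breakEvenChecks(seekNanos, checkNanos));
    }
}
```

Under the independence assumption, 100 columns at a 1% error rate give a 
combined false positive probability of roughly 63%, which is still high 
enough that the bloom would rarely let you skip the file.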

- Nicolas

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/296/#review350
-----------------------------------------------------------

> ROWCOL bloom filter not used if multiple columns within same family are 
> requested in a Get
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2794
>                 URL: https://issues.apache.org/jira/browse/HBASE-2794
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>         Attachments: 2794_multi_column_check.txt
>
>
> Noticed the following snippet in StoreFile.java:Scanner:shouldSeek():
> {code}
>         switch(bloomFilterType) {
>           case ROW:
>             key = row;
>             break;
>           case ROWCOL:
>             if (columns.size() == 1) {
>               byte[] col = columns.first();
>               key = Bytes.add(row, col);
>               break;
>             }
>             //$FALL-THROUGH$
>           default:
>             return true;
>         }
> {code}
> If columns.size() > 1, we currently don't take advantage of the bloom 
> filter.  We should optimize this to check the bloom for each of the columns 
> and, if none of the columns is present in the bloom, avoid opening the file.
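
The proposed behaviour could look roughly like the sketch below.  It is not 
the attached patch: the bloom is modelled as a plain Predicate<byte[]>, 
concat() stands in for Bytes.add(), and the class name and the handling of 
an empty column list are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Sketch only: MultiColumnBloomCheck is a hypothetical stand-in for the ROWCOL
// branch of shouldSeek(); it does not use the real HBase bloom filter classes.
class MultiColumnBloomCheck {

    /** Builds the row+column bloom key (stand-in for Bytes.add(row, col)). */
    static byte[] concat(byte[] row, byte[] col) {
        byte[] key = Arrays.copyOf(row, row.length + col.length);
        System.arraycopy(col, 0, key, row.length, col.length);
        return key;
    }

    /**
     * Returns false only when the bloom rules out every requested column,
     * i.e. when the file can safely be skipped.
     */
    static boolean shouldSeek(byte[] row, List<byte[]> columns,
                              Predicate<byte[]> mightContain) {
        if (columns.isEmpty()) {
            return true; // ASSUMPTION: no column restriction, so the file cannot be ruled out
        }
        for (byte[] col : columns) {
            if (mightContain.test(concat(row, col))) {
                return true; // at least one column may be present: open the file
            }
        }
        return false; // bloom says none of the columns is present
    }
}
```

The loop returns on the first possible hit, so the cost is at most one bloom 
check per requested column.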

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
