[ https://issues.apache.org/jira/browse/HBASE-7885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lars Hofhansl updated HBASE-7885:
---------------------------------
Fix Version/s: 0.94.7 (was: 0.94.6)
Moving into 0.94.7 (just a few weeks out, really)
Personally, I do not know much about bloom filters and would like to know why
the estimated error rate is incorrect.
> bloom filter compaction is too aggressive for Hfile which only contains small
> count of records
> ----------------------------------------------------------------------------------------------
>
> Key: HBASE-7885
> URL: https://issues.apache.org/jira/browse/HBASE-7885
> Project: HBase
> Issue Type: Bug
> Components: Performance, Scanners
> Affects Versions: 0.94.5
> Reporter: clockfly
> Assignee: clockfly
> Priority: Minor
> Fix For: 0.94.7
>
> Attachments: hbase-7885.patch, hbase_bloom_shrink_fix.patch
>
>
> For HFile V2, the bloom filter starts with an initial size of 128KB.
> When not many records have been inserted into the bloom filter, the
> bloom filter starts to shrink itself (compaction).
> For example, from 128K it will compact to 64K
> ->32K->16K->8K->4K->2K->1K->512->256->128->64->32, as long as it thinks that
> it can stay bounded by the estimated error rate.
> If we put only a few records in the HFile, the bloom filter will be
> compacted to too small a size, which breaks the assumption that shrinking
> stays bounded by the estimated error rate. The false positive rate
> becomes unacceptably high.
> For example, if we set the expected error rate to 0.00001, then for 10
> records, after compaction, the size of the bloom filter will be 64 bytes.
> The real effective false positive rate will be 50%.
> The use case is like this: if we are using HBase to store big records like
> images and binaries, each record will take megabytes. A 128M file
> will then contain only dozens of records.
> The suggested fix is to set a lower limit for the bloom filter compaction
> process. I suggest using 1000 bytes.
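
The halving sequence and the proposed lower limit described above can be sketched roughly as follows. This is only an illustration, not the HBase implementation: `fold_size`, `still_within_estimate`, and `min_bytes` are hypothetical names, and the check that decides whether the filter "thinks" it is still within the estimated error rate is passed in as a stand-in predicate rather than reproduced from HBase.

```python
def fold_size(initial_bytes, still_within_estimate, min_bytes=0):
    """Halve a bloom filter size until a check fails or a floor is reached.

    still_within_estimate(size) is a hypothetical stand-in for whatever test
    the filter uses to decide that a smaller size is still acceptable.
    min_bytes is the proposed lower limit (0 means no limit).
    """
    size = initial_bytes
    while size % 2 == 0:
        half = size // 2
        # Stop if halving would drop below the floor or fail the check.
        if half < min_bytes or not still_within_estimate(half):
            break
        size = half
    return size


# Stand-in for the 10-record example in the description, where the
# filter is said to end up at 64 bytes.
ten_records = lambda size: size >= 64

no_floor = fold_size(128 * 1024, ten_records)
with_floor = fold_size(128 * 1024, ten_records, min_bytes=1000)
print(no_floor, with_floor)
```

Because the sizes are powers of two starting from 128KB, a 1000-byte limit in this sketch stops the shrinking at 1024 bytes instead of 64.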