[
https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581115#action_12581115
]
Andrzej Bialecki commented on HADOOP-3063:
-------------------------------------------
Re: replicating the source file - I agree, I just wasn't sure how to solve this.
My experience with these classes is that they are not well-debugged, so often
bug fixing is necessary. The original creator was an EU IST project, so it's
unlikely we could expect any maintenance from that project. Therefore I don't
think that packaging them as a third-party jar is a good option.
I think it would be best to import these classes into
org.apache.hadoop.util.bloom package, and keep the comment about the original
authors in the javadoc, as we do it now, and subsequently remove these classes
form HBase.
> BloomMapFile - fail-fast version of MapFile for sparsely populated key space
> ----------------------------------------------------------------------------
>
> Key: HADOOP-3063
> URL: https://issues.apache.org/jira/browse/HADOOP-3063
> Project: Hadoop Core
> Issue Type: Improvement
> Components: io
> Affects Versions: 0.17.0
> Reporter: Andrzej Bialecki
> Fix For: 0.17.0
>
> Attachments: bloommap.patch
>
>
> The need for this improvement arose when working with large ancillary
> MapFile-s (essentially used as external dictionaries). For each invokation of
> map() / reduce() it was necessary to perform several look-ups in these
> MapFile-s, and in case of sparsely populated key-space the cost of finding
> that a key is absent was too high.
> This patch implements a subclass of MapFile that creates a Bloom filter from
> all keys, so that accurate tests for absence of keys can be performed quickly
> and with 100% accuracy.
> Writer.append() operations update a DynamicBloomFilter, which is then
> serialized when the Writer is closed. This filter is loaded in memory when a
> Reader is created. Reader.get() operation first checks the filter for the key
> membership, and if the key is absent it immediately returns null without
> doing any further IO.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.