BloomMapFile - fail-fast version of MapFile for sparsely populated key space
----------------------------------------------------------------------------

                 Key: HADOOP-3063
                 URL: https://issues.apache.org/jira/browse/HADOOP-3063
             Project: Hadoop Core
          Issue Type: Improvement
          Components: io
    Affects Versions: 0.17.0
            Reporter: Andrzej Bialecki 
             Fix For: 0.17.0


The need for this improvement arose when working with large ancillary MapFile-s 
(essentially used as external dictionaries). For each invokation of map() / 
reduce() it was necessary to perform several look-ups in these MapFile-s, and 
in case of sparsely populated key-space the cost of finding that a key is 
absent was too high.

This patch implements a subclass of MapFile that creates a Bloom filter from 
all keys, so that accurate tests for absence of keys can be performed quickly 
and with 100% accuracy.

Writer.append() operations update a DynamicBloomFilter, which is then 
serialized when the Writer is closed. This filter is loaded in memory when a 
Reader is created. Reader.get() operation first checks the filter for the key 
membership, and if the key is absent it immediately returns null without doing 
any further IO.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to