BloomMapFile - fail-fast version of MapFile for sparsely populated key space
----------------------------------------------------------------------------
Key: HADOOP-3063
URL: https://issues.apache.org/jira/browse/HADOOP-3063
Project: Hadoop Core
Issue Type: Improvement
Components: io
Affects Versions: 0.17.0
Reporter: Andrzej Bialecki
Fix For: 0.17.0
The need for this improvement arose when working with large ancillary MapFile-s
(essentially used as external dictionaries). For each invokation of map() /
reduce() it was necessary to perform several look-ups in these MapFile-s, and
in case of sparsely populated key-space the cost of finding that a key is
absent was too high.
This patch implements a subclass of MapFile that creates a Bloom filter from
all keys, so that accurate tests for absence of keys can be performed quickly
and with 100% accuracy.
Writer.append() operations update a DynamicBloomFilter, which is then
serialized when the Writer is closed. This filter is loaded in memory when a
Reader is created. Reader.get() operation first checks the filter for the key
membership, and if the key is absent it immediately returns null without doing
any further IO.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.