[ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Chansler updated HADOOP-3063: ------------------------------------ Fix Version/s: (was: 0.17.0) > BloomMapFile - fail-fast version of MapFile for sparsely populated key space > ---------------------------------------------------------------------------- > > Key: HADOOP-3063 > URL: https://issues.apache.org/jira/browse/HADOOP-3063 > Project: Hadoop Core > Issue Type: Improvement > Components: io > Affects Versions: 0.17.0 > Reporter: Andrzej Bialecki > Attachments: bloommap.patch > > > The need for this improvement arose when working with large ancillary > MapFile-s (essentially used as external dictionaries). For each invokation of > map() / reduce() it was necessary to perform several look-ups in these > MapFile-s, and in case of sparsely populated key-space the cost of finding > that a key is absent was too high. > This patch implements a subclass of MapFile that creates a Bloom filter from > all keys, so that accurate tests for absence of keys can be performed quickly > and with 100% accuracy. > Writer.append() operations update a DynamicBloomFilter, which is then > serialized when the Writer is closed. This filter is loaded in memory when a > Reader is created. Reader.get() operation first checks the filter for the key > membership, and if the key is absent it immediately returns null without > doing any further IO. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.