[ 
https://issues.apache.org/jira/browse/PIG-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-2328:
----------------------------

    Release Note: 
Bloom filters are a common way to select a limited set of records before moving 
data for a join or other heavyweight operation.  For example, suppose one wants 
to join a very large data set L with a smaller set S, and it is known that only 
a small number of keys in L will match S.  Building a Bloom filter on S and 
applying it to L before the join can greatly reduce the number of records from 
L that have to be moved from the map to the reduce, thus speeding up the join.

{code}
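-- Build a Bloom filter over the join key of the small relation.
define bb BuildBloom('128', '3', 'jenkins');
small = load 'S' as (x, y, z);
grpd  = group small all;
fltrd = foreach grpd generate bb(small.x);
store fltrd into 'mybloom';
-- Run the statements above so that 'mybloom' exists before Bloom reads it.
exec;
-- Drop rows of the large relation whose key cannot be in S, then join.
define bloom Bloom('mybloom');
large = load 'L' as (a, b, c);
flarge = filter large by bloom(a);
joined = join small by x, flarge by a;
store joined into 'results';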
{code}
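
To illustrate why the pre-filter does not change the join result, here is a 
minimal sketch of the same idea in Python.  It is a toy, not Pig's or Hadoop's 
implementation: the ToyBloom class, the sample keys, and the use of salted 
SHA-256 in place of the 'jenkins' or 'murmur' hashes are all assumptions made 
for the example.

```python
import hashlib


class ToyBloom:
    """Minimal Bloom filter: a bit array plus k salted hash functions."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [False] * n_bits

    def _positions(self, key):
        # Derive one bit position per hash by salting SHA-256 with the index.
        # (Assumption: SHA-256 stands in for the jenkins/murmur hashes.)
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical data: S has three keys, L has a thousand rows.
small_keys = [3, 17, 42]
large_rows = [(k, f"row{k}") for k in range(1000)]

bloom = ToyBloom(n_bits=128, n_hashes=3)
for k in small_keys:
    bloom.add(k)

# Pre-filter L, then "join" only the rows that survived the filter.
filtered = [row for row in large_rows if bloom.might_contain(row[0])]
key_set = set(small_keys)
joined = [row for row in filtered if row[0] in key_set]
```

Because a Bloom filter never reports a key it has seen as absent, every 
matching row of L survives the filter.  Any false positives that slip through 
are discarded by the join itself.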

When constructing BuildBloom, the three arguments passed are the number of bits 
in the Bloom filter, the number of hash functions used in constructing the 
Bloom filter, and the type of hash function to use.  Valid values for the hash 
function type are 'jenkins' and 'murmur'.  See 
http://en.wikipedia.org/wiki/Bloom_filter for a discussion of how to select the 
number of bits and the number of hash functions.
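
As a rough guide to choosing the first two arguments, the standard sizing 
formulas from the article linked above can be sketched in Python.  The function 
name and the example figures are illustrative assumptions and are not part of 
Pig; n_keys is the expected number of distinct keys in S and 
false_positive_rate is the fraction of non-matching keys one is willing to let 
through.

```python
import math


def bloom_parameters(n_keys, false_positive_rate):
    """Return (bits, hashes) for a Bloom filter using the standard formulas.

    bits   m = -n * ln(p) / (ln 2)^2
    hashes k = (m / n) * ln 2
    """
    if n_keys <= 0 or not 0.0 < false_positive_rate < 1.0:
        raise ValueError("need n_keys > 0 and 0 < false_positive_rate < 1")
    bits = math.ceil(
        -n_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hashes = max(1, round(bits / n_keys * math.log(2)))
    return bits, hashes


# Hypothetical example: 1,000 distinct keys in S, 1% false positives allowed.
bits, hashes = bloom_parameters(1000, 0.01)
print(bits, hashes)
```

The two results would then be passed to BuildBloom as strings, in the argument 
order described above.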

This uses Hadoop's bloom filters (org.apache.hadoop.util.bloom.BloomFilter) 
internally.
    
> Add builtin UDFs for building and using bloom filters
> -----------------------------------------------------
>
>                 Key: PIG-2328
>                 URL: https://issues.apache.org/jira/browse/PIG-2328
>             Project: Pig
>          Issue Type: New Feature
>          Components: internal-udfs
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: 0.10
>
>         Attachments: PIG-bloom.patch
>
>
> Bloom filters are a common way to select a limited set of records before 
> moving data for a join or other heavyweight operation.  Pig should add UDFs 
> to support building and using Bloom filters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
