[ 
https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5255:
------------------------------------
    Attachment: PIG-5255-4.patch

> Improvements to bloom join
> --------------------------
>
>                 Key: PIG-5255
>                 URL: https://issues.apache.org/jira/browse/PIG-5255
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>             Fix For: 0.18.0
>
>         Attachments: PIG-5255-4.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
> 3) Write own bloom implementation for Murmur3 and Murmur3 with Kirsch & 
> Mitzenmacher optimization which Cassandra uses 
> (http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html). 
> Currently we use Hadoop's bloomfilter implementation which only has Jenkins 
> and Murmur2. Murmur3 is faster and offers better distribution.
> 4) Move from BitSet to RoaringBitMap for
>   - Speed and better compression
>   - Scale
>   Currently bloom join does not scale for billions of keys. Really need large 
> bloom filters in those cases and cost of broadcasting those is greater than 
> actual data size. For eg: Join of 32B records (4TB of data) with 4 billion 
> records with keys being mostly unique. Lets say we construct  61 partitioned 
> bloom filters of 3MB each (still not good enough bit vector size for the 
> amount of keys) it is close to 200MB. If we broadcast 200MB to 30K tasks it 
> becomes 6TB which is higher than the actual data size. In practice broadcast 
> would only download once per node. Even considering that in a 6K nodes 
> cluster the amount of data transfer would be around 1.2TB. Using 
> RoaringBitMap should make a big difference in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to