[ 
https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5255:
------------------------------------
        Summary: Improvements to bloom join  (was: Add setting to turn off 
combiner for bloom join)
    Description: 

1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
join. When the keys are all unique, the combiner is unnecessary overhead.
2) Mention in documentation that bloom join is also ideal in cases of right 
outer join with smaller dataset on the right. Replicate join only supports left 
outer join.
3) Write own bloom implementation for Murmur3 and Murmur3 with Kirsch & 
Mitzenmacher optimization which Cassandra uses 
(http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html). 
Currently we use Hadoop's bloomfilter implementation which only has Jenkins and 
Murmur2. Murmur3 is faster and offers better distribution.

4) Move from BitSet to RoaringBitMap for
  - Speed and better compression
  - Scale
  Currently bloom join does not scale for billions of keys. Really need large 
bloom filters in those cases and cost of broadcasting those is greater than 
actual data size. For eg: Join of 32B records (4TB of data) with 4 billion 
records with keys being mostly unique. Lets say we construct  61 partitioned 
bloom filters of 3MB each (still not good enough bit vector size for the amount 
of keys) it is close to 200MB. If we broadcast 200MB to 30K tasks it becomes 
6TB which is higher than the actual data size. In practice broadcast would only 
download once per node. Even considering that in a 6K nodes cluster the amount 
of data transfer would be around 1.2TB. Using RoaringBitMap should make a big 
difference in this case.

  was:Need a new setting pig.bloomjoin.nocombiner to turn off combiner for 
bloom join. When the keys are all unique, the combiner is unnecessary overhead.


> Improvements to bloom join
> --------------------------
>
>                 Key: PIG-5255
>                 URL: https://issues.apache.org/jira/browse/PIG-5255
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.18.0
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
> 3) Write own bloom implementation for Murmur3 and Murmur3 with Kirsch & 
> Mitzenmacher optimization which Cassandra uses 
> (http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html). 
> Currently we use Hadoop's bloomfilter implementation which only has Jenkins 
> and Murmur2. Murmur3 is faster and offers better distribution.
> 4) Move from BitSet to RoaringBitMap for
>   - Speed and better compression
>   - Scale
>   Currently bloom join does not scale for billions of keys. Really need large 
> bloom filters in those cases and cost of broadcasting those is greater than 
> actual data size. For eg: Join of 32B records (4TB of data) with 4 billion 
> records with keys being mostly unique. Lets say we construct  61 partitioned 
> bloom filters of 3MB each (still not good enough bit vector size for the 
> amount of keys) it is close to 200MB. If we broadcast 200MB to 30K tasks it 
> becomes 6TB which is higher than the actual data size. In practice broadcast 
> would only download once per node. Even considering that in a 6K nodes 
> cluster the amount of data transfer would be around 1.2TB. Using 
> RoaringBitMap should make a big difference in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to