[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5342:
------------------------------------
    Description: 
1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for 
bloom join. When the keys are all unique, the combiner is unnecessary overhead.
2) Previously, the key was the bloom filter index and the values were the join 
keys. Combining meant doing a distinct on the bag of values, which runs into 
memory issues beyond 10 million records. That layout needs to be flipped and 
the distinct combiner used, so that it scales to billions of records.
3) Mention in the documentation that bloom join is also ideal for a right 
outer join with a smaller dataset on the right. Replicated join only supports 
left outer join.
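
A minimal Pig Latin sketch of how items 1) and 3) might be used together. It 
assumes the proposed pig.bloomjoin.nocombiner property is available in the 
build being run; the relation names, field names and paths are hypothetical.

```pig
-- Hypothetical inputs: 'large' is the big relation, 'small' the smaller one.
-- Assumes the join keys are unique, so the combiner would only add overhead.
SET pig.bloomjoin.nocombiner true;

large = LOAD 'large_input' AS (id:long, payload:chararray);
small = LOAD 'small_input' AS (id:long, name:chararray);

-- Right outer join with the smaller dataset on the right, using bloom join.
joined = JOIN large BY id RIGHT OUTER, small BY id USING 'bloom';

STORE joined INTO 'joined_output';
```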

 

  was:
1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for 
bloom join. When the keys are all unique, the combiner is unnecessary overhead.
2) Mention in the documentation that bloom join is also ideal for a right 
outer join with a smaller dataset on the right. Replicated join only supports 
left outer join.

 


1) pkgr.setKeyType(DataType.INTEGER); should go in createBloomInMap and 
pkg.getPkgr().setKeyType(op.getPkgr().getKeyType()); should go in the else 
clause. Not sure how it is working currently. The golden files also don't look 
right for the map case: the key is showing as bytearray instead of int because 
of that.
2) if (pkg.getPkgr() instanceof BloomPackager) should be (pkg.getPkgr() 
instanceof BloomPackager && pkgr.isBloomCreatedInMap())
3) Please update one of the e2e tests in join.conf with a different value for 
pig.bloomjoin.num.filters

> Add setting to turn off bloom join combiner
> -------------------------------------------
>
>                 Key: PIG-5342
>                 URL: https://issues.apache.org/jira/browse/PIG-5342
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>         Attachments: PIG-5342-1.patch, PIG-5342-2.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off the combiner for 
> bloom join. When the keys are all unique, the combiner is unnecessary 
> overhead.
> 2) Previously, the key was the bloom filter index and the values were the 
> join keys. Combining meant doing a distinct on the bag of values, which runs 
> into memory issues beyond 10 million records. That layout needs to be flipped 
> and the distinct combiner used, so that it scales to billions of records.
> 3) Mention in the documentation that bloom join is also ideal for a right 
> outer join with a smaller dataset on the right. Replicated join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
