[
https://issues.apache.org/jira/browse/PIG-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838928#comment-15838928
]
Daniel Dai commented on PIG-4963:
---------------------------------
I glance through the patch and looks very good. I have some minor comments:
1. The documentation about the left outer join give the impression that user
can make bloom join efficient by switch the order of relations. Actually this
is the limitation of bloom join and switch order does not solve the problem. We
shall make it more clear.
2. Currently we use POBloomFilterRearrangeTez for the bloom filter. But I feel
it is more clear if the plan show a filter + regular local rearrange. The
execution plan of the later is more understandable.
3. The patch does have quite a few test coverage. However, we can run existing
join e2e tests once with bloom join, and make sure it works. That's an easy
approach for additional tests.
I still need more time to do a code level review, but I am fine to commit once
we have done #1, #3, and deal with #2 and other review comments in the follow
up Jiras.
> Add a Bloom join
> ----------------
>
> Key: PIG-4963
> URL: https://issues.apache.org/jira/browse/PIG-4963
> Project: Pig
> Issue Type: New Feature
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-4963-1.patch, PIG-4963-2.patch, PIG-4963-3.patch,
> PIG-4963-4.patch
>
>
> In PIG-4925, added option to pass BloomFilter as a scalar to bloom function.
> But found that actually using it for big data which required huge vector size
> was very inefficient and led to OOM.
> I had initially calculated that it would take around 12MB bytearray for
> 100 million vectorsize (100000000 + 7) / 8 = 12500000 bytes) and that would
> be the scalar value broadcasted and would not take much space. But problem is
> 12MB was written out for every input record with BuildBloom$Initial before
> the aggregation happens and we arrive at the final BloomFilter vector. And
> with POPartialAgg it runs into OOM issues.
> If we added a bloom join implementation, which can be combined with hash or
> skewed join it would boost performance for a lot of jobs. Bloom filter of the
> smaller tables can be sent to the bigger tables as scalar and data filtered
> before hash or skewed join is used.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)