Ashutosh Chauhan updated PIG-926:
The attached first patch runs the full pipeline of the right side in the indexer
before sampling the tuple from a block. This has the following advantages:
a) It addresses the concern which Pradeep pointed out in phase-1: "Strictly we
should not allow LOForeach since it could change sort order or position of join
keys and hence invalidate the index - but we need it so that the Foreach
introduced by the TypeCastInserter when there is a schema for either of the
inputs remains." Because the pipeline now runs before the tuple is sampled,
this becomes a non-issue.
b) Currently, type information doesn't reach the POSort that sorts the index
entries in the reduce task of the index job. Sorting happens to work for other
reasons, but this patch fixes the missing type information.
c) It improves performance. Instead of always sampling the first record of the
block, the index now contains the entry for the first record in the block for
which a join may happen. This saves the time otherwise spent fetching right-side
tuples over the network that could not be joined in any case.
> Merge-Join phase 2
> Key: PIG-926
> URL: https://issues.apache.org/jira/browse/PIG-926
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Ashutosh Chauhan
> Assignee: Ashutosh Chauhan
> Priority: Minor
> Attachments: mj_phase2_1.patch
> This jira is created to keep track of phase-2 work for MergeJoin. Various
> limitations exist in phase-1 for Merge Join which are listed on:
> http://wiki.apache.org/pig/PigMergeJoin Those will be addressed here.