Ashutosh Chauhan updated PIG-926:

    Attachment: mj_phase2_1.patch

The attached first patch runs the full right-side pipeline in the indexer 
before sampling a tuple from each block. This has the following advantages:
a) It addresses the concern Pradeep raised in phase-1: "Strictly we 
should not allow LOForeach since it could change sort order or position of join 
keys and hence invalidate the index - but we need it so that the Foreach 
introduced by the TypeCastInserter when there is a schema for either of the 
inputs remains." Because the pipeline now runs before the tuple is sampled, 
this is no longer an issue.
b) Currently, type information doesn't reach the POSort that sorts the index 
entries in the reduce task of the index job. The sort happens to work for 
other reasons, but this patch fixes the underlying problem.
c) It improves performance. Instead of always sampling the first record of 
the block, the index now holds an entry for the first record in the block that 
can take part in the join. This saves the time spent fetching right-side 
tuples over the network that could never be joined.
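
To illustrate points (b) and (c), here is a minimal sketch of the indexing idea 
in Python. It is not the code in the patch. The names build_index, 
cast_and_filter and key_of, the (key, block_id) entry layout, and the choice to 
skip blocks with no surviving record are all assumptions made for the example. 
The pipeline here casts the key to an int and drops records with an empty key, 
standing in for whatever the real right-side pipeline does.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Record = Tuple  # hypothetical right-side tuple: (join key, payload)


def cast_and_filter(record: Record) -> Optional[Record]:
    """Hypothetical right-side pipeline: cast the key to int and
    drop records whose key is empty (they can never join)."""
    key, payload = record
    if key == "":
        return None
    return (int(key), payload)


def key_of(record: Record):
    """Hypothetical join-key extractor: the key is the first field."""
    return record[0]


def build_index(
    blocks: Sequence[Sequence[Record]],
    pipeline: Callable[[Record], Optional[Record]],
    key_of: Callable[[Record], object],
) -> List[Tuple[object, int]]:
    """Return one (key, block_id) entry per block, taken from the first
    record in that block that survives the pipeline."""
    index = []
    for block_id, block in enumerate(blocks):
        for record in block:
            out = pipeline(record)  # run the pipeline BEFORE sampling
            if out is not None:
                index.append((key_of(out), block_id))
                break  # only the first surviving record is indexed
        # Assumption: a block with no surviving record gets no entry.
    # Stand-in for the sort in the index job; keys are already typed here.
    index.sort(key=lambda entry: entry[0])
    return index


# Keys arrive as strings; the first record of block 0 has an empty key.
blocks = [
    [("", "a"), ("2", "b")],
    [("9", "c"), ("10", "d")],
    [("10", "e"), ("11", "f")],
]
index = build_index(blocks, cast_and_filter, key_of)
print(index)
```

In this sketch the entry for block 0 carries key 2 rather than the empty key of 
its first raw record, and the entries sort numerically (2, 9, 10) because the 
cast has already been applied when the key is sampled.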

> Merge-Join phase 2
> ------------------
>                 Key: PIG-926
>                 URL: https://issues.apache.org/jira/browse/PIG-926
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Ashutosh Chauhan
>            Assignee: Ashutosh Chauhan
>            Priority: Minor
>         Attachments: mj_phase2_1.patch
> This jira is created to keep track of phase-2 work for MergeJoin. Various 
> limitations exist in phase-1 for Merge Join which are listed on: 
> http://wiki.apache.org/pig/PigMergeJoin Those will be addressed here.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
