Thejas M Nair commented on PIG-1088:

*Problem*: With the old load/store interface, the index created by 
MergeJoinIndexer consisted of tuples of the form (join key(s), filename, 
offset). With the new load/store interface, only the split index is available 
(RecordReader.getSplitIndex) instead of the filename and offset. But there is 
no guarantee that split indexes follow the sorted order of the data in the 
file. If more than one split contains tuples with the same join key, it is 
necessary to know which split needs to be read first. 

*Proposal*: (thanks to Alan Gates)
We should add an interface to the list of load interfaces:

public interface LoadOrderedInput {

     // Position of the current split in the sorted order of the input.
     WritableComparable getPosition();
}

If a load function implements this interface, it can be used in a merge 
join. getPosition would be called in the map phase of the sampling MR job, 
so each tuple in the index would hold the sort (join) key(s) followed by the 
returned position value. 
When the index is sorted in the reduce phase of the sampling MR job, this 
value would then be used to order entries that share the same key(s).
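
To illustrate the sort described above, here is a minimal sketch of index 
entries ordered by join key and then by position. It is not the actual 
MergeJoinIndexer code: IndexEntry, MergeIndexSortDemo and splitReadOrder are 
hypothetical names, and a plain long stands in for the WritableComparable 
returned by getPosition so that the example needs only the JDK.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical index entry: join key, then position, plus the split it came from.
final class IndexEntry implements Comparable<IndexEntry> {
    final String joinKey;   // first join key seen in the split
    final long position;    // stand-in for the value returned by getPosition()
    final int splitIndex;   // split index, which carries no ordering guarantee

    IndexEntry(String joinKey, long position, int splitIndex) {
        this.joinKey = joinKey;
        this.position = position;
        this.splitIndex = splitIndex;
    }

    // Order by join key first; break ties with the position.
    @Override
    public int compareTo(IndexEntry other) {
        int rc = joinKey.compareTo(other.joinKey);
        if (rc == 0) rc = Long.compare(position, other.position);
        return rc;
    }
}

final class MergeIndexSortDemo {
    // Returns the split indexes in the order the splits should be read.
    static List<Integer> splitReadOrder(List<IndexEntry> entries) {
        List<IndexEntry> sorted = new ArrayList<>(entries);
        Collections.sort(sorted);
        List<Integer> order = new ArrayList<>();
        for (IndexEntry e : sorted) order.add(e.splitIndex);
        return order;
    }

    public static void main(String[] args) {
        List<IndexEntry> entries = new ArrayList<>();
        // Splits 0 and 1 both start with key "apple"; split 1 comes earlier in the file.
        entries.add(new IndexEntry("apple", 128L, 0));
        entries.add(new IndexEntry("apple", 0L, 1));
        entries.add(new IndexEntry("banana", 256L, 2));
        System.out.println(splitReadOrder(entries));
    }
}
```

In this toy input the split index alone would suggest reading split 0 before 
split 1, while the position places split 1 first for the shared key.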

For LoadFuncs that use FileInputFormat, getPosition can return the following:

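public class TextInputOrder implements WritableComparable<TextInputOrder> {

        private String basename;  // basename of the file
        private long offset;      // offset at which this split starts

        // Order by file basename first, then by offset within the file.
        public int compareTo(TextInputOrder other) {
                int rc = basename.compareTo(other.basename);
                if (rc == 0) rc = Long.compare(offset, other.offset);
                return rc;
        }

        // Writable serialization (write/readFields) omitted here.
}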

This means that filenames would be compared lexicographically (which will 
work for names like part-00000, map-00000, bucket001 (warehouse data), etc.), 
and then offsets within those files would be compared. 
To make it easier for authors of new LoadFuncs to implement this interface, 
an implementation for load functions that use FileInputFormat will be 
provided through an abstract base class. 
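
One way such a base class might look is sketched below. This is only an 
illustration under stated assumptions, not the planned Pig API: 
FileOrderedLoaderBase, FixedSplitLoader, FilePosition, OrderedInput, 
currentPath and currentSplitStart are all hypothetical names, and Comparable 
stands in for Hadoop's WritableComparable so the example compiles with the 
JDK alone.

```java
import java.util.Arrays;

// Hypothetical position: file basename plus the offset where the split starts.
final class FilePosition implements Comparable<FilePosition> {
    final String basename;
    final long offset;

    FilePosition(String basename, long offset) {
        this.basename = basename;
        this.offset = offset;
    }

    // Lexicographic order on basename, then numeric order on offset.
    @Override
    public int compareTo(FilePosition other) {
        int rc = basename.compareTo(other.basename);
        return rc != 0 ? rc : Long.compare(offset, other.offset);
    }

    @Override
    public String toString() {
        return basename + "@" + offset;
    }
}

// Stand-in for the proposed LoadOrderedInput interface.
interface OrderedInput {
    FilePosition getPosition();
}

// Hypothetical base class: subclasses only report the current file and split start.
abstract class FileOrderedLoaderBase implements OrderedInput {
    protected abstract String currentPath();      // path of the file behind the current split
    protected abstract long currentSplitStart();  // byte offset at which the split starts

    @Override
    public final FilePosition getPosition() {
        String path = currentPath();
        String basename = path.substring(path.lastIndexOf('/') + 1);
        return new FilePosition(basename, currentSplitStart());
    }
}

// Toy loader with a fixed path and offset, used only to exercise the base class.
final class FixedSplitLoader extends FileOrderedLoaderBase {
    private final String path;
    private final long start;

    FixedSplitLoader(String path, long start) {
        this.path = path;
        this.start = start;
    }

    @Override protected String currentPath() { return path; }
    @Override protected long currentSplitStart() { return start; }
}

final class FilePositionDemo {
    public static void main(String[] args) {
        FilePosition[] positions = {
            new FixedSplitLoader("/data/out/part-00001", 0L).getPosition(),
            new FixedSplitLoader("/data/out/part-00000", 67108864L).getPosition(),
            new FixedSplitLoader("/data/out/part-00000", 0L).getPosition(),
        };
        Arrays.sort(positions);  // sorts by basename, then offset
        for (FilePosition p : positions) System.out.println(p);
    }
}
```

Note that a plain string comparison depends on zero-padded names: part-10 
would sort before part-2 if the numbers were not padded.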

> change merge join and merge join indexer to work with new LoadFunc interface
> ----------------------------------------------------------------------------
>                 Key: PIG-1088
>                 URL: https://issues.apache.org/jira/browse/PIG-1088
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
