[
https://issues.apache.org/jira/browse/PIG-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776792#action_12776792
]
Thejas M Nair commented on PIG-1088:
------------------------------------
*Problem*: With the old load/store interface, the index created by
MergeJoinIndexer consisted of tuples of join key(s), filename, and offset. With
the new load/store interface, the split index is available
(RecordReader.getSplitIndex) instead of the filename and offset. But there is
no guarantee that split indexes follow the sorted order of the file. If more
than one split has tuples with the same join key, it is necessary to know which
split needs to be read first.
*Proposal*: (thanks to Alan Gates)
We should add an interface to the list of load interfaces:
public interface LoadOrderedInput {
    WritableComparable getPosition();
}
If the load function implements this interface, it can then be used in a merge
join. getPosition could then be called in the map phase of the sampling MR job,
and the tuples in the index will have the sort (join) key(s) followed by the
resulting value.
When the index is sorted in the reduce phase of the sampling MR job, this value
will then be used to order the entries.
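To make the sorting step concrete, here is a minimal sketch of ordering index
entries by join key and then by position. It is an illustration under stated
assumptions, not the MergeJoinIndexer code: IndexEntry, MergeJoinIndexSketch,
and the splitIndex field are hypothetical names, and a plain long stands in for
the WritableComparable returned by getPosition so the sketch runs without
Hadoop.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical index entry: join key, position, and the split it came from.
final class IndexEntry {
    final String joinKey;  // first join key seen in the split
    final long position;   // assumed stand-in for the getPosition() value
    final int splitIndex;  // assumed: split index kept alongside the key

    IndexEntry(String joinKey, long position, int splitIndex) {
        this.joinKey = joinKey;
        this.position = position;
        this.splitIndex = splitIndex;
    }
}

final class MergeJoinIndexSketch {
    // Sort by join key first; entries with equal keys are ordered by position.
    static List<IndexEntry> sortIndex(List<IndexEntry> entries) {
        List<IndexEntry> out = new ArrayList<>(entries);
        out.sort(Comparator.comparing((IndexEntry e) -> e.joinKey)
                           .thenComparingLong(e -> e.position));
        return out;
    }

    // Split indexes in the order the splits should be read.
    static List<Integer> splitOrder(List<IndexEntry> entries) {
        List<Integer> order = new ArrayList<>();
        for (IndexEntry e : sortIndex(entries)) {
            order.add(e.splitIndex);
        }
        return order;
    }

    public static void main(String[] args) {
        List<IndexEntry> index = new ArrayList<>();
        index.add(new IndexEntry("b", 200L, 0));
        index.add(new IndexEntry("a", 0L, 2));
        index.add(new IndexEntry("b", 100L, 1));
        System.out.println(splitOrder(index));
    }
}
```

Here splits 0 and 1 both start with key "b". The split index alone does not say
which comes first in the file, but the position does, so split 1 is read before
split 0.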
For LoadFuncs that use FileInputFormat, getPosition can return the following
class:
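public class TextInputOrder implements WritableComparable<TextInputOrder> {
    private String basename; // basename of the file
    private long offset;     // offset at which this split starts

    public int compareTo(TextInputOrder other) {
        // Order by file name first, then by offset within the file.
        int rc = basename.compareTo(other.basename);
        if (rc == 0) rc = Long.compare(offset, other.offset);
        return rc;
    }
    // write(DataOutput) and readFields(DataInput), required by Writable,
    // are omitted here.
}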
This means that we would take the filenames sorted lexicographically (which
will work for things like part-00000, map-00000, bucket001 (warehouse data),
etc.) and then offsets into those files after that.
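As a quick check of that ordering, the following sketch sorts a few positions
by basename and then offset. It is only an illustration: FilePosition and
FilePositionDemo are hypothetical names, and the class implements plain
Comparable rather than WritableComparable so that it runs without Hadoop on the
classpath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the proposed position class.
final class FilePosition implements Comparable<FilePosition> {
    final String basename; // basename of the file
    final long offset;     // offset at which the split starts

    FilePosition(String basename, long offset) {
        this.basename = basename;
        this.offset = offset;
    }

    @Override
    public int compareTo(FilePosition other) {
        // Lexicographic order on the file name, then numeric order on offset.
        int rc = basename.compareTo(other.basename);
        if (rc == 0) rc = Long.compare(offset, other.offset);
        return rc;
    }

    @Override
    public String toString() {
        return basename + "@" + offset;
    }
}

final class FilePositionDemo {
    // Returns a copy of the positions in (basename, offset) order.
    static List<FilePosition> sorted(List<FilePosition> in) {
        List<FilePosition> out = new ArrayList<>(in);
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        List<FilePosition> splits = new ArrayList<>();
        splits.add(new FilePosition("part-00001", 0L));
        splits.add(new FilePosition("part-00000", 67108864L));
        splits.add(new FilePosition("part-00000", 0L));
        // prints [part-00000@0, part-00000@67108864, part-00001@0]
        System.out.println(sorted(splits));
    }
}
```

One caveat with lexicographic ordering: it matches file order only when the
numeric suffixes are zero-padded to the same width. For example, "part-10"
sorts before "part-2" as a string.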
To make it easier for authors of new LoadFuncs to implement this interface, an
implementation for load functions that use FileInputFormat will be provided
through an abstract base class.
> change merge join and merge join indexer to work with new LoadFunc interface
> ----------------------------------------------------------------------------
>
> Key: PIG-1088
> URL: https://issues.apache.org/jira/browse/PIG-1088
> Project: Pig
> Issue Type: Sub-task
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
>