[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898648#action_12898648
]
Ashutosh Chauhan commented on PIG-1518:
---------------------------------------
This feature of combining multiple splits should honor the OrderedLoadFunc
interface. If a loadfunc implements that interface, then the splits it
generates should not be combined. However, it's not clear why FileInputLoadFunc
implements this interface. AFAIK, the split[] returned by getSplits() on
FileInputFormat makes no guarantee that the underlying splits will be returned
in order. Ordered return is the default behavior right now, so having
FileInputLoadFunc implement OrderedLoadFunc doesn't cause any problem in the
current implementation. But there seems to be no real benefit in
FileInputLoadFunc implementing it (there is one exception, which I will come to
below). So, I will argue that FileInputLoadFunc should stop implementing
OrderedLoadFunc. This has the immediate benefit of making this change useful to
all the fundamental storage mechanisms of Pig, like PigStorage, BinStorage,
InterStorage, etc. Dropping an interface from an implementing class can be seen
as a backward-incompatible change, but I really doubt anyone cares whether
PigStorage reads splits in an ordered fashion.
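To make the rule above concrete, here is a minimal sketch of a combiner that leaves splits alone whenever the loader promises ordering. It is not Pig's code: Loader and OrderedLoader are hypothetical stand-ins for LoadFunc and OrderedLoadFunc, splits are reduced to their sizes in bytes, and the greedy packing is only one possible combining policy.

```java
import java.util.ArrayList;
import java.util.List;

class SplitCombiningSketch {
    /** Hypothetical stand-in for Pig's LoadFunc; empty for illustration. */
    interface Loader {}

    /** Hypothetical stand-in for Pig's OrderedLoadFunc, reduced to a marker. */
    interface OrderedLoader extends Loader {}

    /** Combining is allowed only when the loader does not promise split order. */
    static boolean mayCombine(Loader loader) {
        return !(loader instanceof OrderedLoader);
    }

    /**
     * Greedily packs split sizes (in bytes) into groups of at most maxBytes.
     * When combining is not allowed, every split stays in its own group.
     */
    static List<List<Long>> plan(Loader loader, List<Long> splitSizes, long maxBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        boolean combine = mayCombine(loader);
        for (long size : splitSizes) {
            // Close the current group if we may not combine or it would overflow.
            if (!current.isEmpty() && (!combine || currentBytes + size > maxBytes)) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

With split sizes 10, 20, 30, 40 and a 64-byte limit, a plain loader would get two combined groups, while an ordered loader would keep four separate splits.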
The only real victim of this change will be MergeJoin, which will stop working
with PigStorage by default. But, first, we have not seen MergeJoin used with
PigStorage in many places. Second, it relies anyway on an assumption about
FileInputFormat, which may choose to change its behavior in the future. Third,
the solution to this problem is straightforward: have another loader that
extends PigStorage and implements OrderedLoadFunc, which can be used to load
data for merge join.
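The third point could look roughly like the sketch below. All the types are hypothetical stand-ins rather than Pig classes: PlainStorage plays the role of a PigStorage that no longer promises order, OrderedLoader plays the role of OrderedLoadFunc, and a split is just a path plus a start offset. The stand-in interface exposes a Comparator for brevity, which is an assumption of this sketch and not the real interface's signature.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OrderedLoaderSketch {
    /** Hypothetical stand-in for an input split: a file path and a start offset. */
    static final class Split {
        final String path;
        final long offset;

        Split(String path, long offset) {
            this.path = path;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return path + "@" + offset;
        }
    }

    /** Hypothetical stand-in for the ordering contract of OrderedLoadFunc. */
    interface OrderedLoader {
        Comparator<Split> splitOrder();
    }

    /** Stand-in for a PigStorage that makes no promise about split order. */
    static class PlainStorage {
        List<Split> splits(List<Split> raw) {
            return new ArrayList<>(raw);
        }
    }

    /** Hypothetical subclass that restores the ordering promise for merge join. */
    static class OrderedStorage extends PlainStorage implements OrderedLoader {
        @Override
        public Comparator<Split> splitOrder() {
            // Order by file path first, then by offset within the file.
            return Comparator.comparing((Split s) -> s.path)
                             .thenComparingLong(s -> s.offset);
        }

        @Override
        List<Split> splits(List<Split> raw) {
            List<Split> sorted = new ArrayList<>(raw);
            sorted.sort(splitOrder());
            return sorted;
        }
    }

    /** Merge join would accept only loaders that make the ordering promise. */
    static boolean usableForMergeJoin(Object loader) {
        return loader instanceof OrderedLoader;
    }
}
```

Users who need merge join would name the ordered subclass in their load statement, and everyone else would get the plain loader whose splits are free to be combined.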
In essence, I am arguing to drop the OrderedLoadFunc interface from
FileInputLoadFunc so that this feature is useful for a large number of use
cases. Yan, you also need to watch out for ReadToEndLoader, which also makes
assumptions that may break in the presence of this feature.
> multi file input format for loaders
> -----------------------------------
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> can be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.