[ 
https://issues.apache.org/jira/browse/TEZ-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971862#comment-13971862
 ] 

Siddharth Seth commented on TEZ-873:
------------------------------------

[~kamrul], let's get this into MRInputLegacy for now. Please annotate it as 
Unstable. For simple cases where a user needs to access split information up 
front, and does not need to do anything when the underlying split changes, 
this is OK. However, if a user expects to change processing when the reader 
moves to a new file, this is very inefficient. That is the Hive use case.

On the patch itself, it needs to be rebased.
- Please avoid changes which are not required for the patch: the 
wrappedSplits/GroupedSplits changes in the grouping classes, and the 
usingNewApi change in MRInput. The method name can be isUsingNewApi, but the 
internal parameter does not need to change. This will get rid of other 
unnecessary changes in Grouping as well.
- s/getGroupedSplits/getConstituentSplits in the Grouping classes.

Could you also add a note on these methods, something to the effect of: "Not 
meant to be used inside a reader tight loop. If that functionality is 
required, it should be a new jira."
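
For illustration, here is a minimal sketch of how the renamed accessor, the Unstable annotation and the requested note could fit together. It is not the patch itself: FileSplitStub, GroupedSplitStub, the nested Unstable annotation and the fileNames helper are hypothetical stand-ins so the example compiles without Hadoop or Tez on the classpath, and the field name is chosen only for the example.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ConstituentSplitsSketch {

    /** Hypothetical stand-in for Hadoop's InterfaceStability.Unstable marker. */
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    @interface Unstable {}

    /** Hypothetical stand-in for a file-backed split; only the path is modeled. */
    static final class FileSplitStub {
        private final String path;

        FileSplitStub(String path) {
            this.path = path;
        }

        /** Returns the last path component, e.g. "a.txt" for "/data/a.txt". */
        String getName() {
            return path.substring(path.lastIndexOf('/') + 1);
        }
    }

    /** Hypothetical stand-in for a grouped split that wraps several splits. */
    static final class GroupedSplitStub {
        // Field name chosen for illustration only.
        private final List<FileSplitStub> wrappedSplits = new ArrayList<>();

        void addSplit(FileSplitStub split) {
            wrappedSplits.add(split);
        }

        /**
         * Returns the splits that make up this group.
         *
         * Not meant to be used inside a reader tight loop. If that
         * functionality is required, it should be a new jira.
         */
        @Unstable
        List<FileSplitStub> getConstituentSplits() {
            return Collections.unmodifiableList(wrappedSplits);
        }
    }

    /** Up-front use: collect the file names once, before reading any records. */
    static List<String> fileNames(GroupedSplitStub group) {
        List<String> names = new ArrayList<>();
        for (FileSplitStub split : group.getConstituentSplits()) {
            names.add(split.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        GroupedSplitStub group = new GroupedSplitStub();
        group.addSplit(new FileSplitStub("/data/a.txt"));
        group.addSplit(new FileSplitStub("/data/b.txt"));
        System.out.println(fileNames(group));
    }
}
```

The helper reads the split list once before processing starts, which is the simple case described above. Calling the accessor per record to detect a file change is the pattern the note is meant to discourage.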


> Allow MRInputLegacy to expose the individual input split
> --------------------------------------------------------
>
>                 Key: TEZ-873
>                 URL: https://issues.apache.org/jira/browse/TEZ-873
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Mohammad Kamrul Islam
>            Assignee: Mohammad Kamrul Islam
>         Attachments: TEZ-873.1.patch, TEZ-873.2.patch
>
>
> Currently there is no way of getting the InputSplit from TezProcessor. In the 
> current MR framework, there is a way to find out the filename through 
> FileSplit. For example, one common use is to get the filename in map:
> String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
> There is other metadata in InputSplit that could be used by existing MR 
> users.
> This JIRA is to add APIs to expose the InputSplit by adding 
> TezGroupedSplit.getWrapperSplit() and MRInput.getInputSplit().
> Although MRInputLegacy provides an API to get the InputSplit, it has a few 
> issues:
>  * Without TezGroupedSplit.getWrapperSplit() it is unusable.
>  * Since it is used in various use cases, I propose to move it from 
> MRInputLegacy to MRInput.
>  * Currently the APIs are named getNewInputSplit() and getOldInputSplit(). 
> These should be merged into one: getInputSplit(). The new/old API distinction 
> should be handled internally.
> Please give your feedback.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
