[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jens Rabe updated MAPREDUCE-6208:
---------------------------------
    Attachment: MAPREDUCE-6208.001.patch

> There should be an input format for MapFiles which can be configured so that 
> only a fraction of the input data is used for the MR process
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6208
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6208
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Jens Rabe
>              Labels: inputformat, mapfile
>         Attachments: MAPREDUCE-6208.001.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In some cases, large amounts of data are organized in MapFiles, e.g.,
> from previous MapReduce tasks, and only a fraction of that data needs to be
> processed in an MR task. The current approach, as I understand it, is to
> reorganize the data into suitable partitions using folders on HDFS, pass
> only the relevant folders as input paths, and possibly do some additional
> filtering in the Map task. However, sometimes the input data cannot easily
> be partitioned that way, for example when processing large amounts of
> measured data where additional data for a time period already stored in
> HDFS arrives later.
> There should be an input format that accepts folders with MapFiles, and
> there should be an option to specify the input key range so that only
> InputSplits that fall within that range are generated.
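
To make the requested key-range option concrete, here is a minimal sketch of the selection step only. It is not taken from the attached patch and does not use any Hadoop classes: the names KeyRangeSplitFilter, PartRange and select are hypothetical, keys are assumed to be longs, and the first and last key of each MapFile are assumed to be known already (in a real input format they would presumably be read from the MapFiles themselves).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: choose the MapFiles whose key range overlaps a requested range. */
public class KeyRangeSplitFilter {

    /** Hypothetical descriptor of one MapFile: its path plus its first and last key. */
    public static final class PartRange {
        final String path;
        final long firstKey;
        final long lastKey;

        public PartRange(String path, long firstKey, long lastKey) {
            this.path = path;
            this.firstKey = firstKey;
            this.lastKey = lastKey;
        }
    }

    /**
     * Returns the paths of all parts whose [firstKey, lastKey] interval
     * overlaps the requested [minKey, maxKey] interval (both inclusive).
     * Only these parts would need InputSplits.
     */
    public static List<String> select(List<PartRange> parts, long minKey, long maxKey) {
        List<String> selected = new ArrayList<>();
        for (PartRange p : parts) {
            // Two closed intervals overlap when each one starts before the other ends.
            if (p.lastKey >= minKey && p.firstKey <= maxKey) {
                selected.add(p.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up example data: three MapFiles covering consecutive key ranges.
        List<PartRange> parts = new ArrayList<>();
        parts.add(new PartRange("part-r-00000", 0L, 99L));
        parts.add(new PartRange("part-r-00001", 100L, 199L));
        parts.add(new PartRange("part-r-00002", 200L, 299L));

        // Request keys 150..250: only the last two parts overlap.
        System.out.println(select(parts, 150L, 250L)); // prints [part-r-00001, part-r-00002]
    }
}
```

Since the keys inside a MapFile are sorted, a fuller version could additionally seek to the first matching key within each selected file, so that boundary files are not read in full. That refinement is left out of this sketch.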



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
