Jens Rabe created MAPREDUCE-6208:
------------------------------------

             Summary: There should be an input format for MapFiles which can be 
configured so that only a fraction of the input data is used for the MR process
                 Key: MAPREDUCE-6208
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6208
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
            Reporter: Jens Rabe


In some cases, large amounts of data are organized in MapFiles, e.g., from 
previous MapReduce tasks, and only a fraction of that data is to be processed 
in an MR task. The current approach, as I understand it, is to reorganize the 
data into a suitable partitioning using folders on HDFS, use only the relevant 
folders as input paths, and possibly do some additional filtering in the Map 
task. However, sometimes the input data cannot easily be partitioned that way, 
for example when processing large amounts of measured data where additional 
data for a time period already in HDFS arrives later.

There should be an input format that accepts folders with MapFiles, with an 
option to specify an input key range so that InputSplits are generated only 
for data whose keys fall within that range.
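
To illustrate the kind of selection I have in mind, here is a minimal sketch in 
plain Java with no Hadoop dependency. It is not an existing Hadoop API: the 
names KeyRangeSplitFilter, PartRange, overlaps and selectParts are hypothetical, 
and it assumes that keys are long values (e.g., timestamps) and that the first 
and last key of each MapFile are already known. How a real input format would 
obtain those keys (for instance from the MapFile index) is left open here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyRangeSplitFilter {

    /** Hypothetical summary of one MapFile: its path and the smallest/largest key it holds. */
    static final class PartRange {
        final String path;
        final long firstKey;
        final long lastKey;

        PartRange(String path, long firstKey, long lastKey) {
            this.path = path;
            this.firstKey = firstKey;
            this.lastKey = lastKey;
        }
    }

    /** True if the part's key range intersects the requested range [minKey, maxKey], bounds inclusive. */
    static boolean overlaps(PartRange part, long minKey, long maxKey) {
        return part.firstKey <= maxKey && part.lastKey >= minKey;
    }

    /** Returns the paths of all parts that may contain keys in [minKey, maxKey]. */
    static List<String> selectParts(List<PartRange> parts, long minKey, long maxKey) {
        List<String> selected = new ArrayList<>();
        for (PartRange part : parts) {
            if (overlaps(part, minKey, maxKey)) {
                selected.add(part.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up example data: four MapFiles with consecutive key ranges.
        List<PartRange> parts = Arrays.asList(
                new PartRange("part-00000", 0L, 99L),
                new PartRange("part-00001", 100L, 199L),
                new PartRange("part-00002", 200L, 299L),
                new PartRange("part-00003", 300L, 399L));

        // Only parts overlapping the requested key range would become splits.
        System.out.println(selectParts(parts, 150L, 250L));
    }
}
```

In an actual input format, a check like this would presumably run while the 
splits are computed, so that MapFiles entirely outside the configured key range 
never produce an InputSplit. Records outside the range within a selected 
MapFile would still need to be skipped by the record reader or in the Map task.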



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
