Re: how to get output files of fixed size in map-reduce job output

Harsh J Wed, 22 Jun 2011 11:46:59 -0700

CombineFileInputFormat should help preserve some locality, but it
would not be as good as loading the file into HDFS itself with a
1 GB block size (block size is a per-file property, not a global
one). You may consider that as an alternative approach.


I do not follow (ii). By my last sentence I meant the same thing I've
explained just above. If your block size is 64 MB and you request
splits of 1 GB (via plain FileInputFormat), then even the 64 MB read
can't be guaranteed local (theoretically speaking).
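
To make the arithmetic concrete, here is a rough plain-Java sketch of
the split-size rule as I understand it from the old-API
FileInputFormat: max(minSize, min(goalSize, blockSize)). It does not
call Hadoop at all; the class name, the 10 GB input size and the
two-map hint are made up for illustration.

```java
// Plain-Java sketch with no Hadoop dependency. It mirrors the split-size
// rule of the old-API FileInputFormat: max(minSize, min(goalSize, blockSize)).
// Class name and sample numbers are hypothetical.
final class SplitSizeSketch {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    // goalSize is the total input size divided by the requested number of maps.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of HDFS blocks that a single split spans, rounded up.
    static long blocksPerSplit(long splitSize, long blockSize) {
        return (splitSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;
        long goalSize = (10 * GB) / 2; // hypothetical: 10 GB input, hint of 2 maps

        // With the default minimum of 1 byte, the block size caps the split.
        long defaultSplit = computeSplitSize(goalSize, 1L, blockSize);
        // With mapred.min.split.size set to 1 GB, the minimum wins.
        long bigSplit = computeSplitSize(goalSize, 1 * GB, blockSize);

        System.out.println("default split (MB): " + defaultSplit / MB);
        System.out.println("split with 1 GB minimum (MB): " + bigSplit / MB);
        System.out.println("blocks per 1 GB split: "
                + blocksPerSplit(bigSplit, blockSize));
    }
}
```

A 1 GB split over 64 MB blocks spans 16 blocks, and their replicas
generally sit on different nodes, which is why the task reading that
split cannot expect all of its reads to be local.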

On Thu, Jun 23, 2011 at 12:04 AM, Mapred Learn <mapred.le...@gmail.com> wrote:
> Hi Harsh,
> Thanks !
> i) I am currently doing it by extending CombineFileInputFormat and
> specifying -Dmapred.max.split.size, but this increases job finish time
> about threefold.
> ii) You said the file output size is going to be greater than the block
> size in this case. What happens when people have an input split of, say,
> 1 GB and the map-reduce output is 400 MB? Is that size also greater
> than the block size? Or did you mean that since the mapper will get
> multiple input files as its input split, the data input to the mapper
> won't be local?
>
> On Wed, Jun 22, 2011 at 11:26 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Mapred,
>>
>> This should be doable if you are using TextInputFormat (or other
>> FileInputFormat derivatives that do not override getSplits()
>> behaviors).
>>
>> Try this:
>> jobConf.setLong("mapred.min.split.size", 1024L * 1024L * 1024L);
>> // The value is the byte size (a long) you want each mapper split to
>> // try to contain; 1 GB in bytes is shown here.
>>
>> This would get you splits of the size you specify, 1 GB or otherwise,
>> and you should have outputs fairly near to 1 GB when you do the
>> sequence file conversion (lower at times due to serialization and
>> compression being applied). You can play around with the parameter
>> until the results are satisfactory.
>>
>> Note: Tasks would no longer be perfectly data-local, since you are
>> probably requesting splits much larger than the block size.
>>
>> On Wed, Jun 22, 2011 at 10:52 PM, Mapred Learn <mapred.le...@gmail.com>
>> wrote:
>> > I have a use case where I want to process data and generate seq file
>> > output of fixed size, say 1 GB, i.e. each map-reduce job output
>> > should be 1 GB.
>> >
>> > Does anybody know of any -D option or any other way to achieve this ?
>> >
>> > -Thanks JJ
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
