Hi Harsh,

Thanks!

i) I am currently doing this by extending CombineFileInputFormat and specifying -Dmapred.max.split.size, but it increases job completion time by about 3x.

ii) You said the output size is going to be greater than the block size in this case. What happens when the input split is, say, 1 GB but the map-red output comes out to only 400 MB? Is the size still greater than the block size there? Or did you mean that since the mapper will get multiple input files as its input split, the data input to the mapper won't be local?
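For reference, this is roughly what I have now, sketched against the old (org.apache.hadoop.mapred) API; CombinedTextInputFormat and SingleFileReader are simplified, illustrative names, not stock Hadoop classes:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    // CombineFileRecordReader instantiates one SingleFileReader per file
    // chunk packed into the combined split and iterates over them in turn.
    return new CombineFileRecordReader<LongWritable, Text>(
        conf, (CombineFileSplit) split, reporter, SingleFileReader.class);
  }

  // Per-chunk reader with the (CombineFileSplit, Configuration, Reporter,
  // Integer) constructor that CombineFileRecordReader requires; it simply
  // delegates to a LineRecordReader over the chunk at the given index.
  public static class SingleFileReader
      implements RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate;

    public SingleFileReader(CombineFileSplit split, Configuration conf,
        Reporter reporter, Integer index) throws IOException {
      this.delegate = new LineRecordReader(conf, new FileSplit(
          split.getPath(index), split.getOffset(index),
          split.getLength(index), (String[]) null));
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      return delegate.next(key, value);
    }
    public LongWritable createKey() { return delegate.createKey(); }
    public Text createValue() { return delegate.createValue(); }
    public long getPos() throws IOException { return delegate.getPos(); }
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
    public void close() throws IOException { delegate.close(); }
  }
}

The job then uses conf.setInputFormat(CombinedTextInputFormat.class) and is run with -Dmapred.max.split.size=<bytes> to cap how much data gets packed into each combined split.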
On Wed, Jun 22, 2011 at 11:26 AM, Harsh J <ha...@cloudera.com> wrote:
> Mapred,
>
> This should be doable if you are using TextInputFormat (or other
> FileInputFormat derivatives that do not override getSplits()
> behaviors).
>
> Try this:
> jobConf.setLong("mapred.min.split.size", <byte size you want each
> mapper split to try to contain, i.e. 1 GB in bytes (long)>);
>
> This would get you splits of about the size you mention, 1 GB or so,
> and you should have outputs fairly near to 1 GB when you do the
> sequence file conversion (lower at times due to serialization and
> compression being applied). You can play around with the parameter
> until the results are satisfactory.
>
> Note: Tasks would no longer be perfectly data local, since you're
> perhaps requesting splits much larger than the block size.
>
> On Wed, Jun 22, 2011 at 10:52 PM, Mapred Learn <mapred.le...@gmail.com> wrote:
> > I have a use case where I want to process data and generate seq file
> > output of fixed size, say 1 GB, i.e. each map-reduce job output should
> > be 1 GB.
> >
> > Does anybody know of any -D option or any other way to achieve this?
> >
> > -Thanks JJ
>
> --
> Harsh J
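For the archives, a minimal driver sketch of the suggestion above, assuming the old (org.apache.hadoop.mapred) API and a map-only text-to-SequenceFile conversion; only the mapred.min.split.size property comes from this thread, the class and job names are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

public class SeqFileConvert {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SeqFileConvert.class);
    conf.setJobName("text-to-seqfile");

    // Plain TextInputFormat (its getSplits() honors the min split size),
    // SequenceFile output, and key/value types matching the default
    // IdentityMapper over text input (LongWritable offset, Text line).
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // Map-only job: one output sequence file per mapper, so each output
    // file covers roughly one input split.
    conf.setNumReduceTasks(0);

    // Ask getSplits() for splits of at least 1 GB, per Harsh's suggestion,
    // so each mapper's output lands near 1 GB (minus serialization and
    // compression effects).
    conf.setLong("mapred.min.split.size", 1024L * 1024 * 1024);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Run with something like: hadoop jar myjob.jar SeqFileConvert /input /output, then tune the split size until the output file sizes look right.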