The problem with the first option is that even if the file is uploaded as 1 GB, the output will not be 1 GB (it would depend on compression). So, some trial runs need to be done to estimate what size the input file should be uploaded at to get 1 GB of output.
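For example, a rough way to measure that ratio after one trial run would be something like the sketch below (my own illustration, not something from this thread; the HDFS paths are just placeholders):

    // Measures the output-to-input size ratio of a trial conversion run and
    // estimates how much input to upload per 1 GB of sequence-file output.
    // Paths below are hypothetical placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class EstimateInputSize {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        long inputBytes  = fs.getContentSummary(new Path("/user/jj/trial-input")).getLength();
        long outputBytes = fs.getContentSummary(new Path("/user/jj/trial-output")).getLength();

        double ratio = (double) outputBytes / inputBytes;   // compression/serialization factor
        long targetOutput = 1L << 30;                       // 1 GB of output
        long suggestedInput = (long) (targetOutput / ratio);

        System.out.printf("ratio %.3f -> upload about %d bytes of input per 1 GB of output%n",
            ratio, suggestedInput);
      }
    }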
For block size, I got your point. I think I said the same thing in terms of file splits.

On Wed, Jun 22, 2011 at 11:46 AM, Harsh J <ha...@cloudera.com> wrote:
> CombineFileInputFormat should help with doing some locality, but it
> would not be as perfect as having the file loaded to the HDFS itself
> with a 1 GB block size (block sizes are per file properties, not
> global ones). You may consider that as an alternative approach.
>
> I do not get (ii). I meant by my last sentence the same thing I've
> explained just above here. If your block size is 64 MB, and your
> request splits of 1 GB (via plain FileInputFormat), then even the 64
> MB read can't be guaranteed local (theoretically speaking).
>
> On Thu, Jun 23, 2011 at 12:04 AM, Mapred Learn <mapred.le...@gmail.com> wrote:
> > Hi Harsh,
> > Thanks !
> > i) I was currently doing it by extending CombineFileInputFormat and
> > specifying -Dmapred.max.split.size but this increases job finish time by
> > about 3 times.
> > ii) since you said this file output size is going to be greater than block
> > size in this case. What happens in case when people have input split of say
> > 1 Gb and map-red output is produced as 400 MB. In this case also, size is
> > greater than block size ? Or did you mean that since mapper will get
> > multiple input files as input split, the data input to mapper won't be local ?
> >
> > On Wed, Jun 22, 2011 at 11:26 AM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Mapred,
> >>
> >> This should be doable if you are using TextInputFormat (or other
> >> FileInputFormat derivatives that do not override getSplits()
> >> behaviors).
> >>
> >> Try this:
> >> jobConf.setLong("mapred.min.split.size", <byte size you want each
> >> mapper split to try to contain, i.e. 1 GB in bytes (long)>);
> >>
> >> This would get you splits worth the size you mention, 1 GB or else,
> >> and you should have outputs fairly near to 1 GB when you do the
> >> sequence file conversion (lower at times due to serialization and
> >> compression being applied). You can play around with the parameter
> >> until the results are satisfactory.
> >>
> >> Note: Tasks would no longer be perfectly data local since you're
> >> requesting much > block size perhaps.
> >>
> >> On Wed, Jun 22, 2011 at 10:52 PM, Mapred Learn <mapred.le...@gmail.com> wrote:
> >> > I have a use case where I want to process data and generate seq file
> >> > output of fixed size, say 1 GB i.e. each map-reduce job output should be 1 Gb.
> >> >
> >> > Does anybody know of any -D option or any other way to achieve this ?
> >> >
> >> > -Thanks JJ
> >>
> >> --
> >> Harsh J
>
> --
> Harsh J
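For reference, a minimal driver sketch of the mapred.min.split.size approach Harsh describes above (old "mapred" API, map-only text-to-SequenceFile conversion; the class name and paths are placeholders I made up):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class SeqFileConvert {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(SeqFileConvert.class);
        job.setJobName("text-to-seqfile");

        // Ask for ~1 GB input splits; each map's output should then come out
        // near 1 GB (usually a bit less once compression is applied).
        job.setLong("mapred.min.split.size", 1024L * 1024 * 1024);

        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);   // key type from TextInputFormat
        job.setOutputValueClass(Text.class);         // value type from TextInputFormat
        job.setNumReduceTasks(0);                    // map-only conversion

        FileInputFormat.setInputPaths(job, new Path("/user/jj/text-input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/jj/seq-output"));

        JobClient.runJob(job);
      }
    }

With splits that much larger than the block size, the maps won't be fully data local, which is the trade-off Harsh points out; loading the input to HDFS with a 1 GB per-file block size at upload time is the alternative that keeps the reads local.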