Hi Cristian,

Is your main concern the number of output files that you're getting, or the fact that 40 mappers are being started up to process 5 GB of data?
If you're more worried about the number of mappers, this is controlled by the number of input splits, which is (by default) determined by the number of HDFS blocks in the files that you're processing. MapReduce will start up one mapper per HDFS block of data by default, so assuming that your block size is 128 MB, that works out to around 40 mappers (i.e. 5 GB / 128 MB = 40).

The two options for reducing the number of mappers being run are:

* use bigger block sizes on HDFS
* set the mapred.min.split.size setting in your configuration to something larger than 128 MB (there's a rough sketch of this option at the bottom of this message, below your quoted mail)

As this comes down to the underlying MapReduce libraries, this page on the Hadoop wiki may also be helpful: http://wiki.apache.org/hadoop/HowManyMapsAndReduces

- Gabriel

On Tue, Nov 11, 2014 at 7:09 PM, Cristian Giha <cristian.g...@equifax.com> wrote:
> Hi all,
>
> I am working with Apache Crunch 0.9.0 and Hadoop YARN.
> I have a DoFn that reads an Avro file, changes some values of an Avro
> GenericRecord, and returns it via the emitter object.
> After the DoFn call I use the Pipeline to write the final collection as Avro
> into HDFS.
>
> My problem is that I am processing a lot of Avro files of 2 or 3 GB each,
> but for each processed file Crunch generates a large number of mappers.
> For example, for 2 files of approximately 2.5 GB each, Crunch generates 40 map
> tasks, and the final output is 40 files in HDFS.
>
> My code does something like this:
>
> DoFn process code:
>
> @Override
> public void process(Record record, Emitter<Record> emitter)
> {
>     avroProtector.protect(record);
>     emitter.emit(record);
> }
>
> MAIN CODE:
>
> // Initialize objects
> PCollection<Record> avroCreditRecords =
>     pipeline.read(From.avroFile(avroFile, avroObject));
>
> FnTokenizeCollection toTokenizedColl =
>     new FnTokenizeCollection(tokenizerSchema.toString());
>
> PCollection<Record> TokenizedData =
>     avroCreditRecords.parallelDo(toTokenizedColl, Avros.generics(dataSchema));
>
> //TokenizedData.write(To.avroFile(outputDir));
>
> pipeline.write(TokenizedData, To.avroFile(outputDir));
>
> PipelineResult result = pipeline.done();
>
> return result.succeeded() ? 0 : 1;
>
>
> Can someone help me with that?
> Regards
>
> Cristian Giha Sepúlveda | Development engineer intermediate | Data & Analytic
> Team
> Office: +1 866 2 444 72 19 | cristian.g...@equifax.com
> Equifax Chile | Isidora Goyenechea 2800, Las Condes, Santiago.
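
Here's the rough sketch I mentioned above for the second option. It assumes you're constructing the MRPipeline yourself and passing it a Configuration; YourDriverClass is just a placeholder for whatever class you use as the pipeline's jar class, and the 1 GB value is only an example:

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Ask MapReduce for input splits of at least 1 GB instead of one split
    // per 128 MB HDFS block (the value is in bytes). On newer Hadoop versions
    // the equivalent property name is mapreduce.input.fileinputformat.split.minsize.
    conf.setLong("mapred.min.split.size", 1024L * 1024L * 1024L);

    // Hand the configuration to Crunch when building the pipeline, so the
    // planned MapReduce jobs pick up the larger minimum split size.
    Pipeline pipeline = new MRPipeline(YourDriverClass.class, conf);

Since your read -> parallelDo -> write pipeline should be planned as a map-only job, each map task writes its own output part file, so fewer splits should also mean fewer output files in HDFS.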