Hi Cristian,

Is your main concern the number of output files that you're getting, or the fact that 40 mappers are being started up to process 5 GB of data?
If you're more worried about the number of mappers, this is controlled by the number of input splits, which is (by default) determined by the number of HDFS blocks in the files that you're processing. MapReduce will start up one mapper per HDFS block of data by default, so assuming that your block size is 128 MB, that works out to around 40 mappers (i.e. 5 GB / 128 MB = 40).

The two options for reducing the number of mappers being run are:

* use bigger block sizes on HDFS
* set the mapred.min.split.size setting in your configuration to something larger than 128 MB (there's a rough sketch of this option at the bottom of this message, below your quoted mail)

As this comes down to the underlying MapReduce libraries, this page on the Hadoop wiki may also be helpful: http://wiki.apache.org/hadoop/HowManyMapsAndReduces

- Gabriel

On Tue, Nov 11, 2014 at 7:09 PM, Cristian Giha <cristian.g...@equifax.com> wrote:
> Hi all,
>
> I am working with Apache Crunch 0.9.0 and Hadoop YARN.
> I have a DoFn that reads an Avro file, changes some values of an Avro
> GenericRecord, and returns it via the emitter object.
> After the DoFn call I use the Pipeline to write the final collection as Avro
> into HDFS.
>
> My problem is that I am processing a lot of Avro files of 2 or 3 GB each,
> but for each processed file Crunch generates a large number of mappers.
> For example, for 2 files of approximately 2.5 GB each, Crunch generates 40 map
> tasks, and the final output is 40 files in HDFS.
>
> My code does something like this:
>
> DoFn process code:
>
> @Override
> public void process(Record record, Emitter<Record> emitter)
> {
>     avroProtector.protect(record);
>     emitter.emit(record);
> }
>
> MAIN CODE:
>
> // Initialize objects
> PCollection<Record> avroCreditRecords =
>     pipeline.read(From.avroFile(avroFile, avroObject));
>
> FnTokenizeCollection toTokenizedColl =
>     new FnTokenizeCollection(tokenizerSchema.toString());
>
> PCollection<Record> TokenizedData =
>     avroCreditRecords.parallelDo(toTokenizedColl, Avros.generics(dataSchema));
>
> //TokenizedData.write(To.avroFile(outputDir));
>
> pipeline.write(TokenizedData, To.avroFile(outputDir));
>
> PipelineResult result = pipeline.done();
>
> return result.succeeded() ? 0 : 1;
>
>
> Can someone help me with that?
> Regards
>
> Cristian Giha Sepúlveda | Development engineer intermediate | Data & Analytic
> Team
> Office: +1 866 2 444 72 19 | cristian.g...@equifax.com
> Equifax Chile | Isidora Goyenechea 2800, Las Condes, Santiago.
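
Here's the rough sketch I mentioned above for the second option. It assumes you're constructing the MRPipeline yourself and passing it a Configuration; YourDriverClass is just a placeholder for whatever class you use as the pipeline's jar class, and the 1 GB value is only an example:

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Ask MapReduce for input splits of at least 1 GB instead of one split
    // per 128 MB HDFS block (the value is in bytes). On newer Hadoop versions
    // the equivalent property name is mapreduce.input.fileinputformat.split.minsize.
    conf.setLong("mapred.min.split.size", 1024L * 1024L * 1024L);

    // Hand the configuration to Crunch when building the pipeline, so the
    // planned MapReduce jobs pick up the larger minimum split size.
    Pipeline pipeline = new MRPipeline(YourDriverClass.class, conf);

Since your read -> parallelDo -> write pipeline should be planned as a map-only job, each map task writes its own output part file, so fewer splits should also mean fewer output files in HDFS.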