Hi all, I am working with Apache Crunch 0.9.0 and Hadoop YARN. I have a DoFn that reads an Avro file, changes some values of an Avro GenericRecord, and returns the record through the emitter. After the DoFn call I use the Pipeline to write the final collection as Avro into HDFS.
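For reference, the DoFn class (FnTokenizeCollection, used in the snippets further down) is laid out roughly like this. Only the process method is verbatim from my code; the AvroProtector type name, the constructor, and the initialize() step are a simplified sketch of the real class:

    import org.apache.avro.generic.GenericData.Record;
    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    public class FnTokenizeCollection extends DoFn<Record, Record> {

        private final String tokenizerSchema;
        private transient AvroProtector avroProtector; // placeholder name for the protection helper

        public FnTokenizeCollection(String tokenizerSchema) {
            this.tokenizerSchema = tokenizerSchema;
        }

        @Override
        public void initialize() {
            // Build the non-serializable helper on each task rather than in the constructor
            avroProtector = new AvroProtector(tokenizerSchema);
        }

        @Override
        public void process(Record record, Emitter<Record> emitter) {
            // Mask/tokenize the sensitive fields in place and emit the record
            avroProtector.protect(record);
            emitter.emit(record);
        }
    }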
My problem is that I am processing many Avro files of 2 or 3 GB each, and for each processed file Crunch generates a large number of mappers. For example, for 2 files of roughly 2.5 GB each, Crunch generates 40 map tasks and the output ends up as 40 files in HDFS.

My code does something like this:

DoFn process code:

    @Override
    public void process(Record record, Emitter<Record> emitter) {
        avroProtector.protect(record);
        emitter.emit(record);
    }

Main code:

    // Initialize objects
    PCollection<Record> avroCreditRecords = pipeline.read(From.avroFile(avroFile, avroObject));
    FnTokenizeCollection toTokenizedColl = new FnTokenizeCollection(tokenizerSchema.toString());
    PCollection<Record> tokenizedData = avroCreditRecords.parallelDo(toTokenizedColl, Avros.generics(dataSchema));
    // tokenizedData.write(To.avroFile(outputDir));
    pipeline.write(tokenizedData, To.avroFile(outputDir));
    PipelineResult result = pipeline.done();
    return result.succeeded() ? 0 : 1;

Can someone help me with this?

Regards,

Cristian Giha Sepúlveda | Development Engineer Intermediate | Data & Analytic Team
Office: +1 866 2 444 72 19 | cristian.g...@equifax.com
Equifax Chile | Isidora Goyenechea 2800, Las Condes, Santiago.