Hi all,

I am working with Apache Crunch 0.9.0 and Hadoop YARN.
I read an Avro file, use a DoFn to change some values of each Avro GenericRecord, and return it through the emitter object. After the DoFn call I use the Pipeline to write the final collection as Avro into HDFS.

My problem is that I am processing many Avro files of 2 or 3 GB each, and for each processed file Crunch generates a large number of mappers. For example, for 2 files of roughly 2.5 GB each, Crunch generates 40 map tasks, and the final output is 40 files in HDFS.
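
If I understand it correctly, in a map-only job there is one map task per input split and one output file per map task, so with the default 128 MB split size the ~5 GB of input would explain the 40 mappers and 40 output files. The only knob I have found so far is the standard Hadoop split-size property; the snippet below is just a sketch of that idea, with an illustrative 1 GB value, not something from my actual job:

    // Sketch only: raise the minimum split size so fewer, larger splits
    // (and therefore fewer mappers / output files) are created.
    // The property name is the standard Hadoop 2 one; 1 GB is an example value.
    pipeline.getConfiguration().setLong(
            "mapreduce.input.fileinputformat.split.minsize",
            1024L * 1024 * 1024);

Is raising the split size the right approach here, or does Crunch have its own way to control the number of map tasks and output files?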

My code does something like this:

DoFn process code:

    @Override
    public void process(Record record, Emitter<Record> emitter)
    {
        avroProtector.protect(record);
        emitter.emit(record);
    }
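
For completeness, the surrounding class is shaped roughly like this; the AvroProtector type and its initialization are simplified stand-ins for the real protection logic, and only the class name, constructor argument, and process method match the real code:

    // Simplified shape of the DoFn; AvroProtector and its construction
    // are stand-ins for the real protection logic.
    public class FnTokenizeCollection extends DoFn<Record, Record>
    {
        private final String tokenizerSchema;
        private transient AvroProtector avroProtector;

        public FnTokenizeCollection(String tokenizerSchema)
        {
            this.tokenizerSchema = tokenizerSchema;
        }

        @Override
        public void initialize()
        {
            // Rebuild the non-serializable protector on each map task
            avroProtector = new AvroProtector(tokenizerSchema);
        }

        // process(Record, Emitter<Record>) as shown above
    }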

MAIN CODE:
    // Initialize objects
    PCollection<Record> avroCreditRecords =
            pipeline.read(From.avroFile(avroFile, avroObject));

    FnTokenizeCollection toTokenizedColl =
            new FnTokenizeCollection(tokenizerSchema.toString());

    PCollection<Record> tokenizedData =
            avroCreditRecords.parallelDo(toTokenizedColl, Avros.generics(dataSchema));

    //tokenizedData.write(To.avroFile(outputDir));

    pipeline.write(tokenizedData, To.avroFile(outputDir));

    PipelineResult result = pipeline.done();

    return result.succeeded() ? 0 : 1;
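
For reference, the pipeline is a standard MRPipeline created roughly like this (the driver class name is just a placeholder; getConf() comes from the driver extending Configured):

    // Assumed setup, not shown above: an MRPipeline built from the driver's
    // Configuration. TokenizeDriver is a placeholder class name.
    Pipeline pipeline = new MRPipeline(TokenizeDriver.class, getConf());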



Can someone help me with this?
Regards,

Cristian Giha Sepúlveda | Development engineer intermediate | Data & Analytic Team
Office: +1 866 2 444 72 19 | cristian.g...@equifax.com
Equifax Chile | Isidora Goyenechea 2800, Las Condes, Santiago.

