On 02/23/2015 06:24 PM, Chandan Biswas wrote:
Hello All,
I am new to Parquet, so please forgive me if this question has been asked
before. I am considering using Parquet for a project and have been
experimenting with it. As a simple test, I read Avro objects from HDFS and
wrote them back to HDFS in Parquet file format, using a Crunch pipeline. I
noticed that the pipeline requires more heap when writing Parquet. Without
the Parquet file format, the memory settings were 2 GB heap and 4 GB
virtual. After switching to Parquet, the pipeline needs 8 GB heap and 10 GB
virtual. If I give it less memory, the task throws a heap error. I haven't
changed any other settings. I was using a complex, multilevel nested Avro
object, and there were 150k records in total. Here is the code snippet:

         final Pipeline pipeline =
                 new MRPipeline(ParquetTest.class, "ParquetTestPipeline", config);

         final PCollection<Person> persons = pipeline.read(
                 From.avroFile(new Path("..hadfs source Path..."),
                         Avros.records(Person.class)));

         final AvroParquetFileTarget parquetFileTarget =
                 new AvroParquetFileTarget("..hadfs target Path...");
         pipeline.write(persons, parquetFileTarget);

         pipeline.done();

I was using Parquet version 1.4.1.

My question is: why does the pipeline take more memory when using the Parquet
file format? Is it because creating a row group requires all of its records
to be held in heap? Or am I doing something wrong?

Thanks,
*Chandan Biswas*


Hi Chandan,

You're right: the added memory consumption is caused by buffering records up to the row group size. I recently wrote a blog post explaining this:

  http://ingest.tips/2015/01/31/parquet-row-group-size/
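
To make the buffering cost concrete, here is a rough back-of-the-envelope sketch in plain Java. It assumes the parquet-mr default row group size of 128 MB and that each open output file buffers up to one full row group. The class and method names are hypothetical, made up for illustration, and the result is only a lower bound: real heap use is higher because of object overhead and encoding buffers.

```java
public class RowGroupHeapEstimate {
    // Assumption: parquet-mr's default row group ("block") size of 128 MB.
    public static final long DEFAULT_ROW_GROUP_BYTES = 128L * 1024 * 1024;

    /**
     * Rough lower bound on the bytes one task buffers when it has
     * openFiles Parquet files open, each filling one row group.
     */
    public static long estimateBufferedBytes(int openFiles, long rowGroupBytes) {
        if (openFiles < 0 || rowGroupBytes < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // multiplyExact throws instead of silently overflowing.
        return Math.multiplyExact((long) openFiles, rowGroupBytes);
    }

    public static void main(String[] args) {
        // Compare a task writing one file at a time with tasks that
        // keep several files open at once.
        for (int openFiles : new int[] {1, 4, 16}) {
            long bytes = estimateBufferedBytes(openFiles, DEFAULT_ROW_GROUP_BYTES);
            System.out.println(openFiles + " open file(s): at least "
                    + bytes / (1024 * 1024) + " MB buffered");
        }
    }
}
```

Under these assumptions, a task with 16 files open at once would buffer about 2 GB before anything is flushed, which is why writing one file at a time per task matters.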

You should try to write data from your final PCollection so that each reducer produces one file (or one file at a time). You can see an example of "repartitioning" your PCollection here:


https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-crunch/src/main/java/org/kitesdk/data/crunch/CrunchDatasets.java#L196

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.
