On 03/24/2015 07:20 AM, Stephen Carman wrote:
Hello,

I'm looking for guidance on tuning parquet's memory usage as well as how it
generates partitions of data. Can anyone point me in the correct direction
on how to tune these, or specifically on programmatic methods of
generating partitions?

Thanks,
Steve Carman


Hi Steve,

I recently wrote a blog post covering the basics of the Parquet row group size. It's here:

  http://ingest.tips/2015/01/31/parquet-row-group-size/

Partitioning is mostly outside the scope of the format itself, because it requires you to separate data into partitions in your own processing.
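
As a rough illustration of what "separating data into partitions in your processing" can look like, here is a minimal Python sketch. The record layout, the `timestamp` field, and the year/month/day scheme are all assumptions made for the example, not something the Parquet format prescribes. The `key=value` directory naming follows the layout Hive uses for partitions. Each resulting group would then be written as its own Parquet file under the derived directory.

```python
from collections import defaultdict
from datetime import datetime


def partition_path(record):
    """Derive a Hive-style partition directory from a record's timestamp.

    The "timestamp" field name and ISO-8601 format are assumptions
    for this sketch.
    """
    ts = datetime.fromisoformat(record["timestamp"])
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"


def group_by_partition(records):
    """Group records by partition directory, so that each group can be
    written to a separate file under that directory."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_path(record)].append(record)
    return dict(groups)


# Hypothetical sample records.
records = [
    {"timestamp": "2015-03-24T07:20:00", "user": "a"},
    {"timestamp": "2015-03-24T09:05:00", "user": "b"},
    {"timestamp": "2015-01-31T12:00:00", "user": "c"},
]

for path, rows in sorted(group_by_partition(records).items()):
    print(path, len(rows))
```

The point of the sketch is only that partition values are computed from the data by your application (or by a tool acting on its behalf) before anything is written. The file format never sees the partitioning logic.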

There are a couple of off-the-shelf ways to partition your data. The most popular is Hive, where you use its SQL-like language to specify how to derive partition values in your insert statements. Another option is Kite (kitesdk.org), a library you can include in your application, which will partition the data for you based on a config file.

rb

(By the way, I work on Kite as well)

--
Ryan Blue
Software Engineer
Cloudera, Inc.
