On 03/24/2015 07:20 AM, Stephen Carman wrote:
Hello,
I'm looking for guidance on tuning Parquet's memory usage, as well as on how
it generates partitions of data. Can anyone point me in the right direction
on how to tune these, or specifically on programmatic methods of
generating partitions?
Thanks,
Steve Carman
Hi Steve,
I recently wrote a blog post that covers the basics of Parquet row group
size. It's here:
http://ingest.tips/2015/01/31/parquet-row-group-size/
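
As a rough illustration of why row group size matters for memory, here is a
small Python sketch. It assumes that each open Parquet writer buffers about
one row group in memory before flushing it, and that the row group size is
the value set through the parquet.block.size property (128 MB is used as
the default here). The function names are made up for this example and are
not part of any Parquet API, so treat the numbers as a back-of-the-envelope
estimate only.

```python
# Back-of-the-envelope estimate, not a Parquet API.
# Assumption: each open writer buffers roughly one row group in memory.

# Assumed default row group size (parquet.block.size), in bytes.
DEFAULT_ROW_GROUP_BYTES = 128 * 1024 * 1024


def estimate_writer_memory(open_files, row_group_bytes=DEFAULT_ROW_GROUP_BYTES):
    """Approximate bytes buffered when `open_files` writers are open at once."""
    if open_files < 0 or row_group_bytes <= 0:
        raise ValueError("open_files must be >= 0 and row_group_bytes > 0")
    return open_files * row_group_bytes


def max_row_group_bytes(memory_budget_bytes, open_files):
    """Largest row group size that keeps `open_files` writers within budget."""
    if open_files <= 0:
        raise ValueError("open_files must be > 0")
    return memory_budget_bytes // open_files
```

For example, a task that writes to 4 partitions at once with 128 MB row
groups would buffer roughly 512 MB under this assumption, which is why
writing to many partitions at the same time usually calls for a smaller row
group size or fewer open files.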
Partitioning is mostly outside the scope of the format itself, because it
requires you to separate data into partitions in your own processing.
There are a couple of off-the-shelf ways to partition your data. The
most popular is Hive, where you specify in its SQL-like language how to
derive partition values in your insert statements. Another option you
can use is Kite (kitesdk.org), which will partition the data for you
based on a config file. Kite is a library you can include in your
application.
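
If you want to see what the separation step looks like when done by hand,
here is a minimal Python sketch. It is not how Hive or Kite implement it.
It assumes each record is a dict with a "timestamp" field holding epoch
seconds, and it derives year/month partition values laid out in the
key=value directory style that Hive uses. The record shape and the function
names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_path(record):
    """Derive a Hive-style partition directory from a record's timestamp.

    Assumes record["timestamp"] is epoch seconds, interpreted as UTC.
    """
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return f"year={ts.year}/month={ts.month:02d}"


def partition_records(records):
    """Group records by partition directory, preserving input order."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_path(record)].append(record)
    return dict(groups)
```

Each group would then be written to its own Parquet file under the matching
directory. Keep in mind that every partition being written at the same time
holds its own buffered row group, so this ties back to the memory question.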
rb
(By the way, I work on Kite as well)
--
Ryan Blue
Software Engineer
Cloudera, Inc.