On 04/13/2015 02:21 PM, Tianqi Tong wrote:
Hi Ryan,
Thanks for the reply!
The post was very useful for understanding the relationship between Parquet
block size and HDFS block size.
I'm currently migrating an RCFile table to a Parquet table. Right now I'm
partitioning by month and by the prefix of a column, and I have over 500,000
partitions in total. Does it hurt performance to have that many partitions?
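For concreteness, the layout is roughly the following (the table and column
names here are placeholders, not my real schema):

    CREATE TABLE events_parquet (
      id BIGINT,
      name STRING,
      payload STRING
    )
    PARTITIONED BY (month STRING, name_prefix STRING)
    STORED AS PARQUET;

The number of months times the number of prefix values per month is what
pushes the total past 500,000.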
Thank you!
Tianqi
I'm glad it was helpful.
There's not necessarily anything wrong with 500,000+ partitions, but that
number alone isn't enough to judge. Partitioning is always a trade-off.
You want your partitions to hold enough data to avoid the small-files
problem, but you also want them to act as a good index into the data so
queries can skip reading as much of it as possible. In general, I'd make
sure each partition contains at least a few HDFS blocks' worth of data, and
that the files within each partition are at least a full HDFS block. A rough
way to spot undersized partitions is sketched below.
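As a rough, hypothetical check (using the placeholder names from the sketch
above, and assuming 128 MB HDFS blocks with rows on the order of 50 bytes,
so about 2.5 million rows per block -- adjust the threshold for your data):

    -- Partitions holding fewer than ~2 blocks' worth of rows are
    -- candidates for merging into a coarser partitioning scheme.
    SELECT month, name_prefix, COUNT(*) AS row_cnt
    FROM events_parquet
    GROUP BY month, name_prefix
    HAVING COUNT(*) < 5000000;

To keep Parquet row groups aligned with HDFS blocks when you write the
table, you can also set parquet.block.size to match your HDFS block size,
e.g. SET parquet.block.size=134217728; for 128 MB blocks.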
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.