Hi Ryan,

Back to the original topic: it should be okay if I break a Parquet file into multiple HDFS blocks, right? I ask because when I query via Impala, I get a warning like: "Parquet file should not be split into multiple hdfs-blocks."
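
To make the question concrete, this is the kind of size alignment I have in mind (just a sketch, using Hadoop's Configuration API and the dfs.blocksize / parquet.block.size keys from the subject line; 256 MB is only an example value):

    import org.apache.hadoop.conf.Configuration;

    public class BlockSizeAlignment {
        public static void main(String[] args) {
            // Illustrative value only; the point is that both settings agree.
            long blockSize = 256L * 1024 * 1024;   // 256 MB

            Configuration conf = new Configuration();
            // HDFS block size for files written with this configuration
            conf.setLong("dfs.blocksize", blockSize);
            // Parquet row group ("block") size used by parquet-mr writers
            conf.setLong("parquet.block.size", blockSize);

            System.out.println("dfs.blocksize      = " + conf.get("dfs.blocksize"));
            System.out.println("parquet.block.size = " + conf.get("parquet.block.size"));
        }
    }

If I understand correctly, the Impala-side counterpart is the PARQUET_FILE_SIZE query option from the subject line.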
Thanks!
Tianqi

-----Original Message-----
From: Ryan Blue [mailto:[email protected]]
Sent: Monday, April 13, 2015 2:39 PM
To: [email protected]
Subject: Re: PARQUET_FILE_SIZE & parquet.block.size & dfs.blocksize

On 04/13/2015 02:21 PM, Tianqi Tong wrote:
> Hi Ryan,
> Thanks for the reply! The post was very useful for understanding the
> relationship between Parquet block size and HDFS block size.
> I'm currently migrating an RCFile table to a Parquet table. Right now I'm
> partitioning by month and by the prefix of a column, and I have over 500k
> partitions in total. Does it hurt performance to have that many partitions?
>
> Thank you!
> Tianqi

I'm glad it was helpful. There's not necessarily anything wrong with
500,000+ partitions, but I don't think that data point alone is enough.
Partitioning is always a trade-off: you want your partitions to hold enough
data to avoid the small-files problem, but you also want them to be a good
index into the data so you can avoid reading as much as possible. In
general, I'd make sure each partition contains at least a few HDFS blocks'
worth of data, and that the files within it are at least a full HDFS block.

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.
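
(A rough back-of-the-envelope reading of that guideline, assuming a 128 MB
HDFS block size and taking "a few" to mean about three blocks per partition,
both of which are only illustrative:

    500,000 partitions x 3 blocks/partition x 128 MB/block = 192,000,000 MB, roughly 190 TB

so a table much smaller than that would likely leave many of those 500k
partitions as collections of small files.)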
