> You generally want: N * parquet.block.size = dfs.block.size, where N is a whole number.
Thanks, I will probably update my settings to use the same block size for both dfs and parquet.

On Wed, Jan 27, 2016 at 4:25 PM, Ryan Blue <b...@cloudera.com> wrote:

> You generally want: N * parquet.block.size = dfs.block.size, where N is a whole number. Your settings look fine to me. The latest versions of Parquet will use an HDFS block size at least as large as your row group size (parquet.block.size) to avoid spanning blocks. I think we should also default the padding (used only for block-based file systems) to something like 4 or 8 MB, too, though that isn't in master yet.
>
> rb
>
> On 01/27/2016 10:26 AM, Buntu Dev wrote:
>
>> Thanks Ryan. Does that mean we need to set the Parquet block size <= the HDFS block size to keep a row group from spanning HDFS blocks? I used a 256MB Parquet block size with a 128MB HDFS block size and got about a block per file in most cases. Are there any other tools I can use to verify I have the right Parquet settings?
>>
>> Thanks!
>>
>> On Wed, Jan 27, 2016 at 9:42 AM, Ryan Blue <b...@cloudera.com> wrote:
>>
>>> Hi Buntu,
>>>
>>> Each Parquet row group (Parquet block) is the data needed to reconstruct a group of rows. That means row groups are the parts of a file you can process in parallel; you can't divide and process in parallel any further. That makes them analogous to HDFS blocks, and you want to avoid having a row group span HDFS blocks. (We also have a padding setting to ensure they don't span blocks.)
>>>
>>> There is no significant penalty for having multiple row groups in a single HDFS block (assuming they are still large), so you can fit several in if you need to. You might need to in order to keep memory consumption down, because the row group size is the amount of data that will be buffered in memory before flushing to disk. The read side can get away with a bit less memory if you're ignoring columns, but in general the memory consumption per open file is on the order of the row group size.
>>>
>>> Thanks for asking,
>>>
>>> rb
>>>
>>> On 01/27/2016 12:20 AM, Buntu Dev wrote:
>>>
>>>> I'm converting existing Avro data into Parquet using Hive. I have the HDFS block size set to 128MB, and there seems to be a Parquet block size setting that can be configured as well. Does the Parquet block size need to be set to the same value as the HDFS block size?
>>>>
>>>> Thanks!
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
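For reference, a minimal sketch of how the settings discussed above might be applied through the Hadoop Configuration API. The specific values (256 MB HDFS block, 128 MB row group, 8 MB padding) are illustrative assumptions rather than recommendations from the thread, and the `parquet.writer.max-padding` property refers to the padding feature Ryan mentions, which landed in later parquet-mr releases.

```java
import org.apache.hadoop.conf.Configuration;

public class ParquetBlockSizeSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Example HDFS block size: 256 MB.
        // ("dfs.block.size" is the older name; newer Hadoop versions use "dfs.blocksize".)
        long dfsBlockSize = 256L * 1024 * 1024;

        // Example Parquet row group size: 128 MB, so N = 2 row groups per HDFS block.
        long parquetBlockSize = 128L * 1024 * 1024;

        // Sanity-check the rule from the thread:
        // N * parquet.block.size = dfs.block.size, where N is a whole number.
        if (dfsBlockSize % parquetBlockSize != 0) {
            throw new IllegalArgumentException(
                "dfs.block.size should be a whole-number multiple of parquet.block.size");
        }

        conf.setLong("dfs.block.size", dfsBlockSize);
        conf.setLong("parquet.block.size", parquetBlockSize);

        // Padding so a row group does not span HDFS blocks (assumed property name
        // from later parquet-mr versions; not yet in master at the time of this thread).
        conf.setLong("parquet.writer.max-padding", 8 * 1024 * 1024);
    }
}
```

With these example values, two 128 MB row groups fit exactly into each 256 MB HDFS block (N = 2), matching the rule quoted at the top of the thread, while keeping per-writer memory closer to the row group size than to the full HDFS block size.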