> You generally want: N * parquet.block.size = dfs.block.size, where N is a whole number.
Thanks, I will probably update my settings to use the same block size for both dfs and parquet.

On Wed, Jan 27, 2016 at 4:25 PM, Ryan Blue <b...@cloudera.com> wrote:

> You generally want: N * parquet.block.size = dfs.block.size, where N is a whole number. Your settings look fine to me. The latest versions of Parquet will use an HDFS block size at least as large as your row group size (parquet.block.size) to avoid spanning blocks. I think we should also default the padding (used only for block-based file systems) to something like 4 or 8 MB, too, though that isn't in master yet.
>
> rb
>
> On 01/27/2016 10:26 AM, Buntu Dev wrote:
>
>> Thanks Ryan. Does that mean we need to set the Parquet block size <= the HDFS block size to keep a row group from spanning HDFS blocks? I used a 256MB Parquet block size with a 128MB HDFS block size and got about a block per file in most cases. Are there any other tools I can use to verify I have the right Parquet settings?
>>
>> Thanks!
>>
>> On Wed, Jan 27, 2016 at 9:42 AM, Ryan Blue <b...@cloudera.com> wrote:
>>
>>> Hi Buntu,
>>>
>>> Each Parquet row group (Parquet block) is the data needed to reconstruct a group of rows. That means row groups are the parts of a file you can process in parallel; you can't divide and process in parallel any further. That makes them analogous to HDFS blocks, and you want to avoid having a row group span HDFS blocks. (We also have a padding setting to ensure they don't span blocks.)
>>>
>>> There is no significant penalty for having multiple row groups in a single HDFS block (assuming they are still large), so you can fit several in if you need to. You might need to in order to keep memory consumption down, because the row group size is the amount of data that will be buffered in memory before flushing to disk. The read side can get away with a bit less memory if you're ignoring columns, but in general the memory consumption per open file is on the order of the row group size.
>>>
>>> Thanks for asking,
>>>
>>> rb
>>>
>>> On 01/27/2016 12:20 AM, Buntu Dev wrote:
>>>
>>>> I'm converting existing Avro data into Parquet using Hive. I have the HDFS block size set to 128MB, and there seems to be a Parquet block size setting that can be configured as well. Does the Parquet block size need to be set to the same value as the HDFS block size?
>>>>
>>>> Thanks!
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
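For reference, a minimal sketch of how the settings discussed above might be applied through the Hadoop Configuration API. The specific values (256 MB HDFS block, 128 MB row group, 8 MB padding) are illustrative assumptions rather than recommendations from the thread, and the `parquet.writer.max-padding` property refers to the padding feature Ryan mentions, which landed in later parquet-mr releases.

```java
import org.apache.hadoop.conf.Configuration;

public class ParquetBlockSizeSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Example HDFS block size: 256 MB.
        // ("dfs.block.size" is the older name; newer Hadoop versions use "dfs.blocksize".)
        long dfsBlockSize = 256L * 1024 * 1024;

        // Example Parquet row group size: 128 MB, so N = 2 row groups per HDFS block.
        long parquetBlockSize = 128L * 1024 * 1024;

        // Sanity-check the rule from the thread:
        // N * parquet.block.size = dfs.block.size, where N is a whole number.
        if (dfsBlockSize % parquetBlockSize != 0) {
            throw new IllegalArgumentException(
                "dfs.block.size should be a whole-number multiple of parquet.block.size");
        }

        conf.setLong("dfs.block.size", dfsBlockSize);
        conf.setLong("parquet.block.size", parquetBlockSize);

        // Padding so a row group does not span HDFS blocks (assumed property name
        // from later parquet-mr versions; not yet in master at the time of this thread).
        conf.setLong("parquet.writer.max-padding", 8 * 1024 * 1024);
    }
}
```

With these example values, two 128 MB row groups fit exactly into each 256 MB HDFS block (N = 2), matching the rule quoted at the top of the thread, while keeping per-writer memory closer to the row group size than to the full HDFS block size.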