Hi Jayjeet,

Is there a particular reason you need to spread data out into multiple small files? On HDFS, at least, there are longstanding scalability issues with keeping lots of small files around, so the general trend is to concatenate smaller files together. Even with larger files, the common query engines (Spark, MapReduce, Hive, Impala, etc.) can all parallelize reads by blocks, which, when configured properly, should correspond to Parquet row groups.
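To make the row-group sizing concrete, here is a minimal sketch of how a writer decides where one row group ends and the next begins: buffer rows until the estimated size reaches a target (analogous to parquet.block.size), then flush. This is an illustration only, not parquet-mr's actual logic, and the function name and sizes are my own assumptions:

```python
# Simplified sketch of row-group flushing: buffer rows and start a new
# row group once the buffered size reaches a target byte count.
# Illustrative only -- not parquet-mr's real implementation.

def split_into_row_groups(row_sizes, target_bytes):
    """Group consecutive rows so each group's total size reaches
    target_bytes (except possibly the last, shorter group)."""
    groups, current, current_bytes = [], [], 0
    for size in row_sizes:
        current.append(size)
        current_bytes += size
        if current_bytes >= target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        groups.append(current)  # trailing partial row group
    return groups

# Ten 40-byte rows with a 128-byte target: two full groups of 4 rows,
# then a short trailing group of 2 rows.
print(split_into_row_groups([40] * 10, 128))
# -> [[40, 40, 40, 40], [40, 40, 40, 40], [40, 40]]
```

Note that because a row group must contain whole rows, the flushed group usually overshoots the target slightly rather than landing on an exact byte count.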
The size of a row group is controlled by the parquet.block.size setting. You mentioned alignment: fairly early on, a padding feature was added to Parquet so that row groups try to end on the true HDFS block boundaries, avoiding the need to read across blocks when accessing a row group. (Because row groups have to contain full rows, it is unlikely a row group will end with exactly the right number of bytes to match the end of the HDFS block.) See https://github.com/apache/parquet-mr/pull/211

As for your specific proposal: it currently isn't possible to detach the footer from the file that contains the actual data in the row groups. I think that is a good property, though; it means everything needed to read that data is fully contained in one file that can be moved/renamed safely.

There are some systems that elect to write only a single row group per file, because HDFS doesn't allow rewriting data in place. Doing this enables use cases where individual rows need to be deleted or updated: those changes can be accomplished by rewriting smaller files, instead of reading in and writing back out a large file containing many row groups when only a single row group's data has changed.

- Jason

On Wed, Dec 30, 2020 at 2:33 PM Jayjeet Chakraborty < [email protected]> wrote:

> Hi all,
>
> I am trying to figure out if a large Parquet file can be striped across
> multiple small files based on a Row group chunk size where each stripe
> would naturally end up containing data pages from a single row group. So,
> if I say my writer "write a parquet file in chunks of 128 MB (assuming my
> row groups are of around 128MB), each of my chunks ends up being
> self-contained row group, maybe except the last chunk which has the footer
> contents. Is this possible? Can we fix the row group size (the amount of
> disk space a row group uses) while writing parquet files ? Thanks a lot.
>
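The padding behavior from parquet-mr PR 211 that Jason describes can be sketched with some simple arithmetic: if the space left in the current HDFS block is small, pad to the boundary so the next row group starts on a fresh block; otherwise, don't waste the space. This is a sketch under assumed names and an assumed max-padding threshold, not parquet-mr's actual constants or code:

```python
# Sketch of the block-alignment padding idea: when the remaining space
# in the current HDFS block is small, pad up to the boundary so the next
# row group begins on a fresh block and can be read without crossing
# blocks. Names and the 8 MiB threshold are illustrative assumptions.

def padding_needed(current_offset, hdfs_block_size, max_padding):
    """Bytes of padding to write before starting the next row group."""
    remaining = hdfs_block_size - (current_offset % hdfs_block_size)
    if remaining == hdfs_block_size:
        return 0  # already exactly on a block boundary
    # Pad only if the leftover space is small enough to be worth wasting.
    return remaining if remaining <= max_padding else 0

BLOCK = 128 * 1024 * 1024   # 128 MiB HDFS block size
MAX_PAD = 8 * 1024 * 1024   # assumed cap on padding per boundary

# 126 MiB written: only 2 MiB left in the block, so pad those 2 MiB.
print(padding_needed(126 * 1024**2, BLOCK, MAX_PAD))   # -> 2097152
# 100 MiB written: 28 MiB left, too much to waste, so write no padding
# and let the row group spill into the next block.
print(padding_needed(100 * 1024**2, BLOCK, MAX_PAD))   # -> 0
```

The trade-off the real feature makes is the same as in this sketch: a little wasted space inside the file in exchange for row-group reads that stay within a single HDFS block.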
