It seems like you would be best off writing out N separate Parquet files of
the desired size. That seems better than having N files with one row group
each plus a shared footer that you have to stitch together at read time.
There would be a small amount of redundancy across the footers, since each
file repeats the schema and its own row group metadata, but that shouldn't
amount to much in the scheme of things. And if you keep partial Parquet
files without a footer, you lose the self-describing/self-contained nature
of the format, as Jason said.

I'm not sure whether parquet-mr (or whatever you're using to write Parquet)
has a built-in option to start a new file at each row group boundary, but
that approach seems like it would solve your problem.
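
For what it's worth, rolling the writer by hand isn't much code. Below is a
rough sketch using parquet-mr's example writer; it is untested and written
from memory (in particular withRowGroupSize() and getDataSize(), and the toy
schema and file names are made up for illustration), so treat it as an
outline and check it against whichever parquet-mr version you are on:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class RollingParquetWriter {
        // target size for each output file, i.e. roughly one 128 MB row group per file
        static final long TARGET_BYTES = 128L * 1024 * 1024;

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            MessageType schema = MessageTypeParser.parseMessageType(
                    "message rec { required int64 id; required binary payload (UTF8); }");
            SimpleGroupFactory rows = new SimpleGroupFactory(schema);

            int part = 0;
            ParquetWriter<Group> writer = open(conf, schema, part);
            for (long id = 0; id < 10_000_000L; id++) {
                writer.write(rows.newGroup().append("id", id).append("payload", "some value"));
                // getDataSize() reports bytes already written plus bytes buffered for
                // the current row group; once it crosses the target, close this file
                // and start the next one, so each file ends up with one row group.
                if (writer.getDataSize() >= TARGET_BYTES) {
                    writer.close();
                    writer = open(conf, schema, ++part);
                }
            }
            writer.close();
        }

        static ParquetWriter<Group> open(Configuration conf, MessageType schema, int part)
                throws Exception {
            return ExampleParquetWriter.builder(new Path(String.format("part-%05d.parquet", part)))
                    .withConf(conf)
                    .withType(schema)
                    // row group size == target file size, so a file never needs a second row group
                    .withRowGroupSize((int) TARGET_BYTES)
                    .build();
        }
    }

Setting the row group size equal to the roll-over threshold means a file is
closed as soon as one row group's worth of data has been buffered, so each
output file stays fully self-contained: data, footer, and all.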





On Wed, Dec 30, 2020 at 1:09 PM Jason Altekruse <[email protected]>
wrote:

> Hi Jayjeet,
>
> Is there a particular reason that you need to spread the data out into
> multiple small files? On HDFS at least there are longstanding scalability
> issues with having lots of small files around, so the general trend is to
> concatenate smaller files together. Even with larger files, the common
> query engines (Spark, MapReduce, Hive, Impala, etc.) can all parallelize
> reads by block, which, when configured properly, should correspond to
> Parquet row groups.
>
> The size of a row group is controlled by the parquet.block.size setting.
> You mentioned alignment: fairly early on, a padding feature was added to
> Parquet so that row groups try to end exactly on HDFS block boundaries,
> which avoids having to read across two blocks when accessing a row group
> (because row groups have to contain whole rows, it is unlikely that a row
> group will end with exactly the right number of bytes to match the end of
> the HDFS block).
>
> https://github.com/apache/parquet-mr/pull/211
>
> So, to your specific proposal: it currently isn't possible to detach the
> footer from the file that contains the actual row group data, but I think
> that is a good property. It means everything needed to read that data is
> fully contained in one file that can be moved or renamed safely.
>
> There are some systems that elect to write only a single row group per
> file, because HDFS doesn't allow rewriting data in place. Doing so lets
> them delete or update individual rows by rewriting small files, instead of
> reading in and writing back out a large file containing many row groups
> when only a single row group's data has changed.
>
> - Jason
>
>
> On Wed, Dec 30, 2020 at 2:33 PM Jayjeet Chakraborty <
> [email protected]> wrote:
>
> > Hi all,
> >
> > I am trying to figure out whether a large Parquet file can be striped
> > across multiple small files based on a row group chunk size, so that each
> > stripe naturally ends up containing the data pages of a single row group.
> > That is, if I tell my writer "write a Parquet file in chunks of 128 MB"
> > (assuming my row groups are around 128 MB), each of my chunks ends up
> > being a self-contained row group, except maybe the last chunk, which also
> > holds the footer contents. Is this possible? Can we fix the row group size
> > (the amount of disk space a row group uses) while writing Parquet files?
> > Thanks a lot.
> >
>
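
P.S. On the parquet.block.size / padding point in the quoted mail above: if
you drive the writer through the Hadoop Configuration rather than the builder
API, the relevant knobs look roughly like this (again from memory, so verify
the property names against your parquet-mr version); the resulting
Configuration is what you would hand to the writer via withConf():

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // target row group size in bytes; the writer flushes a row group once the
    // data buffered for it reaches roughly this size
    conf.setLong("parquet.block.size", 128L * 1024 * 1024);
    // maximum padding the writer may insert so that a row group ends on an
    // HDFS block boundary instead of straddling two blocks
    conf.setInt("parquet.writer.max-padding", 8 * 1024 * 1024);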
