It seems like you would be best off writing out N separate parquet files of the desired size. That seems better than having N files with one row group each and a shared footer that you have to stitch together to read. I guess there would be a small amount of redundancy between footer contents, but that wouldn't count for much in the scheme of things. If you have partial parquet files without a footer, you lose the self-describing/self-contained nature of Parquet files, like Jason said.
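Just to make the N-separate-files idea concrete, here's a rough sketch of doing it by hand with pyarrow, purely as an illustration (the input path and the rows-per-file number are made up; you'd tune the latter so each output file lands near your ~128 MB target):

```python
import pyarrow.parquet as pq

# Illustrative only: "big.parquet" and rows_per_file are made-up values;
# tune rows_per_file so each output file lands near the ~128 MB target.
table = pq.read_table("big.parquet")
rows_per_file = 1_000_000

for i, start in enumerate(range(0, table.num_rows, rows_per_file)):
    chunk = table.slice(start, rows_per_file)
    # row_group_size >= the chunk's row count, so each output file
    # holds exactly one row group plus its own footer
    pq.write_table(chunk, f"part-{i:05d}.parquet", row_group_size=rows_per_file)
```

Each output file is then a normal, self-contained parquet file with a single row group, so nothing has to be stitched together at read time.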
I guess I'm not sure if parquet-mr or whatever you're using to write parquet has an option to start a new file at each row group boundary, but that seems like it would probably solve your problem.

On Wed, Dec 30, 2020 at 1:09 PM Jason Altekruse <[email protected]> wrote:

> Hi Jayjeet,
>
> Is there a particular reason that you need to spread the data out into multiple small files? On HDFS at least there are longstanding scalability issues with having lots of small files around, so the general trend is to concatenate smaller files together. Even with larger files, the common querying engines (Spark, MapReduce, Hive, Impala, etc.) all allow parallelizing reads by blocks, which when configured properly should correspond to parquet row groups.
>
> The size of a row group is fixed by the setting of parquet.block.size. You mentioned alignment, and pretty early on a padding feature was added to parquet to ensure that row groups try to end on the true HDFS block boundaries, to avoid the need to read across blocks when accessing a row group (because row groups have to contain full rows, it is unlikely you will end up with exactly the right number of bytes in the row group to match the end of the HDFS block).
>
> https://github.com/apache/parquet-mr/pull/211
>
> So, to your specific proposal: it currently isn't possible to detach the footer from the file that contains the actual data in the row groups, but I think that is a good property. It means everything needed to read that data is fully contained in one file that can be moved/renamed safely.
>
> There are some systems that elect to write only a single row group per file, because HDFS doesn't allow rewriting data in place. Doing this lets individual rows be deleted or updated by rewriting those smaller files, instead of needing to read in and write back out a large file containing many row groups when only a single row group's data has changed.
>
> - Jason
>
>
> On Wed, Dec 30, 2020 at 2:33 PM Jayjeet Chakraborty <[email protected]> wrote:
>
> > Hi all,
> >
> > I am trying to figure out if a large Parquet file can be striped across multiple small files based on a row group chunk size, where each stripe would naturally end up containing data pages from a single row group. So, if I tell my writer "write a parquet file in chunks of 128 MB" (assuming my row groups are around 128 MB), each of my chunks ends up being a self-contained row group, except maybe the last chunk, which also holds the footer contents. Is this possible? Can we fix the row group size (the amount of disk space a row group uses) while writing parquet files? Thanks a lot.
> >
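P.S. On the "can we fix the row group size" part of the question: most writers let you cap it at write time. A minimal sketch with pyarrow, again purely as an illustration (pyarrow caps row groups by row count, whereas parquet-mr's parquet.block.size targets a byte size; the path and the 1M-row cap below are made up):

```python
import pyarrow.parquet as pq

# Illustrative only: the paths and the 1M-row cap are made-up values.
table = pq.read_table("big.parquet")

# Every row group in the output holds at most 1M rows.
pq.write_table(table, "one_file_many_groups.parquet", row_group_size=1_000_000)
```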
