Hi Jayjeet,

Is there a particular reason you need to spread data out into multiple small files? On HDFS, at least, there are longstanding scalability issues with keeping lots of small files around, so the general trend is to concatenate smaller files together. Even with larger files, the common query engines (Spark, MapReduce, Hive, Impala, etc.) can all parallelize reads by blocks, which, when configured properly, should correspond to Parquet row groups.
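To make the row-group sizing concrete, here is a minimal sketch of how a writer decides where one row group ends and the next begins: buffer rows until the estimated size reaches a target (analogous to parquet.block.size), then flush. This is an illustration only, not parquet-mr's actual logic, and the function name and sizes are my own assumptions:

```python
# Simplified sketch of row-group flushing: buffer rows and start a new
# row group once the buffered size reaches a target byte count.
# Illustrative only -- not parquet-mr's real implementation.

def split_into_row_groups(row_sizes, target_bytes):
    """Group consecutive rows so each group's total size reaches
    target_bytes (except possibly the last, shorter group)."""
    groups, current, current_bytes = [], [], 0
    for size in row_sizes:
        current.append(size)
        current_bytes += size
        if current_bytes >= target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        groups.append(current)  # trailing partial row group
    return groups

# Ten 40-byte rows with a 128-byte target: two full groups of 4 rows,
# then a short trailing group of 2 rows.
print(split_into_row_groups([40] * 10, 128))
# -> [[40, 40, 40, 40], [40, 40, 40, 40], [40, 40]]
```

Note that because a row group must contain whole rows, the flushed group usually overshoots the target slightly rather than landing on an exact byte count.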
The size of a row group is controlled by the parquet.block.size setting. You mentioned alignment: fairly early on, a padding feature was added to Parquet so that row groups try to end on the true HDFS block boundaries, avoiding the need to read across blocks when accessing a row group. (Because row groups have to contain full rows, it is unlikely a row group will end with exactly the right number of bytes to match the end of the HDFS block.) See https://github.com/apache/parquet-mr/pull/211

As for your specific proposal: it currently isn't possible to detach the footer from the file that contains the actual data in the row groups. I think that is a good property, though; it means everything needed to read that data is fully contained in one file that can be moved/renamed safely.

There are some systems that elect to write only a single row group per file, because HDFS doesn't allow rewriting data in place. Doing this enables use cases where individual rows need to be deleted or updated: those changes can be accomplished by rewriting smaller files, instead of reading in and writing back out a large file containing many row groups when only a single row group's data has changed.

- Jason

On Wed, Dec 30, 2020 at 2:33 PM Jayjeet Chakraborty < [email protected]> wrote:

> Hi all,
>
> I am trying to figure out if a large Parquet file can be striped across
> multiple small files based on a Row group chunk size where each stripe
> would naturally end up containing data pages from a single row group. So,
> if I say my writer "write a parquet file in chunks of 128 MB (assuming my
> row groups are of around 128MB), each of my chunks ends up being
> self-contained row group, maybe except the last chunk which has the footer
> contents. Is this possible? Can we fix the row group size (the amount of
> disk space a row group uses) while writing parquet files ? Thanks a lot.
>
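The padding behavior from parquet-mr PR 211 that Jason describes can be sketched with some simple arithmetic: if the space left in the current HDFS block is small, pad to the boundary so the next row group starts on a fresh block; otherwise, don't waste the space. This is a sketch under assumed names and an assumed max-padding threshold, not parquet-mr's actual constants or code:

```python
# Sketch of the block-alignment padding idea: when the remaining space
# in the current HDFS block is small, pad up to the boundary so the next
# row group begins on a fresh block and can be read without crossing
# blocks. Names and the 8 MiB threshold are illustrative assumptions.

def padding_needed(current_offset, hdfs_block_size, max_padding):
    """Bytes of padding to write before starting the next row group."""
    remaining = hdfs_block_size - (current_offset % hdfs_block_size)
    if remaining == hdfs_block_size:
        return 0  # already exactly on a block boundary
    # Pad only if the leftover space is small enough to be worth wasting.
    return remaining if remaining <= max_padding else 0

BLOCK = 128 * 1024 * 1024   # 128 MiB HDFS block size
MAX_PAD = 8 * 1024 * 1024   # assumed cap on padding per boundary

# 126 MiB written: only 2 MiB left in the block, so pad those 2 MiB.
print(padding_needed(126 * 1024**2, BLOCK, MAX_PAD))   # -> 2097152
# 100 MiB written: 28 MiB left, too much to waste, so write no padding
# and let the row group spill into the next block.
print(padding_needed(100 * 1024**2, BLOCK, MAX_PAD))   # -> 0
```

The trade-off the real feature makes is the same as in this sketch: a little wasted space inside the file in exchange for row-group reads that stay within a single HDFS block.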
