Hi Jayjeet,

I assume you are using parquet-mr (and not another Parquet implementation such as parquet-cpp or Impala).
I am not sure I understood your request correctly. You can configure the size of the row group by setting the property parquet.block.size. You may also want to check parquet.writer.max-padding so that the row groups fit exactly into the blocks. See details about the available configuration properties at https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md. A minimal sketch of setting these properties through the writer API is included below the quoted message.

Currently, parquet-mr does not have the functionality to automatically close a Parquet file and start a new one during writing.

Regards,
Gabor

On Thu, Dec 31, 2020 at 5:58 AM Jayjeet Chakraborty <[email protected]> wrote:

> Hi all,
>
> I am trying to figure out whether a large Parquet file can be striped
> across multiple small files based on a row group chunk size, where each
> stripe would naturally end up containing the data pages of a single row
> group. So, if I tell my writer "write a Parquet file in chunks of 128 MB"
> (assuming my row groups are around 128 MB), each of my chunks ends up
> being a self-contained row group, except perhaps the last chunk, which
> also holds the footer contents. Is this possible? Can we fix the row
> group size (the amount of disk space a row group uses) while writing
> Parquet files? Thanks a lot.
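Here is the sketch mentioned above: setting the two properties through parquet-mr's ParquetWriter builder. The schema, output path, and the 8 MB padding value are made-up assumptions for illustration; ExampleParquetWriter is the demo writer shipped with parquet-hadoop.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.example.data.Group;
  import org.apache.parquet.hadoop.ParquetWriter;
  import org.apache.parquet.hadoop.example.ExampleParquetWriter;
  import org.apache.parquet.schema.MessageType;
  import org.apache.parquet.schema.MessageTypeParser;

  public class RowGroupSizeSketch {
    public static void main(String[] args) throws Exception {
      // Hypothetical schema and output path, just for illustration.
      MessageType schema = MessageTypeParser.parseMessageType(
          "message example { required int64 id; required binary name (UTF8); }");

      try (ParquetWriter<Group> writer = ExampleParquetWriter
          .builder(new Path("/tmp/example.parquet"))
          .withConf(new Configuration())
          .withType(schema)
          // Target row group size of 128 MB (parquet.block.size):
          .withRowGroupSize(128 * 1024 * 1024)
          // Allow up to 8 MB of padding so row groups can be aligned
          // to block boundaries (parquet.writer.max-padding):
          .withMaxPaddingSize(8 * 1024 * 1024)
          .build()) {
        // ... write Group records here ...
      }
    }
  }

Note that parquet.block.size is a target rather than a hard limit: the writer checks the buffered size only periodically, so the actual on-disk size of a row group can deviate somewhat from the configured value.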
