On Thu, 29 Aug 2024 15:53:36 +0100
Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID> wrote:
> > IIRC some writers (perhaps parquet-rs?) always write a single row
> > group, however large the data.
> FWIW parquet-rs will write multiple row groups depending on the
> configuration. The defaults will write row groups of up to 1M rows [1].
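For reference, a minimal sketch of overriding that 1M-row default via
the parquet-rs WriterProperties builder (set_max_row_group_size is the
actual parquet-rs API; the function name and the 100k cap are just for
illustration):

    use parquet::file::properties::WriterProperties;

    fn row_group_capped_properties() -> WriterProperties {
        // Cap row groups at 100k rows instead of the 1M-row default
        // (DEFAULT_MAX_ROW_GROUP_SIZE); the writer flushes a row group
        // whenever this many rows have been buffered.
        WriterProperties::builder()
            .set_max_row_group_size(100_000)
            .build()
    }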
Oops, sorry for misremembering!

Best regards

Antoine.


> Perhaps you might be thinking of parquet-cpp which for a very long time
> had very high defaults, leading it to often just create a single massive
> row group [2]? I believe this was a bug and has been fixed.
>
> Kind Regards,
>
> Raphael
>
> [1]:
> https://docs.rs/parquet/latest/parquet/file/properties/constant.DEFAULT_MAX_ROW_GROUP_SIZE.html
> [2]: https://github.com/apache/arrow/pull/36012
>
> On 29/08/2024 15:11, Antoine Pitrou wrote:
> > On Thu, 29 Aug 2024 12:33:25 +0200
> > Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.INVALID>
> > wrote:
> >> The simplest fix for a writer is to limit row groups to 2^31
> >> logical bytes and then run encoding/compression.
> > I would be curious to see how complex the required logic ends up,
> > especially when taking into account nested types. A pathological case
> > would be a nested type with more than 2^31 repeated values in a
> > single "row".
> >
> >> Given that row groups are typically targeting a size of 64/128MB
> >> that should work rather well unless the data in question is of
> >> extremely low entropy and compresses too well.
> > IIRC some writers (perhaps parquet-rs?) always write a single row
> > group, however large the data.
> >
> > Regards
> >
> > Antoine.
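For illustration, a rough sketch of the byte-capped flushing Alkis
describes (Row, estimate_logical_bytes, and flush_row_group are
hypothetical placeholders, not parquet-rs APIs). Note the cap is only
checked at row boundaries, so a single "row" whose nested repeated
values alone exceed 2^31 bytes, Antoine's pathological case, would still
produce an oversized row group:

    struct Row; // stand-in for one buffered logical row

    const MAX_ROW_GROUP_LOGICAL_BYTES: u64 = 1 << 31; // 2^31 cap

    fn estimate_logical_bytes(_row: &Row) -> u64 {
        // Hypothetical: sum the uncompressed sizes of the row's values,
        // including every repeated value in nested columns.
        1
    }

    fn flush_row_group(buffered: &mut Vec<Row>) {
        // Hypothetical: encode + compress the buffered rows as one
        // row group, then clear the buffer.
        buffered.clear();
    }

    fn write_rows(rows: impl Iterator<Item = Row>) {
        let mut buffered: Vec<Row> = Vec::new();
        let mut buffered_bytes: u64 = 0;
        for row in rows {
            let row_bytes = estimate_logical_bytes(&row);
            // Flush before the cap would be exceeded. Because this
            // check happens between rows, one row larger than the cap
            // still overflows it.
            if !buffered.is_empty()
                && buffered_bytes + row_bytes > MAX_ROW_GROUP_LOGICAL_BYTES
            {
                flush_row_group(&mut buffered);
                buffered_bytes = 0;
            }
            buffered_bytes += row_bytes;
            buffered.push(row);
        }
        if !buffered.is_empty() {
            flush_row_group(&mut buffered);
        }
    }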