On Thu, 29 Aug 2024 12:33:25 +0200
Alkis Evlogimenos <alkis.evlogime...@databricks.com.INVALID> wrote:
>
> The simplest fix for a writer is to limit row groups to 2^31
> logical bytes and then run encoding/compression.
I would be curious to see how complex the required logic ends up,
especially when taking nested types into account. A pathological case
would be a nested type with more than 2^31 repeated values in a
single "row". A rough sketch of what such a check might look like is
appended below.

> Given that row groups are
> typically targeting a size of 64/128MB that should work rather well unless
> the data in question is of extremely low entropy and compresses too well.

IIRC some writers (perhaps parquet-rs?) always write a single row
group, however large the data.

Regards

Antoine.
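PS: to make the complexity concrete, here is a rough, hypothetical Python
sketch (the names and the size heuristic are mine, not taken from any
existing writer) of a buffer-and-flush check that keeps each row group
under 2^31 logical bytes and can only split between rows:

    ROW_GROUP_LOGICAL_LIMIT = 2 ** 31  # logical (pre-encoding, pre-compression) bytes


    def logical_size(row):
        # Rough logical size of one row: variable-width values count their
        # length, everything else is assumed to be an 8-byte scalar.
        size = 0
        for value in row:
            if isinstance(value, (list, tuple)):  # repeated (nested) field
                size += sum(len(v) if isinstance(v, (bytes, str)) else 8
                            for v in value)
            elif isinstance(value, (bytes, str)):
                size += len(value)
            else:
                size += 8
        return size


    def write_row_groups(rows, flush_row_group):
        # Buffer rows and flush a row group before its logical size would
        # cross the limit. Splitting can only happen at row boundaries.
        buffered, buffered_size = [], 0
        for row in rows:
            row_size = logical_size(row)
            if row_size > ROW_GROUP_LOGICAL_LIMIT:
                # The pathological case: a single "row" of a nested type with
                # more than 2^31 bytes of repeated values cannot be split.
                raise ValueError("single row exceeds the 2^31 logical-byte limit")
            if buffered and buffered_size + row_size > ROW_GROUP_LOGICAL_LIMIT:
                flush_row_group(buffered)  # encode/compress and write the group
                buffered, buffered_size = [], 0
            buffered.append(row)
            buffered_size += row_size
        if buffered:
            flush_row_group(buffered)

The sketch also shows why the pathological case is a problem: since the
split can only happen between rows, a single record whose repeated values
alone exceed 2^31 logical bytes cannot fit in any row group under this
scheme.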