On Thu, 29 Aug 2024 15:53:36 +0100
Raphael Taylor-Davies
<r.taylordav...@googlemail.com.INVALID>
wrote:
> > IIRC some writers (perhaps parquet-rs?) always write a single row  
> FWIW parquet-rs will write multiple row groups depending on the 
> configuration. The defaults will write row groups of up to 1M rows [1].

Oops, sorry for misremembering!
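
For anyone who wants a tighter limit, the default can be overridden
through the writer properties. A minimal, untested sketch against the
parquet-rs Arrow writer API (the helper name and the 64k cap below are
just illustrative):

    use std::fs::File;
    use std::sync::Arc;

    use arrow_array::RecordBatch;
    use arrow_schema::Schema;
    use parquet::arrow::ArrowWriter;
    use parquet::file::properties::WriterProperties;

    // Write `batches` with row groups capped at 64k rows instead of
    // DEFAULT_MAX_ROW_GROUP_SIZE (1M rows, see [1]).
    fn write_batches(
        path: &str,
        schema: Arc<Schema>,
        batches: &[RecordBatch],
    ) -> Result<(), Box<dyn std::error::Error>> {
        let props = WriterProperties::builder()
            .set_max_row_group_size(64 * 1024)
            .build();
        let file = File::create(path)?;
        let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
        for batch in batches {
            // A new row group is started whenever the cap is reached.
            writer.write(batch)?;
        }
        // Flushes the last (possibly short) row group and the footer.
        writer.close()?;
        Ok(())
    }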

Best regards

Antoine.


> Perhaps you were thinking of parquet-cpp, which for a very long time 
> had very high defaults, often leading it to create a single massive 
> row group [2]? I believe this was a bug that has since been fixed.
> 
> Kind Regards,
> 
> Raphael
> 
> [1]: 
> https://docs.rs/parquet/latest/parquet/file/properties/constant.DEFAULT_MAX_ROW_GROUP_SIZE.html
> [2]: https://github.com/apache/arrow/pull/36012
> 
> On 29/08/2024 15:11, Antoine Pitrou wrote:
> > On Thu, 29 Aug 2024 12:33:25 +0200
> > Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.INVALID>
> > wrote:  
> >> The simplest fix for a writer is to limit row groups to 2^31
> >> logical bytes and then run encoding/compression.  
> > I would be curious to see how complex the required logic ends up,
> > especially when taking nested types into account. A pathological case would
> > be a nested type with more than 2^31 repeated values in a single "row".
> >  
> >> Given that row groups are
> >> typically targeting a size of 64/128MB that should work rather well unless
> >> the data in question is of extremely low entropy and compresses too well.  
> > IIRC some writers (perhaps parquet-rs?) always write a single row
> > group, however large the data.
> >
> > Regards
> >
> > Antoine.
> >
> >  
> 
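
For concreteness, one shape the 2^31-byte cap proposed above could take
inside a writer. This is a hypothetical sketch (every name below is
invented for illustration), with the nested-type pathology flagged in a
comment:

    // Cap on the logical (unencoded) size of a buffered row group,
    // per the 2^31-byte limit proposed above.
    const MAX_ROW_GROUP_LOGICAL_BYTES: usize = 1 << 31;

    struct RowGroupBuffer {
        buffered_logical_bytes: usize,
        buffered_rows: usize,
    }

    impl RowGroupBuffer {
        /// Returns true if appending a row of `row_logical_bytes` would
        /// push the group past the cap, meaning the buffered group must
        /// be flushed (encoded and compressed) first.
        fn needs_flush(&self, row_logical_bytes: usize) -> bool {
            // Pathological case noted above: a single "row" of a nested
            // type can exceed the cap on its own (more than 2^31 repeated
            // values), so a real writer must also split below row
            // granularity or reject such rows.
            self.buffered_rows > 0
                && self.buffered_logical_bytes + row_logical_bytes
                    > MAX_ROW_GROUP_LOGICAL_BYTES
        }
    }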