When clients can do parallel reads of column chunks (e.g. vectored IO), the
size of row groups really matters: if a file is split such that the Parquet
library can request column chunks/pages in parallel, then load time will be
lower. But what does that mean for processing time?
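
To make that concrete, here's a rough sketch of the kind of parallel read I
mean (pyarrow; the path, column names, and worker count are made up for
illustration):

    from concurrent.futures import ThreadPoolExecutor
    import pyarrow.parquet as pq

    PATH = "data.parquet"  # made-up path
    num_groups = pq.ParquetFile(PATH).metadata.num_row_groups

    def read_group(i):
        # One reader per task so nothing is shared across threads;
        # each call fetches only the column chunks of row group i.
        return pq.ParquetFile(PATH).read_row_group(i, columns=["id", "value"])

    with ThreadPoolExecutor(max_workers=8) as pool:
        tables = list(pool.map(read_group, range(num_groups)))

With many small row groups the per-request overhead starts to dominate; with a
few huge ones there is little left to parallelise, which is why how the file is
split matters so much for load time.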



On Fri, 30 May 2025 at 13:17, Claire McGinty <claire.d.mcgi...@gmail.com>
wrote:

> I'm curious about this as well. I've made some attempts at write
> benchmarking but the challenge is that the "optimal" configuration is so
> dependent on how you intend to read the data... for example, we used to
> recommend a 512MB block size as a reasonable default, which worked well for
> wide schemas that were always read with tiny projections, but not so well
> for narrow schemas intended to be read in their entirety. Same with the
> page size param - bumping up the default value improves compression, but
> depending on the distribution of column values, statistics filtering
> degrades.
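>
> (For concreteness, these are the two knobs I mean, in pyarrow terms;
> parquet-java sizes row groups in bytes via the block size while pyarrow
> counts rows, and the numbers below are purely illustrative, not
> recommendations:)
>
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     table = pa.table({"x": list(range(1_000_000))})  # stand-in data
>     pq.write_table(
>         table,
>         "out.parquet",
>         row_group_size=100_000,      # rows per row group (pyarrow counts rows)
>         data_page_size=1024 * 1024,  # target encoded page size in bytes
>         write_statistics=True,       # stats used for page/row-group filtering
>     )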
>
> A lot of the time it ends up being a tradeoff between saving money on
> storage, or on downstream processing costs (and as Steve mentioned, even
> that varies by processing engine).
>
> It could be helpful to publish some kind of qualitative tuning guide
> somewhere in the Parquet docs, since I feel like I've mostly
> learned through trial and error, and reading through parquet-java internals
> :)
>
> Claire
>
> On Wed, May 28, 2025 at 8:40 PM Ashish Singh <asi...@apache.org> wrote:
>
> > > FWIW the tool is python, so I use pyarrow when generating numbers. I
> > haven't yet tested to see how well the results translate to other
> writers.
> >
> > Would be curious about this too.
> >
> > On Wed, May 28, 2025 at 12:28 PM Ed Seidl <etse...@apache.org> wrote:
> >
> > > Yes, right now we're targeting pyarrow and parquet-cpp, but will add
> > > parquet-rs soon too. We haven't used parquet-java for quite a while, so
> > > I've lost track of the possible configs there.
> > >
> > > All columns get PLAIN and DICTIONARY encoding, and then I'll add in other
> > > encodings based on the physical type of the column. Beyond that there are
> > > no heuristics, but there are command-line flags to limit the test space
> > > (you can select only certain columns and cut down on compression codecs,
> > > for instance).
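> > >
> > > The inner loop is roughly this shape (a much simplified pyarrow sketch,
> > > not the actual tool; the stand-in data, encodings, and codecs are just
> > > for illustration):
> > >
> > >     import itertools, os
> > >     import pyarrow as pa
> > >     import pyarrow.parquet as pq
> > >
> > >     table = pa.table({"x": list(range(100_000))})  # stand-in data
> > >     encodings = ["PLAIN", "DELTA_BINARY_PACKED"]   # chosen per physical type
> > >     codecs = ["SNAPPY", "ZSTD", "NONE"]
> > >
> > >     results = {}
> > >     for enc, codec in itertools.product(encodings, codecs):
> > >         out = f"trial_{enc}_{codec}.parquet"
> > >         pq.write_table(
> > >             table, out,
> > >             compression={"x": codec},    # per-column codec
> > >             use_dictionary=False,        # off so column_encoding applies
> > >             column_encoding={"x": enc},  # per-column encoding
> > >         )
> > >         results[(enc, codec)] = os.path.getsize(out)
> > >     print(min(results, key=results.get))  # smallest file wins
> > >
> > > (The real sweep also covers dictionary encoding and max dictionary size,
> > > per the description above, rather than just turning the dictionary off.)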
> > >
> > > FWIW the tool is python, so I use pyarrow when generating numbers. I
> > > haven't yet tested to see how well the results translate to other
> > writers.
> > >
> > > On 2025/05/28 19:13:41 Ashish Singh wrote:
> > > > > What my tool does is, for a given input parquet file and for each
> > > column,
> > > > cycle through all combinations of column encoding, column
> compression,
> > > and
> > > > max dictionary size. When it's done the optimal settings (to minimize
> > > file
> > > > size) for those are given for each column, along with code snippets
> to
> > > set
> > > > them (either pyarrow or parquet-cpp at the moment).
> > > >
> > > > Thanks Ed. Do you cycle through all possible configs for an input file,
> > > > or do you also use some heuristics to narrow the search space? Per-column
> > > > compression tuning doesn't seem to be achievable in parquet-java
> > > > currently, so it sounds like your use case is primarily pyarrow and
> > > > parquet-cpp?
> > > >
> > > >
> > > > On Wed, May 28, 2025 at 11:37 AM Ed Seidl <etse...@apache.org>
> wrote:
> > > >
> > > > > What my tool does is, for a given input parquet file and for each
> > > column,
> > > > > cycle through all combinations of column encoding, column
> > compression,
> > > and
> > > > > max dictionary size. When it's done the optimal settings (to
> minimize
> > > file
> > > > > size) for those are given for each column, along with code snippets
> > to
> > > set
> > > > > them (either pyarrow or parquet-cpp at the moment).
> > > > >
> > > > > In the past I've done a little tuning work on row group/page size
> for
> > > > > point lookup on hdfs, but that was all manual.
> > > > >
> > > > > Ed
> > > > >
> > > > > On 2025/05/28 17:58:38 Ashish Singh wrote:
> > > > > > We typically aim for ~800 MB file sizes for object stores. However,
> > > > > > we are not interested in changing the file's content or size as part
> > > > > > of the Parquet tuning. We simply want to optimize how that content is
> > > > > > laid out within the file to optimize for a particular resource like
> > > > > > storage size, read speed, write speed, etc.
> > > > > >
> > > > > >
> > > > > > On Wed, May 28, 2025 at 10:27 AM Adrian Garcia Badaracco <
> > > > > > adr...@adriangb.com> wrote:
> > > > > >
> > > > > > > I’ve often seen 100MB as a “reasonable” default choice, but I don’t
> > > > > > > have a lot of data to substantiate that. On our system we’ve found
> > > > > > > that smaller (e.g. 5MB) leads to too much overhead, and larger (e.g.
> > > > > > > 5GB) leads to OOMs and too much overhead parsing footers/stats even
> > > > > > > if you’re only going to read a couple of rows, etc.
> > > > > > >
> > > > > > > > On May 28, 2025, at 12:23 PM, Steve Loughran
> > > > > <ste...@cloudera.com.invalid>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Interesting q here.
> > > > > > > >
> > > > > > > > TPC benchmarks do give different numbers for different file sizes,
> > > > > > > > independent of the nominal TPC scale (e.g. different values for the
> > > > > > > > 10TB numbers, with everything else the same).
> > > > > > > >
> > > > > > > > I know it's all so dependent on cluster, app, etc., but what sizes
> > > > > > > > do people use in (a) benchmarks and (b) production datasets? Or at
> > > > > > > > least: what minimum sizes show up as very inefficient, and what
> > > > > > > > large sizes seem to show no incremental benefit?
> > > > > > > >
> > > > > > > > The minimum size is going to be especially significant for
> > > > > > > > distributed engines like Spark, where there are per-task setup
> > > > > > > > costs, but so is using cloud storage as the data lake: there's
> > > > > > > > overhead in simply opening files and reading footers, which
> > > > > > > > penalises small files. Parquet through DuckDB is inevitably going
> > > > > > > > to be very different.
> > > > > > > >
> > > > > > > > Papers with empirical data are welcome.
> > > > > > > >
> > > > > > > >
> > > > > > > > Steve
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, 28 May 2025 at 17:52, Ashish Singh <
> > > > > singhashish....@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Thanks all!
> > > > > > > >>
> > > > > > > >> Yea, I am mostly looking at available tooling to tune parquet files.
> > > > > > > >>
> > > > > > > >> Ed, I would be interested to discuss this. Would you (or anyone else)
> > > > > > > >> like to have a dedicated discussion on this? To provide some context,
> > > > > > > >> at Pinterest we are actively looking into adopting/building such
> > > > > > > >> tooling. We, like others, have traditionally been relying on manual
> > > > > > > >> tuning so far, which isn't really scalable.
> > > > > > > >>
> > > > > > > >> Best Regards,
> > > > > > > >> Ashish
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Wed, May 28, 2025 at 9:29 AM Ed Seidl <
> etse...@apache.org>
> > > > > wrote:
> > > > > > > >>
> > > > > > > >>> I'm developing such a tool for my own use. Right now it
> only
> > > > > optimizes
> > > > > > > >> for
> > > > > > > >>> size, but I'm planning to add query time later. I'm trying
> to
> > > get
> > > > > it
> > > > > > > open
> > > > > > > >>> sourced, but the wheels of bureaucracy turn slowly :(
> > > > > > > >>>
> > > > > > > >>> Ed
> > > > > > > >>>
> > > > > > > >>> On 2025/05/28 15:36:37 Martin Loncaric wrote:
> > > > > > > >>>> I think Ashish's question was about determining the right
> > > > > > > >>>> configuration in the first place - IIUC parquet-rewrite requires
> > > > > > > >>>> the user to pass these in.
> > > > > > > >>>>
> > > > > > > >>>> I'm not aware of any tool to choose good Parquet configurations
> > > > > > > >>>> automatically. I sometimes use the parquet-tools pip package / CLI
> > > > > > > >>>> to inspect Parquet and see how files are configured, but I've only
> > > > > > > >>>> tuned manually.
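> > > > > > > >>>>
> > > > > > > >>>> (If you'd rather script that inspection, pyarrow exposes the same
> > > > > > > >>>> metadata; a rough sketch, with the file name made up:)
> > > > > > > >>>>
> > > > > > > >>>>     import pyarrow.parquet as pq
> > > > > > > >>>>
> > > > > > > >>>>     md = pq.ParquetFile("data.parquet").metadata
> > > > > > > >>>>     print(md)  # num_row_groups, num_rows, created_by, ...
> > > > > > > >>>>     for rg in range(md.num_row_groups):
> > > > > > > >>>>         col = md.row_group(rg).column(0)
> > > > > > > >>>>         # per column-chunk encodings, codec, and sizes
> > > > > > > >>>>         print(rg, col.encodings, col.compression,
> > > > > > > >>>>               col.total_compressed_size, col.total_uncompressed_size)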
> > > > > > > >>>>
> > > > > > > >>>> On Tue, May 27, 2025, 16:22 Andrew Lamb <
> > > andrewlam...@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> We have one in the arrow-rs repository:
> parquet-rewrite[1]
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>>>> [1]:
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>>
> > > > > > > >>
> > > > > > >
> > > > >
> > >
> >
> https://github.com/apache/arrow-rs/blob/0da003becbd6489f483b70e5914a242edd8c6d1a/parquet/src/bin/parquet-rewrite.rs#L18
> > > > > > > >>>>>
> > > > > > > >>>>> On Tue, May 27, 2025 at 12:41 PM Ashish Singh <
> > > asi...@apache.org
> > > > > >
> > > > > > > >>> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>>> Hey all,
> > > > > > > >>>>>>
> > > > > > > >>>>>> Is there any tool/lib folks use to tune parquet configs to
> > > > > > > >>>>>> optimize for storage size / read / write speed?
> > > > > > > >>>>>>
> > > > > > > >>>>>> - Ashish
> > > > > > > >>>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
