> FWIW the tool is python, so I use pyarrow when generating numbers. I haven't yet tested to see how well the results translate to other writers.
Would be curious about this too.

On Wed, May 28, 2025 at 12:28 PM Ed Seidl <etse...@apache.org> wrote:

> Yes, right now we're targeting pyarrow and parquet-cpp, but will add
> parquet-rs soon too. We haven't used parquet-java for quite a while, so
> I've lost track of the possible configs there.
>
> All columns get PLAIN and DICTIONARY encoding, and then I'll add in
> other encodings based on the physical type of the column. Other than
> that, there are no other heuristics, but there are command-line flags to
> limit the test space (you can select only certain columns and cut down
> on compression codecs, for instance).
>
> FWIW the tool is python, so I use pyarrow when generating numbers. I
> haven't yet tested to see how well the results translate to other
> writers.
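(For concreteness, the kind of per-column overrides Ed's snippets target
might look roughly like this in pyarrow; the table, column names,
encodings, and codec choices below are illustrative placeholders, not
output from his tool:)

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "ts": [10, 20, 30]})

    # pyarrow wants dictionary encoding disabled for columns given an
    # explicit encoding; use_dictionary=False is the simplest way here.
    # Per-column compression is a plain {column: codec} dict.
    pq.write_table(
        table,
        "tuned.parquet",
        use_dictionary=False,
        column_encoding={"id": "PLAIN", "ts": "DELTA_BINARY_PACKED"},
        compression={"id": "zstd", "ts": "snappy"},
    )

(The max-dictionary-size knob Ed mentions would presumably map to
write_table's dictionary_pagesize_limit when dictionary encoding is left
on.)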
> On 2025/05/28 19:13:41 Ashish Singh wrote:
> > Thanks Ed. Do you cycle through all possible configs for an input
> > file, or do you also use some heuristics to narrow the search space?
> > The per-column compression tuning seems not to be achievable on
> > parquet-java currently; sounds like your use case is primarily on
> > pyarrow and parquet-cpp?
> >
> > On Wed, May 28, 2025 at 11:37 AM Ed Seidl <etse...@apache.org> wrote:
> >
> > > What my tool does is, for a given input parquet file and for each
> > > column, cycle through all combinations of column encoding, column
> > > compression, and max dictionary size. When it's done, the optimal
> > > settings (to minimize file size) for those are given for each
> > > column, along with code snippets to set them (either pyarrow or
> > > parquet-cpp at the moment).
> > >
> > > In the past I've done a little tuning work on row group/page size
> > > for point lookups on HDFS, but that was all manual.
> > >
> > > Ed
> > >
> > > On 2025/05/28 17:58:38 Ashish Singh wrote:
> > > > We typically aim at 800 MB file sizes for object stores. However,
> > > > we are not interested in changing file content or size as part of
> > > > the parquet tuning. We simply want to optimize how the file's
> > > > content is encoded for a particular resource, like storage size,
> > > > read speed, or write speed.
> > > >
> > > > On Wed, May 28, 2025 at 10:27 AM Adrian Garcia Badaracco <
> > > > adr...@adriangb.com> wrote:
> > > >
> > > > > I've often seen 100 MB as a "reasonable" default choice, but I
> > > > > don't have a lot of data to substantiate that. On our system
> > > > > we've found that smaller (e.g. 5 MB) leads to too much overhead,
> > > > > and larger (e.g. 5 GB) leads to OOMs, too much overhead parsing
> > > > > footers/stats even if you're only going to read a couple of
> > > > > rows, etc.
> > > > >
> > > > > > On May 28, 2025, at 12:23 PM, Steve Loughran
> > > > > > <ste...@cloudera.com.invalid> wrote:
> > > > > >
> > > > > > Interesting question here.
> > > > > >
> > > > > > TPC benchmarks do give different numbers for different file
> > > > > > sizes, independent of the nominal TPC scale (e.g. different
> > > > > > values for 10TB numbers, with everything else the same).
> > > > > >
> > > > > > I know it's all so dependent on cluster, app, etc., but what
> > > > > > sizes do people use in (a) benchmarks and (b) production
> > > > > > datasets? Or at least: what minimum sizes show up as very
> > > > > > inefficient, and what large sizes seem to show no incremental
> > > > > > benefit?
> > > > > >
> > > > > > The minimum size is going to be especially significant for
> > > > > > distributed engines like Spark, given the per-task setup
> > > > > > costs, and also when using cloud storage as the data lake:
> > > > > > there's overhead in simply opening files and reading footers,
> > > > > > which will penalise small files. Parquet through DuckDB is
> > > > > > inevitably going to be very different.
> > > > > >
> > > > > > Papers with empirical data welcome.
> > > > > >
> > > > > > Steve
> > > > > >
> > > > > > On Wed, 28 May 2025 at 17:52, Ashish Singh
> > > > > > <singhashish....@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks all!
> > > > > > >
> > > > > > > Yea, I am mostly looking at available tooling to tune
> > > > > > > parquet files.
> > > > > > >
> > > > > > > Ed, I would be interested to discuss this. Would you (or
> > > > > > > anyone else) like to have a dedicated discussion on this?
> > > > > > > To provide some context, at Pinterest we are actively
> > > > > > > looking into adopting/building such tooling. We, like
> > > > > > > others, have traditionally relied on manual tuning so far,
> > > > > > > which isn't really scalable.
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Ashish
> > > > > > >
> > > > > > > On Wed, May 28, 2025 at 9:29 AM Ed Seidl
> > > > > > > <etse...@apache.org> wrote:
> > > > > > >
> > > > > > > > I'm developing such a tool for my own use. Right now it
> > > > > > > > only optimizes for size, but I'm planning to add query
> > > > > > > > time later. I'm trying to get it open sourced, but the
> > > > > > > > wheels of bureaucracy turn slowly :(
> > > > > > > >
> > > > > > > > Ed
> > > > > > > >
> > > > > > > > On 2025/05/28 15:36:37 Martin Loncaric wrote:
> > > > > > > > > I think Ashish's question was about determining the
> > > > > > > > > right configuration in the first place - IIUC
> > > > > > > > > parquet-rewrite requires the user to pass these in.
> > > > > > > > >
> > > > > > > > > I'm not aware of any tool to choose good Parquet
> > > > > > > > > configurations automatically. I sometimes use the
> > > > > > > > > parquet-tools pip package / CLI to inspect Parquet
> > > > > > > > > files and see how they are configured, but I've only
> > > > > > > > > tuned manually.
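(For reference, the sort of inspection Martin describes can also be done
with pyarrow directly; a minimal sketch, with the file name as a
placeholder:)

    import pyarrow.parquet as pq

    md = pq.ParquetFile("data.parquet").metadata
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            c = md.row_group(rg).column(col)
            # Shows how each column chunk was actually written:
            # codec, encodings used, and compressed size on disk.
            print(c.path_in_schema, c.compression, c.encodings,
                  c.total_compressed_size)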
> > > > > > > > >
> > > > > > > > > On Tue, May 27, 2025, 16:22 Andrew Lamb
> > > > > > > > > <andrewlam...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > We have one in the arrow-rs repository:
> > > > > > > > > > parquet-rewrite [1]
> > > > > > > > > >
> > > > > > > > > > [1]:
> > > > > > > > > > https://github.com/apache/arrow-rs/blob/0da003becbd6489f483b70e5914a242edd8c6d1a/parquet/src/bin/parquet-rewrite.rs#L18
> > > > > > > > > >
> > > > > > > > > > On Tue, May 27, 2025 at 12:41 PM Ashish Singh
> > > > > > > > > > <asi...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey all,
> > > > > > > > > > >
> > > > > > > > > > > Is there any tool/lib folks use to tune parquet
> > > > > > > > > > > configs to optimize for storage size / read /
> > > > > > > > > > > write speed?
> > > > > > > > > > >
> > > > > > > > > > > - Ashish
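(Coming back to the file-size subthread: for anyone wanting to
experiment with targets like Adrian's ~100 MB figure, pyarrow exposes
the row group knob directly. A minimal sketch; the data and row count
are made-up stand-ins, and in practice the row count would be derived
from the average row width:)

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(10_000))})  # stand-in data

    # row_group_size is counted in rows, not bytes, so pick it so that
    # an average row group lands near the target size for your row width.
    pq.write_table(table, "part-0.parquet", row_group_size=1_000_000)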