What my tool does is, for a given input Parquet file, cycle through all
combinations of column encoding, column compression, and max dictionary size
for each column. When it's done, the optimal settings (to minimize file size)
are reported for each column, along with code snippets to set them (either
pyarrow or parquet-cpp at the moment).

In the past I've done a little tuning work on row group/page size for point
lookups on HDFS, but that was all manual.

Ed

On 2025/05/28 17:58:38 Ashish Singh wrote:
> We typically aim at 800 MB file sizes for object stores. However, we are
> not interested in changing the file's content or size as part of the
> Parquet tuning; we simply want to optimize the layout of the file for a
> particular resource, like storage size, read speed, or write speed.
> 
> 
> On Wed, May 28, 2025 at 10:27 AM Adrian Garcia Badaracco <
> adr...@adriangb.com> wrote:
> 
> > I’ve often seen 100MB as a “reasonable” default choice, but I don’t have
> > a lot of data to substantiate that. On our system we’ve found that
> > smaller files (e.g. 5MB) lead to too much overhead, while larger ones
> > (e.g. 5GB) lead to OOMs and to too much overhead parsing footers/stats
> > even if you’re only going to read a couple of rows.
> >
> > > On May 28, 2025, at 12:23 PM, Steve Loughran <ste...@cloudera.com.invalid>
> > wrote:
> > >
> > > Interesting question here.
> > >
> > > TPC benchmarks do give different numbers for different file sizes,
> > > independent of the nominal TPC scale (e.g. different values for 10TB
> > > runs, with everything else the same).
> > >
> > > I know it's all so dependent on cluster, app, etc., but what sizes do
> > > people use in (a) benchmarks and (b) production datasets? Or at least:
> > > what minimum sizes show up as very inefficient, and what large sizes
> > > seem to show no incremental benefit?
> > >
> > > The minimum size is going to be especially significant for distributed
> > > engines like Spark, as there are the work setup costs, but so is using
> > > cloud storage as the data lake: there's overhead in simply opening
> > > files and reading footers which will penalise small files. Parquet
> > > through DuckDB is inevitably going to be very different.
> > >
> > > Papers with empirical data welcome.
> > >
> > >
> > > Steve
> > >
> > >
> > > On Wed, 28 May 2025 at 17:52, Ashish Singh <singhashish....@gmail.com>
> > > wrote:
> > >
> > >> Thanks all!
> > >>
> > >> Yea, I am mostly looking at available tooling to tune Parquet files.
> > >>
> > >> Ed, I would be interested to discuss this. Would you (or anyone else)
> > >> like to have a dedicated discussion on this? To provide some context,
> > >> at Pinterest we are actively looking into adopting/building such
> > >> tooling. We, like others, have traditionally relied on manual tuning
> > >> so far, which isn't really scalable.
> > >>
> > >> Best Regards,
> > >> Ashish
> > >>
> > >>
> > >> On Wed, May 28, 2025 at 9:29 AM Ed Seidl <etse...@apache.org> wrote:
> > >>
> > >>> I'm developing such a tool for my own use. Right now it only
> > >>> optimizes for size, but I'm planning to add query time later. I'm
> > >>> trying to get it open sourced, but the wheels of bureaucracy turn
> > >>> slowly :(
> > >>>
> > >>> Ed
> > >>>
> > >>> On 2025/05/28 15:36:37 Martin Loncaric wrote:
> > >>>> I think Ashish's question was about determining the right
> > >>>> configuration in the first place - IIUC parquet-rewrite requires
> > >>>> the user to pass these in.
> > >>>>
> > >>>> I'm not aware of any tool to choose good Parquet configurations
> > >>>> automatically. I sometimes use the parquet-tools pip package / CLI
> > >>>> to inspect Parquet files and see how they are configured, but I've
> > >>>> only tuned manually.
> > >>>>
> > >>>> On Tue, May 27, 2025, 16:22 Andrew Lamb <andrewlam...@gmail.com>
> > >> wrote:
> > >>>>
> > >>>>> We have one in the arrow-rs repository: parquet-rewrite[1]
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> [1]:
> > >>>>> https://github.com/apache/arrow-rs/blob/0da003becbd6489f483b70e5914a242edd8c6d1a/parquet/src/bin/parquet-rewrite.rs#L18
> > >>>>>
> > >>>>> On Tue, May 27, 2025 at 12:41 PM Ashish Singh <asi...@apache.org> wrote:
> > >>>>>
> > >>>>>> Hey all,
> > >>>>>>
> > >>>>>> Is there any tool/lib folks use to tune Parquet configs to
> > >>>>>> optimize for storage size / read / write speed?
> > >>>>>>
> > >>>>>> - Ashish
> > >>>>>>
