> What my tool does is, for a given input parquet file and for each column,
cycle through all combinations of column encoding, column compression, and
max dictionary size. When it's done, the optimal settings (to minimize file
size) for those are given for each column, along with code snippets to set
them (either pyarrow or parquet-cpp at the moment).
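
A rough, purely illustrative sketch of what that kind of brute-force sweep could look like with pyarrow (the candidate codecs, dictionary limits, helper name, and input file below are placeholders, not Ed's actual tool; the encoding dimension is omitted because pyarrow's column_encoding option interacts with dictionary settings):

    # Illustrative only: find the smallest-on-disk settings for each column.
    import itertools
    import pyarrow as pa
    import pyarrow.parquet as pq

    CODECS = ["NONE", "SNAPPY", "ZSTD", "GZIP"]   # candidate column compressions
    DICT_LIMITS = [0, 64 * 1024, 1024 * 1024]     # 0 means dictionary disabled

    def best_settings_per_column(table: pa.Table) -> dict:
        best = {}
        for name in table.column_names:
            single = table.select([name])         # evaluate each column in isolation
            for codec, dict_limit in itertools.product(CODECS, DICT_LIMITS):
                sink = pa.BufferOutputStream()
                pq.write_table(
                    single,
                    sink,
                    compression=codec,
                    use_dictionary=dict_limit > 0,
                    dictionary_pagesize_limit=dict_limit or None,
                )
                size = sink.getvalue().size
                if name not in best or size < best[name][0]:
                    best[name] = (size, codec, dict_limit)
        return best

    print(best_settings_per_column(pq.read_table("input.parquet")))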

Thanks Ed. Do you cycle through all possible configs for an input file, or
do you also use some heuristics to narrow the search space? Per-column
compression tuning doesn't seem to be achievable on parquet-java currently;
it sounds like your use case is primarily on pyarrow and parquet-cpp?
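
For reference, per-column settings of the kind such a tool reports can be applied through pyarrow's writer options; parquet-cpp exposes equivalent knobs on WriterProperties::Builder. The file names, column names, and codec choices here are made-up placeholders rather than real tuning output:

    import pyarrow.parquet as pq

    table = pq.read_table("input.parquet")

    # Hypothetical per-column choices; real values would come from a tuning sweep.
    pq.write_table(
        table,
        "tuned.parquet",
        compression={"user_id": "ZSTD", "event_ts": "SNAPPY", "payload": "GZIP"},
        use_dictionary=["user_id"],            # dictionary-encode only this column
        dictionary_pagesize_limit=512 * 1024,  # dictionary page size cap (applies to all columns)
    )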


On Wed, May 28, 2025 at 11:37 AM Ed Seidl <etse...@apache.org> wrote:

> What my tool does is, for a given input parquet file and for each column,
> cycle through all combinations of column encoding, column compression, and
> max dictionary size. When it's done, the optimal settings (to minimize file
> size) for those are given for each column, along with code snippets to set
> them (either pyarrow or parquet-cpp at the moment).
>
> In the past I've done a little tuning work on row group/page size for
> point lookup on hdfs, but that was all manual.
>
> Ed
>
> On 2025/05/28 17:58:38 Ashish Singh wrote:
> > We typically aim at 800 MB file sizes for object stores. However, we are
> > not interested in changing file content or size as part of the parquet
> > tuning. We simply want to optimize how the file is encoded to optimize for
> > a particular resource like storage size, read speed, write speed, etc.
> >
> >
> > On Wed, May 28, 2025 at 10:27 AM Adrian Garcia Badaracco <adr...@adriangb.com> wrote:
> >
> > > I've often seen 100MB as a "reasonable" default choice. But I don't have
> > > a lot of data to substantiate that. On our system we've found that
> > > smaller (e.g. 5MB) leads to too much overhead and larger (e.g. 5GB) leads
> > > to OOMs, too much overhead parsing footers / stats even if you're only
> > > going to read a couple rows, etc.
> > >
> > > > On May 28, 2025, at 12:23 PM, Steve Loughran <ste...@cloudera.com.invalid> wrote:
> > > >
> > > > Interesting q here.
> > > >
> > > > TPC benchmarks do give different numbers for different file sizes,
> > > > independent of the nominal TPC scale (e.g. different values for 10TB
> > > > numbers, with everything else the same)
> > > >
> > > > I know it's all so dependent on cluster, app, etc., but what sizes do
> > > > people use in (a) benchmarks and (b) production datasets? Or at least:
> > > > what minimum sizes show up as very inefficient, and what large sizes
> > > > seem to show no incremental benefit.
> > > >
> > > > The minimum size is going to be so significant for distributed engines
> > > > like Spark, as there are the work setup costs, but so is using cloud
> > > > storage as the data lake - there's overhead in simply opening files and
> > > > reading footers, which will penalise small files. Parquet through
> > > > DuckDB is inevitably going to be very different.
> > > >
> > > > Papers with empirical data welcome.
> > > >
> > > >
> > > > Steve
> > > >
> > > >
> > > > On Wed, 28 May 2025 at 17:52, Ashish Singh <singhashish....@gmail.com> wrote:
> > > >
> > > >> Thanks all!
> > > >>
> > > >> Yea, I am mostly looking at available tooling to tune parquet files.
> > > >>
> > > >> Ed, I would be interested to discuss this. Would you (or anyone else)
> > > >> like to have a dedicated discussion on this? To provide some context,
> > > >> at Pinterest we are actively looking into adopting/building such
> > > >> tooling. We, like others, have traditionally been relying on manual
> > > >> tuning so far, which isn't really scalable.
> > > >>
> > > >> Best Regards,
> > > >> Ashish
> > > >>
> > > >>
> > > >> On Wed, May 28, 2025 at 9:29 AM Ed Seidl <etse...@apache.org> wrote:
> > > >>
> > > >>> I'm developing such a tool for my own use. Right now it only
> > > >>> optimizes for size, but I'm planning to add query time later. I'm
> > > >>> trying to get it open sourced, but the wheels of bureaucracy turn
> > > >>> slowly :(
> > > >>>
> > > >>> Ed
> > > >>>
> > > >>> On 2025/05/28 15:36:37 Martin Loncaric wrote:
> > > >>>> I think Ashish's question was about determining the right
> > > >>>> configuration in the first place - IIUC parquet-rewrite requires the
> > > >>>> user to pass these in.
> > > >>>>
> > > >>>> I'm not aware of any tool to choose good Parquet configurations
> > > >>>> automatically. I sometimes use the parquet-tools pip package / CLI
> > > >>>> to inspect Parquet and see how files are configured, but I've only
> > > >>>> tuned manually.
> > > >>>>
> > > >>>> On Tue, May 27, 2025, 16:22 Andrew Lamb <andrewlam...@gmail.com> wrote:
> > > >>>>
> > > >>>>> We have one in the arrow-rs repository: parquet-rewrite[1]
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> [1]:
> > > >>>>> https://github.com/apache/arrow-rs/blob/0da003becbd6489f483b70e5914a242edd8c6d1a/parquet/src/bin/parquet-rewrite.rs#L18
> > > >>>>>
> > > >>>>> On Tue, May 27, 2025 at 12:41 PM Ashish Singh <asi...@apache.org> wrote:
> > > >>>>>
> > > >>>>>> Hey all,
> > > >>>>>>
> > > >>>>>> Is there any tool/lib folks use to tune parquet configs to
> > > >>>>>> optimize for storage size / read / write speed?
> > > >>>>>>
> > > >>>>>> - Ashish
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> >
>
