When clients can do parallel reads of column chunks (e.g. vectored IO), then the size of row groups really matters: if a file is split so that the Parquet library can request column chunks/pages in parallel, load time will be lower. But what does that mean for processing time?
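As a rough sketch of the reader-side effect (pyarrow; the file name, columns, and sizes below are placeholders, not recommendations):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Toy table; smaller row groups mean more, smaller column chunks per
    # column, i.e. more independent byte ranges a vectored reader can fetch.
    table = pa.table({
        "id": pa.array(range(100_000)),
        "payload": pa.array(["x" * 64] * 100_000),
    })
    pq.write_table(table, "example.parquet", row_group_size=10_000)

    # pre_buffer=True coalesces the byte ranges of the projected column
    # chunks and issues the reads concurrently (helps most on object stores).
    t = pq.read_table("example.parquet", columns=["id"], pre_buffer=True)

Smaller row groups give the reader more ranges to fetch in parallel, but also more footer metadata and per-row-group setup, which is where the processing-time question comes in.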
On Fri, 30 May 2025 at 13:17, Claire McGinty <claire.d.mcgi...@gmail.com> wrote:

> I'm curious about this as well. I've made some attempts at write benchmarking but the challenge is that the "optimal" configuration is so dependent on how you intend to read the data... for example, we used to recommend a 512MB block size as a reasonable default, which worked well for wide schemas that were always read with tiny projections, but not so good for narrow schemas intended to be read in their entirety. Same with the page size param - bumping the default value improves compression, but depending on the distribution of column values, statistics filtering degrades.
>
> A lot of the time it ends up being a tradeoff between saving money on storage, or on downstream processing costs (and as Steve mentioned, even that varies by processing engine).
>
> It could be helpful to publish some kind of qualitative tuning guide somewhere in the Parquet docs, since I feel like I've mostly learned through trial and error, and reading through parquet-java internals :)
>
> Claire
>
> On Wed, May 28, 2025 at 8:40 PM Ashish Singh <asi...@apache.org> wrote:
>
> > FWIW the tool is python, so I use pyarrow when generating numbers. I haven't yet tested to see how well the results translate to other writers.
> >
> > Would be curious about this too.
> >
> > On Wed, May 28, 2025 at 12:28 PM Ed Seidl <etse...@apache.org> wrote:
> >
> > > Yes, right now we're targeting pyarrow and parquet-cpp, but will add parquet-rs soon too. We haven't used parquet-java for quite a while, so I've lost track of the possible configs there.
> > >
> > > All columns get PLAIN and DICTIONARY encoding, and then I'll add in other encodings based on the physical type of the column. Other than that, there are no other heuristics, but there are C/L flags to limit the test space (can select only certain columns and cut down on compression codecs for instance).
> > >
> > > FWIW the tool is python, so I use pyarrow when generating numbers. I haven't yet tested to see how well the results translate to other writers.
> > >
> > > On 2025/05/28 19:13:41 Ashish Singh wrote:
> > > > > What my tool does is, for a given input parquet file and for each column, cycle through all combinations of column encoding, column compression, and max dictionary size. When it's done the optimal settings (to minimize file size) for those are given for each column, along with code snippets to set them (either pyarrow or parquet-cpp at the moment).
> > > >
> > > > Thanks Ed. Do you cycle through all possible configs for an input file or do you also use some heuristics to narrow the search space? The per column compression tuning seems to be not achievable on parquet-java currently, sounds like your use-case is primarily on pyarrow and parquet-cpp?
> > > >
> > > > On Wed, May 28, 2025 at 11:37 AM Ed Seidl <etse...@apache.org> wrote:
> > > >
> > > > > What my tool does is, for a given input parquet file and for each column, cycle through all combinations of column encoding, column compression, and max dictionary size. When it's done the optimal settings (to minimize file size) for those are given for each column, along with code snippets to set them (either pyarrow or parquet-cpp at the moment).
> > > > >
> > > > > In the past I've done a little tuning work on row group/page size for point lookup on hdfs, but that was all manual.
> > > > >
> > > > > Ed
> > > > >
> > > > > On 2025/05/28 17:58:38 Ashish Singh wrote:
> > > > > > We typically aim at 800 Mbs file sizes for object stores. However, we are not interested in changing file content or size as part of the parquet tuning. We simply want to optimize the content of file to optimize for a particular resource like, storage size, read speed, write speed, etc.
> > > > > >
> > > > > > On Wed, May 28, 2025 at 10:27 AM Adrian Garcia Badaracco <adr...@adriangb.com> wrote:
> > > > > >
> > > > > > > I've often seen 100MB as a "reasonable" default choice. But I don't have a lot of data to substantiate that. On our system we've found that smaller (e.g. 5MB) leads to too much overhead and larger (e.g. 5GB) leads to OOMs, too much overhead parsing footers / stats even if you're only going to read a couple rows, etc.
> > > > > > >
> > > > > > > On May 28, 2025, at 12:23 PM, Steve Loughran <ste...@cloudera.com.invalid> wrote:
> > > > > > >
> > > > > > > > interesting q here.
> > > > > > > >
> > > > > > > > TPC benchmarks do give different numbers for different file sizes, independent of the nominal TPC scale (e.g different values for 10TB numbers, with everything else the same)
> > > > > > > >
> > > > > > > > I know it's all so dependent on cluster, app etc -but what sizes do people use in (a) benchmarks and (b) production datasets? Or at least: what minimum sizes show up as very inefficient, what large sizes seem to show no incremental benefit.
> > > > > > > >
> > > > > > > > The minimum size is going to be so significant for distributed engines like Spark, as there's the work setup costs, but so does using cloud storage as the data lake -there's overhead in simply opening files and reading footers which will penalise the files. Parquet through DuckDb is inevitably going to be very different
> > > > > > > >
> > > > > > > > papers with empirical data welcome..
> > > > > > > >
> > > > > > > > Steve
> > > > > > > >
> > > > > > > > On Wed, 28 May 2025 at 17:52, Ashish Singh <singhashish....@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks all!
> > > > > > > > >
> > > > > > > > > Yea, I am mostly looking at available tooling to tune parquet files.
> > > > > > > > >
> > > > > > > > > Ed, I would be interested to discuss this. Would you (or anyone else) like to have a dedicated discussion on this? To provide some context, at Pinterest we are actively looking into adopting/ building such tooling. We, like others, have been traditionally relying on manual tuning so far, which isn't really scalable.
> > > > > > > > >
> > > > > > > > > Best Regards,
> > > > > > > > > Ashish
> > > > > > > > >
> > > > > > > > > On Wed, May 28, 2025 at 9:29 AM Ed Seidl <etse...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > I'm developing such a tool for my own use. Right now it only optimizes for size, but I'm planning to add query time later. I'm trying to get it open sourced, but the wheels of bureaucracy turn slowly :(
> > > > > > > > > >
> > > > > > > > > > Ed
> > > > > > > > > >
> > > > > > > > > > On 2025/05/28 15:36:37 Martin Loncaric wrote:
> > > > > > > > > > > I think Ashish's question was about determining the right configuration in the first place - IIUC parquet-rewrite requires the user to pass these in.
> > > > > > > > > > >
> > > > > > > > > > > I'm not aware of any tool to choose good Parquet configurations automatically. I sometimes use the parquet-tools pip package / CLI to inspect Parquet and see how files are configured, but I've only tuned manually.
> > > > > > > > > > >
> > > > > > > > > > > On Tue, May 27, 2025, 16:22 Andrew Lamb <andrewlam...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > We have one in the arrow-rs repository: parquet-rewrite[1]
> > > > > > > > > > > >
> > > > > > > > > > > > [1]: https://github.com/apache/arrow-rs/blob/0da003becbd6489f483b70e5914a242edd8c6d1a/parquet/src/bin/parquet-rewrite.rs#L18
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, May 27, 2025 at 12:41 PM Ashish Singh <asi...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hey all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is there any tool/ lib folks use to tune parquet configs to optimize for storage size / read/ write speed?
> > > > > > > > > > > > >
> > > > > > > > > > > > > - Ashish
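For anyone wanting to apply the kind of per-column settings Ed's tool reports, a pyarrow sketch; the columns, codecs, and encodings here are made-up placeholders, not tuned values:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "user_id": pa.array(list(range(1000)) * 10),
        "note": pa.array(["alpha", "beta", "gamma", "delta"] * 2500),
    })

    # Per-column compression, dictionary encoding only where it pays off,
    # and a cap on the dictionary page size.
    pq.write_table(
        table, "dict_tuned.parquet",
        compression={"user_id": "zstd", "note": "snappy"},
        use_dictionary=["note"],
        dictionary_pagesize_limit=1024 * 1024,
    )

    # Explicit per-column encodings, with dictionary encoding turned off so
    # the requested encodings are the ones actually written.
    pq.write_table(
        table, "encoded.parquet",
        use_dictionary=False,
        column_encoding={"user_id": "DELTA_BINARY_PACKED", "note": "PLAIN"},
        compression="zstd",
    )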
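And on Martin's point about inspecting how existing files are configured: besides the parquet-tools CLI, the same information is available from the footer via pyarrow (the file name is a placeholder):

    import pyarrow.parquet as pq

    md = pq.ParquetFile("example.parquet").metadata
    print(md)  # number of row groups, rows, created_by, ...
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            c = md.row_group(rg).column(col)
            print(rg, c.path_in_schema, c.compression, c.encodings,
                  c.total_compressed_size, c.total_uncompressed_size)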