> FWIW the tool is python, so I use pyarrow when generating numbers. I haven't yet tested to see how well the results translate to other writers.
Would be curious about this too.

On Wed, May 28, 2025 at 12:28 PM Ed Seidl <etse...@apache.org> wrote:

> Yes, right now we're targeting pyarrow and parquet-cpp, but will add
> parquet-rs soon too. We haven't used parquet-java for quite a while, so
> I've lost track of the possible configs there.
>
> All columns get PLAIN and DICTIONARY encoding, and then I'll add in
> other encodings based on the physical type of the column. Other than
> that, there are no other heuristics, but there are command-line flags to
> limit the test space (you can select only certain columns and cut down
> on compression codecs, for instance).
>
> FWIW the tool is python, so I use pyarrow when generating numbers. I
> haven't yet tested to see how well the results translate to other
> writers.
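(For concreteness, the kind of per-column overrides Ed's snippets target
might look roughly like this in pyarrow; the table, column names,
encodings, and codec choices below are illustrative placeholders, not
output from his tool:)

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "ts": [10, 20, 30]})

    # pyarrow wants dictionary encoding disabled for columns given an
    # explicit encoding; use_dictionary=False is the simplest way here.
    # Per-column compression is a plain {column: codec} dict.
    pq.write_table(
        table,
        "tuned.parquet",
        use_dictionary=False,
        column_encoding={"id": "PLAIN", "ts": "DELTA_BINARY_PACKED"},
        compression={"id": "zstd", "ts": "snappy"},
    )

(The max-dictionary-size knob Ed mentions would presumably map to
write_table's dictionary_pagesize_limit when dictionary encoding is left
on.)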
> On 2025/05/28 19:13:41 Ashish Singh wrote:
> > Thanks Ed. Do you cycle through all possible configs for an input
> > file, or do you also use some heuristics to narrow the search space?
> > The per-column compression tuning seems not to be achievable on
> > parquet-java currently; sounds like your use case is primarily on
> > pyarrow and parquet-cpp?
> >
> > On Wed, May 28, 2025 at 11:37 AM Ed Seidl <etse...@apache.org> wrote:
> >
> > > What my tool does is, for a given input parquet file and for each
> > > column, cycle through all combinations of column encoding, column
> > > compression, and max dictionary size. When it's done, the optimal
> > > settings (to minimize file size) for those are given for each
> > > column, along with code snippets to set them (either pyarrow or
> > > parquet-cpp at the moment).
> > >
> > > In the past I've done a little tuning work on row group/page size
> > > for point lookups on HDFS, but that was all manual.
> > >
> > > Ed
> > >
> > > On 2025/05/28 17:58:38 Ashish Singh wrote:
> > > > We typically aim at 800 MB file sizes for object stores. However,
> > > > we are not interested in changing file content or size as part of
> > > > the parquet tuning. We simply want to optimize how the file's
> > > > content is encoded for a particular resource, like storage size,
> > > > read speed, or write speed.
> > > >
> > > > On Wed, May 28, 2025 at 10:27 AM Adrian Garcia Badaracco <
> > > > adr...@adriangb.com> wrote:
> > > >
> > > > > I've often seen 100 MB as a "reasonable" default choice, but I
> > > > > don't have a lot of data to substantiate that. On our system
> > > > > we've found that smaller (e.g. 5 MB) leads to too much overhead,
> > > > > and larger (e.g. 5 GB) leads to OOMs, too much overhead parsing
> > > > > footers/stats even if you're only going to read a couple of
> > > > > rows, etc.
> > > > >
> > > > > > On May 28, 2025, at 12:23 PM, Steve Loughran
> > > > > > <ste...@cloudera.com.invalid> wrote:
> > > > > >
> > > > > > Interesting question here.
> > > > > >
> > > > > > TPC benchmarks do give different numbers for different file
> > > > > > sizes, independent of the nominal TPC scale (e.g. different
> > > > > > values for 10TB numbers, with everything else the same).
> > > > > >
> > > > > > I know it's all so dependent on cluster, app, etc., but what
> > > > > > sizes do people use in (a) benchmarks and (b) production
> > > > > > datasets? Or at least: what minimum sizes show up as very
> > > > > > inefficient, and what large sizes seem to show no incremental
> > > > > > benefit?
> > > > > >
> > > > > > The minimum size is going to be especially significant for
> > > > > > distributed engines like Spark, given the per-task setup
> > > > > > costs, and also when using cloud storage as the data lake:
> > > > > > there's overhead in simply opening files and reading footers,
> > > > > > which will penalise small files. Parquet through DuckDB is
> > > > > > inevitably going to be very different.
> > > > > >
> > > > > > Papers with empirical data welcome.
> > > > > >
> > > > > > Steve
> > > > > >
> > > > > > On Wed, 28 May 2025 at 17:52, Ashish Singh
> > > > > > <singhashish....@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks all!
> > > > > > >
> > > > > > > Yea, I am mostly looking at available tooling to tune
> > > > > > > parquet files.
> > > > > > >
> > > > > > > Ed, I would be interested to discuss this. Would you (or
> > > > > > > anyone else) like to have a dedicated discussion on this?
> > > > > > > To provide some context, at Pinterest we are actively
> > > > > > > looking into adopting/building such tooling. We, like
> > > > > > > others, have traditionally relied on manual tuning so far,
> > > > > > > which isn't really scalable.
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Ashish
> > > > > > >
> > > > > > > On Wed, May 28, 2025 at 9:29 AM Ed Seidl
> > > > > > > <etse...@apache.org> wrote:
> > > > > > >
> > > > > > > > I'm developing such a tool for my own use. Right now it
> > > > > > > > only optimizes for size, but I'm planning to add query
> > > > > > > > time later. I'm trying to get it open sourced, but the
> > > > > > > > wheels of bureaucracy turn slowly :(
> > > > > > > >
> > > > > > > > Ed
> > > > > > > >
> > > > > > > > On 2025/05/28 15:36:37 Martin Loncaric wrote:
> > > > > > > > > I think Ashish's question was about determining the
> > > > > > > > > right configuration in the first place - IIUC
> > > > > > > > > parquet-rewrite requires the user to pass these in.
> > > > > > > > >
> > > > > > > > > I'm not aware of any tool to choose good Parquet
> > > > > > > > > configurations automatically. I sometimes use the
> > > > > > > > > parquet-tools pip package / CLI to inspect Parquet
> > > > > > > > > files and see how they are configured, but I've only
> > > > > > > > > tuned manually.
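(For reference, the sort of inspection Martin describes can also be done
with pyarrow directly; a minimal sketch, with the file name as a
placeholder:)

    import pyarrow.parquet as pq

    md = pq.ParquetFile("data.parquet").metadata
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            c = md.row_group(rg).column(col)
            # Shows how each column chunk was actually written:
            # codec, encodings used, and compressed size on disk.
            print(c.path_in_schema, c.compression, c.encodings,
                  c.total_compressed_size)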
> > > > > > > > >
> > > > > > > > > On Tue, May 27, 2025, 16:22 Andrew Lamb
> > > > > > > > > <andrewlam...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > We have one in the arrow-rs repository:
> > > > > > > > > > parquet-rewrite [1]
> > > > > > > > > >
> > > > > > > > > > [1]:
> > > > > > > > > > https://github.com/apache/arrow-rs/blob/0da003becbd6489f483b70e5914a242edd8c6d1a/parquet/src/bin/parquet-rewrite.rs#L18
> > > > > > > > > >
> > > > > > > > > > On Tue, May 27, 2025 at 12:41 PM Ashish Singh
> > > > > > > > > > <asi...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey all,
> > > > > > > > > > >
> > > > > > > > > > > Is there any tool/lib folks use to tune parquet
> > > > > > > > > > > configs to optimize for storage size / read /
> > > > > > > > > > > write speed?
> > > > > > > > > > >
> > > > > > > > > > > - Ashish
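(Coming back to the file-size subthread: for anyone wanting to
experiment with targets like Adrian's ~100 MB figure, pyarrow exposes
the row group knob directly. A minimal sketch; the data and row count
are made-up stand-ins, and in practice the row count would be derived
from the average row width:)

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(10_000))})  # stand-in data

    # row_group_size is counted in rows, not bytes, so pick it so that
    # an average row group lands near the target size for your row width.
    pq.write_table(table, "part-0.parquet", row_group_size=1_000_000)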