I second Weston's comments. The idea of separate files is part of the de jure spec but not the de facto one. It's up to the Parquet community whether the de facto spec should be "altered". AFAIK, zero OSS readers support use of this field.
On Wed, May 18, 2022, 8:53 AM Weston Pace <weston.p...@gmail.com> wrote:

> I can try and clarify my earlier feedback:
>
> This is an Arrow datasets question if your goal is to create multiple
> independent parquet files, each one a complete file, and read them as
> a combined dataset.
> This is not an Arrow question (but instead a parquet question) if your
> goal is to create a single "parquet object" that is read as a single
> unit by a parquet reader, but spans multiple physical files.
>
> Your PR appears to be addressing the second case. I think Micah's
> feedback is correct in this case. You should bring this up on the
> parquet list and make sure there is some interest in adding this
> feature in other implementations first. Otherwise there is a risk you
> will create these "parquet objects" that can only be read by a single
> implementation of the parquet reader.
>
> On Wed, May 18, 2022 at 3:55 AM Jeszy <jes...@gmail.com> wrote:
> >
> > Hello,
> >
> > I wanted to circle back to this topic and make sure there's a decision
> > by the community. Although there was sporadic discussion over jira[1],
> > the PR[2], and this list[3] in the past, the messaging across these
> > channels changed over time. E.g. while the PR comment is negative, the
> > much more recent Jira comment suggests a way forward - is this feature
> > likely to gain consensus if following those comments? I'd like to
> > avoid wasted effort.
> >
> > An upside not explicitly mentioned before is that while appending to a
> > multi-file Parquet dataset is already possible, as Weston mentioned,
> > being able to append row groups and rewrite the footer would enable
> > the file's (dataset's) overall metadata to be maintained as new data
> > is added.
> >
> > Thanks,
> > Balazs
> >
> > ---
> > [1] https://issues.apache.org/jira/browse/ARROW-11465
> > [2] https://github.com/apache/arrow/pull/8130
> > [3] https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw