Hi Andrew,

Thanks for the pointer to the discussion, which seems like the best place to continue this thread. The carpenter program proposal does seem like it would have helped here.
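
For concreteness, the check from item 1 of the original post quoted below could be sketched roughly as follows. This is only a sketch under assumptions: the names find_unreadable, read_with_pyarrow, and main are hypothetical, as is the corpus directory layout. It assumes a pyarrow build from the release candidate under test is installed and that the directory holds files written by parquet-rs >=53.0.

```python
from pathlib import Path


def read_with_pyarrow(path):
    # Imported lazily so find_unreadable can be exercised with another reader
    # on machines where pyarrow is not installed.
    import pyarrow.parquet as pq

    return pq.read_table(path)


def find_unreadable(paths, reader=read_with_pyarrow):
    """Return (path, error message) pairs for every file the reader fails on."""
    failures = []
    for path in paths:
        try:
            reader(str(path))
        except Exception as exc:  # any read error counts as an interop failure
            failures.append((str(path), f"{type(exc).__name__}: {exc}"))
    return failures


def main(corpus_dir):
    # Hypothetical layout: corpus_dir holds *.parquet files written by the
    # other implementation (here, parquet-rs).
    corpus = sorted(Path(corpus_dir).glob("*.parquet"))
    failures = find_unreadable(corpus)
    for path, err in failures:
        print(f"FAILED {path}: {err}")
    # A non-zero return value lets a CI job fail the release check.
    return 1 if failures else 0
```

A CI job could call main() on the corpus directory and pass its return value to sys.exit(). Taking the reader as a parameter is a design choice so that the same loop could be pointed at a different implementation's reader.
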
On Thu, Jan 30, 2025 at 2:55 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> This is a great idea. There is a previous discussion about a similar idea
> here[1]
>
> Specifically, I think Alkis's sketch of the "carpenter" program would have
> caught this situation.
>
> In my opinion, improving interoperability testing like this is a key step
> towards being able to reliably evolve the Parquet standard itself.
>
> Andrew
>
> [1]: https://github.com/apache/parquet-format/issues/441
>
> On Wed, Jan 29, 2025 at 3:49 PM Bryce Mecum <bryceme...@gmail.com> wrote:
>
> > Hello Parquet community,
> >
> > The Arrow project recently fixed a bug [1] in its C++ Parquet
> > implementation that was causing compliant Parquet files written by
> > recent versions of parquet-rs [2] to be unreadable by the C++
> > implementation due to differences in the implementation of Parquet’s
> > SizeStatistics feature [3]. This also affected the Arrow libraries
> > that bind to the C++ implementation, including PyArrow. The C++
> > implementation has been patched [4] and a new Arrow release (19.0.1)
> > is in the works.
> >
> > Given this, I wanted to start a discussion about what kind of
> > cross-implementation testing facilities may already exist in any of
> > the Parquet implementations and what kind of testing facilities might
> > be created to help catch situations like these.
> >
> > I’ll start off with my thoughts and encourage people to jump in:
> >
> > 1. The specific integration test that could have been run to catch
> > this bug would be a test that used the Arrow 19.0.0 release candidate
> > to read any Parquet file written by parquet-rs >=53.0. This would have
> > halted the release process. Should the Arrow project just add a CI job
> > like this and move on?
> > 2. Testing every combination of Parquet format versions, feature
> > toggles, implementations, and implementation versions is clearly too
> > large a problem to solve so it might be best to start off with a
> > narrow scope.
> >
> > Please note that I've cross-posted this to the Apache Arrow mailing
> > list. Please reply to the Apache Parquet post. I’m looking forward to
> > hearing others’ thoughts and ideas.
> >
> > Thanks,
> > Bryce
> >
> > [1] https://github.com/apache/arrow/issues/45283
> > [2] https://github.com/apache/arrow-rs/tree/main/parquet
> > [3] https://github.com/apache/parquet-format/pull/197
> > [4] https://github.com/apache/arrow/pull/45285