I think getting something set up, initially focused on variant (or geometry),
and then expanding it over time makes a lot of sense.

Andrew

On Fri, Feb 14, 2025 at 5:36 PM Bryce Mecum <bryceme...@gmail.com> wrote:

> Hi Gang, that does seem like a good idea. Would there be any benefit
> to trying that with the active spec changes like GEOMETRY/GEOGRAPHY or
> VARIANT?
>
> On Wed, Feb 5, 2025 at 9:14 PM Gang Wu <ust...@gmail.com> wrote:
> >
> > As the troublemaker of the mentioned issue above, I'd say that
> > a lesson learned is that we should publish example files for any
> > new feature to the parquet-testing [1] repo for interoperability tests.
> > Perhaps we need a staging repo/branch to store produced files
> > during the active development. This may help catch common issues
> > as early as possible.
> >
> > [1] https://github.com/apache/parquet-testing
> >
> > Best,
> > Gang
> >
> > On Thu, Jan 30, 2025 at 6:55 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> > > This is a great idea. There is a previous discussion about a similar
> > > idea here [1].
> > >
> > > Specifically, I think Alkis's sketch of the "carpenter" program would
> > > have caught this situation.
> > >
> > > In my opinion, improving interoperability testing like this is a key
> > > step towards being able to reliably evolve the Parquet standard itself.
> > >
> > > Andrew
> > >
> > > [1]: https://github.com/apache/parquet-format/issues/441
> > >
> > > On Wed, Jan 29, 2025 at 3:49 PM Bryce Mecum <bryceme...@gmail.com> wrote:
> > >
> > > > Hello Parquet community,
> > > >
> > > > The Arrow project recently fixed a bug [1] in its C++ Parquet
> > > > implementation that was causing compliant Parquet files written by
> > > > recent versions of parquet-rs [2] to be unreadable by the C++
> > > > implementation due to differences in the implementation of Parquet’s
> > > > SizeStatistics feature [3]. This also affected the Arrow libraries
> > > > that bind to the C++ implementation, including PyArrow. The C++
> > > > implementation has been patched [4] and a new Arrow release (19.0.1)
> > > > is in the works.
> > > >
> > > > Given this, I wanted to start a discussion about what kind of
> > > > cross-implementation testing facilities may already exist in any of
> > > > the Parquet implementations and what kind of testing facilities might
> > > > be created to help catch situations like these.
> > > >
> > > > I’ll start off with my thoughts and encourage people to jump in:
> > > >
> > > > 1. The specific integration test that could have been run to catch
> > > > this bug would be a test that used the Arrow 19.0.0 release candidate
> > > > to read any Parquet file written by parquet-rs >=53.0. This would have
> > > > halted the release process. Should the Arrow project just add a CI job
> > > > like this and move on?
> > > > 2. Testing every combination of Parquet format versions, feature
> > > > toggles, implementations, and implementation versions is clearly too
> > > > large a problem to solve so it might be best to start off with a
> > > > narrow scope.
> > > >
> > > > Please note that I've cross-posted this to the Apache Arrow mailing
> > > > list. Please reply to the Apache Parquet post. I’m looking forward to
> > > > hearing others’ thoughts and ideas.
> > > >
> > > > Thanks,
> > > > Bryce
> > > >
> > > > [1] https://github.com/apache/arrow/issues/45283
> > > > [2] https://github.com/apache/arrow-rs/tree/main/parquet
> > > > [3] https://github.com/apache/parquet-format/pull/197
> > > > [4] https://github.com/apache/arrow/pull/45285
> > > >
> > >
>
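
The integration test described in point 1 of the original message, combined with the suggestion to keep example files in parquet-testing or a staging area, could be sketched roughly as below. This is only an illustration under stated assumptions: the function names `find_parquet_files` and `check_readable` are hypothetical, and the reader is passed in as a plain callable so the harness itself does not depend on any particular Parquet implementation.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def find_parquet_files(root: Path) -> List[Path]:
    """Collect every *.parquet file under root (hypothetical helper)."""
    return sorted(root.rglob("*.parquet"))


def check_readable(
    files: Iterable[Path], read: Callable[[Path], object]
) -> Dict[str, str]:
    """Try to read each file with the reader under test.

    Returns a mapping of file path -> error message for every file the
    reader could not open. An empty mapping means all files were readable.
    """
    failures: Dict[str, str] = {}
    for path in files:
        try:
            read(path)
        except Exception as exc:  # any reader error counts as a failure
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    return failures
```

In a CI job, `read` might be `pyarrow.parquet.read_table` from a release candidate, and `root` a directory of files written by another implementation such as parquet-rs. A non-empty result would then be the signal to halt the release.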
