There are a number of issues here worth discussing.

1. What is the timeline/plan for Rust implementing a Parquet _writer_?
It's OK to rely on other libraries in the short term to produce files
to test against, but that does not strike me as a sustainable
long-term plan. Fixing bugs can be a lot more difficult than it needs
to be if you can't write targeted "endogenous" unit tests, like the
sketch below.
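
To make that concrete, here is roughly the kind of test I have in
mind. This is only a sketch: the `ArrowWriter` type and the
`read_parquet_to_batch` helper are hypothetical stand-ins for a
writer API that doesn't exist yet.

    use std::sync::Arc;

    use arrow::array::Int32Array;
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;

    #[test]
    fn round_trip_int32() {
        let schema = Arc::new(Schema::new(vec![Field::new(
            "id",
            DataType::Int32,
            false,
        )]));
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
        )
        .unwrap();

        let file = tempfile::tempfile().unwrap();

        // Hypothetical writer API; this is the piece Rust is missing.
        let mut writer =
            ArrowWriter::try_new(file.try_clone().unwrap(), schema, None)
                .unwrap();
        writer.write(&batch).unwrap();
        writer.close().unwrap();

        // Read back with the existing reader and compare to the original.
        // `read_parquet_to_batch` is a hypothetical helper over the reader.
        let read_back = read_parquet_to_batch(file);
        assert_eq!(batch, read_back);
    }

A test like this exercises the writer and reader against each other
with no external files involved.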

2. Reproducible data generation

I think if you're going to test against a pre-generated corpus, you
should make sure that generating the corpus is reproducible for other
developers (e.g. with a Dockerfile), and that the corpus can be
extended by adding new files or new random data generators.

I additionally would prefer generating the test corpus at test time
rather than checking in binary files, as sketched below. If this isn't
viable right now, we can create an "arrow-rust-crutch" git repository
for you to stash binary files in until some of these testing
scalability issues are addressed.
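
Test-time generation doesn't have to be elaborate to be useful,
either. A minimal sketch, assuming the arrow and rand crates; the
fixed seed is what keeps the generated data identical across
developers' machines and CI runs:

    use std::sync::Arc;

    use arrow::array::Int32Array;
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use rand::rngs::StdRng;
    use rand::{Rng, SeedableRng};

    /// Build a batch of pseudo-random Int32 data. With a fixed seed the
    /// "corpus" is reproducible without checking binary files into git.
    fn random_int32_batch(seed: u64, len: usize) -> RecordBatch {
        let mut rng = StdRng::seed_from_u64(seed);
        let values: Vec<i32> = (0..len).map(|_| rng.gen()).collect();
        let schema = Arc::new(Schema::new(vec![Field::new(
            "x",
            DataType::Int32,
            false,
        )]));
        RecordBatch::try_new(schema, vec![Arc::new(Int32Array::from(values))])
            .unwrap()
    }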

If we're going to spend energy on Parquet integration testing with
Java, this would be a good opportunity to do the work in a way that
lets the C++ Parquet library participate as well (we ought to be doing
integration tests with Java anyway, and C++ can also read JSON files
to Arrow).
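
On the Rust side, the comparison step in that setup is mechanical.
A minimal sketch, where both read helpers are hypothetical (one would
wrap the arrow crate's JSON reader, the other the new Parquet-to-Arrow
reader):

    use std::sync::Arc;

    use arrow::datatypes::Schema;
    use arrow::record_batch::RecordBatch;

    // Hypothetical helpers: `read_json_batches` wrapping arrow's JSON
    // reader, `read_parquet_batches` wrapping the Parquet-to-Arrow reader.
    fn assert_json_matches_parquet(
        json_path: &str,
        parquet_path: &str,
        schema: Arc<Schema>,
    ) {
        let json_batches: Vec<RecordBatch> =
            read_json_batches(json_path, schema.clone());
        let parquet_batches: Vec<RecordBatch> =
            read_parquet_batches(parquet_path, schema);

        assert_eq!(json_batches.len(), parquet_batches.len());
        for (j, p) in json_batches.iter().zip(&parquet_batches) {
            // Same schema and same values, batch by batch.
            assert_eq!(j, p);
        }
    }

Any implementation that can produce the same pair of files (Java,
C++) could then be checked by the same function.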

On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>
> On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> > I'm very interested in helping to find a solution to this because we really
> > do need integration tests for Rust to make sure we're compatible with other
> > implementations... there is also the ongoing CI dockerization work that I
> > feel is related.
> >
> > I haven't looked at the current integration tests yet and would appreciate
> > some pointers on how all of this works (do we have docs?) or where to start
> > looking.
> >
> I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> And here is the generated data:
> https://github.com/apache/arrow-testing/pull/11
> As for the program that generates this data, it's just a simple Java
> program. I'm not sure whether we need to integrate it into Arrow.
>
> >
> > I imagine the integration test could follow the approach that Renjie is
> > outlining, where we call Java to generate some files and then call Rust
> > to parse them?
> >
> > Thanks,
> >
> > Andy.
> >
> > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
> >
> > > Hi:
> > >
> > > I'm developing the Rust version of the reader, which reads Parquet
> > > into Arrow arrays. To verify the correctness of this reader, I use
> > > the following approach:
> > >
> > >
> > >    1. Define the schema with protobuf.
> > >    2. Generate JSON data for this schema using another language with
> > >    a more sophisticated implementation (e.g. Java).
> > >    3. Generate Parquet data for this schema using that same
> > >    implementation.
> > >    4. Write tests that read the JSON file and the Parquet file into
> > >    memory (Arrow arrays), then compare the JSON data with the Arrow
> > >    data.
> > >
> > > I think with this method we can guarantee the correctness of the
> > > Arrow reader, because the JSON format is ubiquitous and its
> > > implementations are more stable.
> > >
> > > Any comment is appreciated.
> > >
> >
>
>
> --
> Renjie Liu
> Software Engineer, MVAD
