1. There is already a low-level Parquet writer that can produce Parquet
files, so unit tests should be fine. However, a writer from Arrow to Parquet
doesn't exist yet, and it may take some time to finish.
2. In fact my data is randomly generated, and it's definitely reproducible.
However, I don't think it would be a good idea to randomly generate data
every time we run CI, because that would make debugging difficult. For
example, if PR A introduced a bug that is triggered in another PR's build,
it would be confusing for contributors.
3. I think it would be a good idea to spend effort on integration tests with
Parquet because it's an important use case of Arrow. A similar approach
could also be extended to other languages and other file formats (Avro, ORC).
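To make point 2 concrete: the usual compromise is to generate the data
randomly but from a pinned seed, so every CI run (and every developer
reproducing a failure) sees byte-identical data. Below is a minimal,
hypothetical sketch in plain Rust (not the generator used in the PR; the
xorshift PRNG and the SEED constant are stand-ins) showing the idea:

```rust
// Sketch: deterministic "random" test data via a seeded PRNG.
// The same seed always yields the same sequence, so a failure seen in
// one PR's CI build reproduces identically everywhere.

/// Tiny xorshift64 PRNG (illustrative only; any seedable PRNG works).
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // State must be non-zero for xorshift.
        XorShift64 { state: seed.max(1) }
    }

    fn next(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

fn main() {
    // CI would pin this seed (e.g. in the test harness), rather than
    // drawing a fresh one on every run.
    const SEED: u64 = 42;

    let mut a = XorShift64::new(SEED);
    let mut b = XorShift64::new(SEED);
    let run1: Vec<u64> = (0..5).map(|_| a.next()).collect();
    let run2: Vec<u64> = (0..5).map(|_| b.next()).collect();

    // Two "runs" with the same seed produce identical data.
    assert_eq!(run1, run2);
    println!("reproducible: {:?}", run1);
}
```

With this scheme the corpus never needs to be checked in as binary files:
regenerating it from the seed is the reproduction step.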


On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <wesmck...@gmail.com> wrote:

> There are a number of issues worth discussion.
>
> 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> It's OK to be reliant on other libraries in the short term to produce
> files to test against, but does not strike me as a sustainable
> long-term plan. Fixing bugs can be a lot more difficult than it needs
> to be if you can't write targeted "endogenous" unit tests
>
> 2. Reproducible data generation
>
> I think if you're going to test against a pre-generated corpus, you
> should make sure that generating the corpus is reproducible for other
> developers (i.e. with a Dockerfile), and can be extended by adding new
> files or random data generation.
>
> I additionally would prefer generating the test corpus at test time
> rather than checking in binary files. If this isn't viable right now
> we can create an "arrow-rust-crutch" git repository for you to stash
> binary files until some of these testing scalability issues are
> addressed.
>
> If we're going to spend energy on Parquet integration testing with
> Java, this would be a good opportunity to do the work in a way where
> the C++ Parquet library can also participate (since we ought to be
> doing integration tests with Java, and we can also read JSON files to
> Arrow).
>
> On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <liurenjie2...@gmail.com>
> wrote:
> >
> > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <andygrov...@gmail.com>
> wrote:
> >
> > > I'm very interested in helping to find a solution to this because we
> really
> > > do need integration tests for Rust to make sure we're compatible with
> other
> > > implementations... there is also the ongoing CI dockerization work
> that I
> > > feel is related.
> > >
> > > I haven't looked at the current integration tests yet and would
> appreciate
> > > some pointers on how all of this works (do we have docs?) or where to
> start
> > > looking.
> > >
> > I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> > And here is the generated data:
> > https://github.com/apache/arrow-testing/pull/11
> > As for the program that generates these data, it's just a simple Java
> > program. I'm not sure whether we need to integrate it into Arrow.
> >
> > >
> > > I imagine the integration test could follow the approach that Renjie is
> > > outlining where we call Java to generate some files and then call Rust
> to
> > > parse them?
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <liurenjie2...@gmail.com>
> wrote:
> > >
> > > > Hi:
> > > >
> > > > I'm developing the Rust version of a reader which reads Parquet into
> > > > Arrow arrays. To verify the correctness of this reader, I use the
> > > > following approach:
> > > >
> > > >
> > > >    1. Define the schema with protobuf.
> > > >    2. Generate JSON data for this schema using another language with
> > > >    a more sophisticated implementation (e.g. Java).
> > > >    3. Generate Parquet data for this schema using another language
> > > >    with a more sophisticated implementation (e.g. Java).
> > > >    4. Write tests that read the JSON file and the Parquet file into
> > > >    memory (Arrow arrays), then compare the JSON data with the Arrow
> > > >    data.
> > > >
> > > >  I think with this method we can guarantee the correctness of the
> > > > Arrow reader, because the JSON format is ubiquitous and its
> > > > implementations are more stable.
> > > >
> > > > Any comment is appreciated.
> > > >
> > >
> >
> >
> > --
> > Renjie Liu
> > Software Engineer, MVAD
>


-- 
Renjie Liu
Software Engineer, MVAD
