I think the ideal scenario is a mix of "endogenous" unit testing and
functional testing against real files, to catch regressions and
cross-compatibility problems. To be critical of our own work in the C++
project: we have not done enough systematic integration testing, IMHO,
though we do test against some "bad files" that have accumulated over time.
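As a hedged sketch of the reproducibility point raised later in the thread (that any random test-data generation should use a fixed seed), here is a minimal, hypothetical corpus generator in Python using only the standard library; the schema and function name are illustrative, not part of any Arrow API:

```python
import random

def generate_rows(seed: int, num_rows: int):
    """Deterministically generate (int32-range, float64, utf8) test rows.

    Hypothetical sketch: a fixed seed makes the corpus reproducible,
    so a failure seen in one CI run can be replayed locally instead of
    chasing one-off random data.
    """
    rng = random.Random(seed)  # fixed seed => identical output every run
    return [
        (
            rng.randint(-2**31, 2**31 - 1),                     # int32-range value
            rng.uniform(-1e6, 1e6),                             # float64 value
            "".join(rng.choice("abcdefgh") for _ in range(8)),  # short utf8 string
        )
        for _ in range(num_rows)
    ]

# Same seed reproduces the same rows; a different seed diverges.
assert generate_rows(42, 100) == generate_rows(42, 100)
assert generate_rows(42, 100) != generate_rows(7, 100)
```

A generator like this can be rerun by any developer (e.g. inside a Dockerfile) to rebuild the exact corpus instead of checking binary files into the repository.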
In any case, I think it's bad practice for a file format reader to rely
exclusively on functional testing against static binary files. This could be
a good opportunity to devise a language-agnostic Parquet integration testing
strategy. Given that we're looking to add nested data support in C++
hopefully by the end of 2020, it would be good timing.

On Sat, Oct 12, 2019 at 11:12 AM Andy Grove <andygrov...@gmail.com> wrote:
>
> I also think that there are valid use cases for checking in binary files,
> but we have to be careful not to abuse this. For example, we might want to
> check in a Parquet file created by a particular version of Apache Spark to
> ensure that Arrow implementations can read it successfully (hypothetical
> example).
>
> It would also be good to have a small set of Parquet files using every
> possible data type that all implementations can use in their tests. I
> suppose we might want one set per Arrow format version as well.
>
> The problem we have now, in my opinion, is that we're proposing adding
> files on a pretty ad-hoc basis, driven by the needs of individual
> contributors in one language implementation, and this is perhaps happening
> because we don't already have a good set of standard test files.
>
> Renjie - perhaps you could comment on this. If we had these standard files
> covering all data types, would that have worked for you in this instance?
>
> Thanks,
>
> Andy.
>
> On Sat, Oct 12, 2019 at 12:03 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Wes,
> >
> > > I additionally would prefer generating the test corpus at test time
> > > rather than checking in binary files.
> >
> > Can you elaborate on this? I think both generated on the fly and example
> > files are useful.
> >
> > The checked-in files catch regressions even when readers/writers can read
> > their own data but have either incorrect or undefined behavior with
> > regard to the specification (for example, I would imagine checking in a
> > file as part of the fix for ARROW-6844
> > <https://issues.apache.org/jira/browse/ARROW-6844>).
> >
> > Thanks,
> > Micah
> >
> > On Thu, Oct 10, 2019 at 5:30 PM Renjie Liu <liurenjie2...@gmail.com>
> > wrote:
> >
> > > Thanks, Wes. Sure, I'll fix it.
> > >
> > > On Fri, Oct 11, 2019 at 6:10 AM, Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >
> > > > I just merged the PR https://github.com/apache/arrow-testing/pull/11
> > > >
> > > > Various aspects of this make me uncomfortable so I hope they can be
> > > > addressed in follow-up work
> > > >
> > > > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <liurenjie2...@gmail.com>
> > > > wrote:
> > > >
> > > > > I've created a ticket to track this here:
> > > > > https://issues.apache.org/jira/browse/ARROW-6845
> > > > >
> > > > > For the moment, can we check in those pregenerated data to unblock
> > > > > the Rust version's arrow reader?
> > > > >
> > > > > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <liurenjie2...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > It would be fine in that case.
> > > > > >
> > > > > > On Thu, Oct 10, 2019 at 12:58 PM, Wes McKinney
> > > > > > <wesmck...@gmail.com> wrote:
> > > > > >
> > > > > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu
> > > > > >> <liurenjie2...@gmail.com> wrote:
> > > > > >>
> > > > > >> > 1. There already exists a low-level parquet writer which can
> > > > > >> > produce parquet files, so unit tests should be fine. But a
> > > > > >> > writer from arrow to parquet doesn't exist yet, and it may
> > > > > >> > take some period of time to finish it.
> > > > > >> > 2. In fact my data are randomly generated and it's definitely
> > > > > >> > reproducible.
> > > > > >> > However, I don't think it would be a good idea to randomly
> > > > > >> > generate data every time we run CI, because it would be
> > > > > >> > difficult to debug. For example, if PR A introduced a bug
> > > > > >> > that is triggered in another PR's build, it would be
> > > > > >> > confusing for contributors.
> > > > > >>
> > > > > >> Presumably any random data generation would use a fixed seed
> > > > > >> precisely to be reproducible.
> > > > > >>
> > > > > >> > 3. I think it would be a good idea to spend effort on
> > > > > >> > integration tests with parquet because it's an important use
> > > > > >> > case of arrow. Also, a similar approach could be extended to
> > > > > >> > other languages and other file formats (avro, orc).
> > > > > >> >
> > > > > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney
> > > > > >> > <wesmck...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > > There are a number of issues worth discussion.
> > > > > >> > >
> > > > > >> > > 1. What is the timeline/plan for Rust implementing a Parquet
> > > > > >> > > _writer_? It's OK to be reliant on other libraries in the
> > > > > >> > > short term to produce files to test against, but that does
> > > > > >> > > not strike me as a sustainable long-term plan. Fixing bugs
> > > > > >> > > can be a lot more difficult than it needs to be if you can't
> > > > > >> > > write targeted "endogenous" unit tests
> > > > > >> > >
> > > > > >> > > 2. Reproducible data generation
> > > > > >> > >
> > > > > >> > > I think if you're going to test against a pre-generated
> > > > > >> > > corpus, you should make sure that generating the corpus is
> > > > > >> > > reproducible for other developers (i.e. with a Dockerfile),
> > > > > >> > > and can be extended by adding new files or random data
> > > > > >> > > generation.
> > > > > >> > >
> > > > > >> > > I additionally would prefer generating the test corpus at
> > > > > >> > > test time rather than checking in binary files. If this
> > > > > >> > > isn't viable right now we can create an "arrow-rust-crutch"
> > > > > >> > > git repository for you to stash binary files until some of
> > > > > >> > > these testing scalability issues are addressed.
> > > > > >> > >
> > > > > >> > > If we're going to spend energy on Parquet integration
> > > > > >> > > testing with Java, this would be a good opportunity to do
> > > > > >> > > the work in a way where the C++ Parquet library can also
> > > > > >> > > participate (since we ought to be doing integration tests
> > > > > >> > > with Java, and we can also read JSON files to Arrow).
> > > > > >> > >
> > > > > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu
> > > > > >> > > <liurenjie2...@gmail.com> wrote:
> > > > > >> > >
> > > > > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove
> > > > > >> > > > <andygrov...@gmail.com> wrote:
> > > > > >> > > >
> > > > > >> > > > > I'm very interested in helping to find a solution to
> > > > > >> > > > > this because we really do need integration tests for
> > > > > >> > > > > Rust to make sure we're compatible with other
> > > > > >> > > > > implementations... there is also the ongoing CI
> > > > > >> > > > > dockerization work that I feel is related.
> > > > > >> > > > >
> > > > > >> > > > > I haven't looked at the current integration tests yet
> > > > > >> > > > > and would appreciate some pointers on how all of this
> > > > > >> > > > > works (do we have docs?) or where to start looking.
> > > > > >> > > >
> > > > > >> > > > I have a test in my latest PR:
> > > > > >> > > > https://github.com/apache/arrow/pull/5523
> > > > > >> > > > And here is the generated data:
> > > > > >> > > > https://github.com/apache/arrow-testing/pull/11
> > > > > >> > > > As for the program to generate these data, it's just a
> > > > > >> > > > simple Java program. I'm not sure whether we need to
> > > > > >> > > > integrate it into arrow.
> > > > > >> > > >
> > > > > >> > > > > I imagine the integration test could follow the
> > > > > >> > > > > approach that Renjie is outlining where we call Java to
> > > > > >> > > > > generate some files and then call Rust to parse them?
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > >
> > > > > >> > > > > Andy.
> > > > > >> > > > >
> > > > > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu
> > > > > >> > > > > <liurenjie2...@gmail.com> wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hi:
> > > > > >> > > > > >
> > > > > >> > > > > > I'm developing the Rust version of the reader, which
> > > > > >> > > > > > reads parquet into arrow arrays. To verify the
> > > > > >> > > > > > correctness of this reader, I use the following
> > > > > >> > > > > > approach:
> > > > > >> > > > > >
> > > > > >> > > > > > 1. Define the schema with protobuf.
> > > > > >> > > > > > 2. Generate json data of this schema using another
> > > > > >> > > > > > language with a more sophisticated implementation
> > > > > >> > > > > > (e.g. java)
> > > > > >> > > > > > 3. Generate parquet data of this schema using another
> > > > > >> > > > > > language with a more sophisticated implementation
> > > > > >> > > > > > (e.g. java)
> > > > > >> > > > > > 4. Write tests to read the json file and the parquet
> > > > > >> > > > > > file into memory (arrow arrays), then compare the
> > > > > >> > > > > > json data with the arrow data.
> > > > > >> > > > > >
> > > > > >> > > > > > I think with this method we can guarantee the
> > > > > >> > > > > > correctness of the arrow reader, because the json
> > > > > >> > > > > > format is ubiquitous and its implementations are more
> > > > > >> > > > > > stable.
> > > > > >> > > > > >
> > > > > >> > > > > > Any comment is appreciated.
> > > > > >> > > > > >
> > > > > >> > > > > > --
> > > > > >> > > > > > Renjie Liu
> > > > > >> > > > > > Software Engineer, MVAD
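Step 4 of the approach quoted above (read both files and compare) can be sketched language-agnostically. The following is a minimal, hypothetical Python harness using only the standard library; `arrow_rows` stands in for the decoded output of whatever Parquet-to-Arrow reader is under test, and the JSON file, produced by a more mature implementation (e.g. Java), is treated as ground truth:

```python
import json
import tempfile

def compare_with_json_reference(json_path, arrow_rows):
    """Compare reader output against a JSON reference file.

    Hypothetical harness: `arrow_rows` is the list of row dicts decoded
    by the reader under test. Returns a list of human-readable mismatch
    descriptions; an empty list means the comparison passed.
    """
    with open(json_path) as f:
        expected = json.load(f)  # list of row dicts, one per record
    errors = []
    if len(expected) != len(arrow_rows):
        errors.append(f"row count: expected {len(expected)}, got {len(arrow_rows)}")
    for i, (exp, got) in enumerate(zip(expected, arrow_rows)):
        if exp != got:
            errors.append(f"row {i}: expected {exp!r}, got {got!r}")
    return errors

# Usage with an in-memory stand-in for the reader's output:
reference = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(reference, f)
    path = f.name

assert compare_with_json_reference(path, reference) == []
assert compare_with_json_reference(path, [{"id": 1, "name": "a"}]) != []
```

Reporting per-row mismatches rather than a single boolean is a deliberate choice here: when a checked-in or generated file regresses, the diff itself is usually the most useful debugging artifact.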