I've created a ticket to track this here: https://issues.apache.org/jira/browse/ARROW-6845
For the moment, can we check in the pregenerated data to unblock the Rust version's Arrow reader?

On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
> It would be fine in that case.
>
> Wes McKinney <wesmck...@gmail.com> wrote on Thu, Oct 10, 2019 at 12:58 PM:
>> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> >
>> > 1. There already exists a low-level Parquet writer which can produce
>> > Parquet files, so unit tests should be fine. But a writer from Arrow to
>> > Parquet doesn't exist yet, and it may take some time to finish it.
>> > 2. In fact my data is randomly generated and it's definitely
>> > reproducible. However, I don't think it would be a good idea to randomly
>> > generate data every time we run CI, because it would be difficult to
>> > debug. For example, if PR A introduced a bug which is triggered in
>> > another PR's build, it would be confusing for contributors.
>>
>> Presumably any random data generation would use a fixed seed precisely
>> to be reproducible.
>>
>> > 3. I think it would be a good idea to spend effort on integration tests
>> > with Parquet because it's an important use case of Arrow. A similar
>> > approach could also be extended to other languages and other file
>> > formats (Avro, ORC).
>> >
>> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> > > There are a number of issues worth discussing.
>> > >
>> > > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
>> > > It's OK to be reliant on other libraries in the short term to produce
>> > > files to test against, but that does not strike me as a sustainable
>> > > long-term plan. Fixing bugs can be a lot more difficult than it needs
>> > > to be if you can't write targeted "endogenous" unit tests.
>> > >
>> > > 2. Reproducible data generation
>> > >
>> > > I think if you're going to test against a pre-generated corpus, you
>> > > should make sure that generating the corpus is reproducible for other
>> > > developers (i.e. with a Dockerfile), and can be extended by adding new
>> > > files or random data generation.
>> > >
>> > > I additionally would prefer generating the test corpus at test time
>> > > rather than checking in binary files. If this isn't viable right now
>> > > we can create an "arrow-rust-crutch" git repository for you to stash
>> > > binary files until some of these testing scalability issues are
>> > > addressed.
>> > >
>> > > If we're going to spend energy on Parquet integration testing with
>> > > Java, this would be a good opportunity to do the work in a way where
>> > > the C++ Parquet library can also participate (since we ought to be
>> > > doing integration tests with Java, and we can also read JSON files to
>> > > Arrow).
>> > >
>> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> > > >
>> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <andygrov...@gmail.com> wrote:
>> > > >
>> > > > > I'm very interested in helping to find a solution to this because
>> > > > > we really do need integration tests for Rust to make sure we're
>> > > > > compatible with other implementations... there is also the ongoing
>> > > > > CI dockerization work that I feel is related.
>> > > > >
>> > > > > I haven't looked at the current integration tests yet and would
>> > > > > appreciate some pointers on how all of this works (do we have
>> > > > > docs?) or where to start looking.
>> > > >
>> > > > I have a test in my latest PR:
>> > > > https://github.com/apache/arrow/pull/5523
>> > > > And here is the generated data:
>> > > > https://github.com/apache/arrow-testing/pull/11
>> > > > As for the program to generate this data, it's just a simple Java
>> > > > program.
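[Editor's note: Wes's fixed-seed point above can be sketched in a few lines. This is a stdlib-only illustration; the `Lcg` generator and the nullable Int64 column are hypothetical stand-ins for a real seeded PRNG (e.g. a seedable RNG crate) and a real schema, not code from the thread.]

```rust
/// Minimal linear congruential generator: a stdlib-only stand-in for a
/// seedable PRNG. The constants are Knuth's MMIX LCG parameters.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

/// Generate a nullable Int64 column; every 7th value is null.
/// The same seed always yields the same column, so a bug triggered in
/// one PR's CI build is reproducible in any other checkout.
fn generate_column(seed: u64, len: usize) -> Vec<Option<i64>> {
    let mut rng = Lcg(seed);
    (0..len)
        .map(|i| {
            if i % 7 == 6 {
                None
            } else {
                Some((rng.next() % 1000) as i64)
            }
        })
        .collect()
}

fn main() {
    // Fixed seed -> identical data on every run.
    let a = generate_column(42, 100);
    let b = generate_column(42, 100);
    assert_eq!(a, b);
    println!("reproducible: {}", a == b);
}
```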
>> > > > I'm not sure whether we need to integrate it into Arrow.
>> > > >
>> > > > > I imagine the integration test could follow the approach that
>> > > > > Renjie is outlining, where we call Java to generate some files
>> > > > > and then call Rust to parse them?
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Andy.
>> > > > >
>> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi:
>> > > > > >
>> > > > > > I'm developing a Rust version of the reader which reads Parquet
>> > > > > > into Arrow arrays. To verify the correctness of this reader, I
>> > > > > > use the following approach:
>> > > > > >
>> > > > > > 1. Define the schema with protobuf.
>> > > > > > 2. Generate JSON data for this schema using another language
>> > > > > > with a more sophisticated implementation (e.g. Java).
>> > > > > > 3. Generate Parquet data for this schema using another language
>> > > > > > with a more sophisticated implementation (e.g. Java).
>> > > > > > 4. Write tests to read the JSON file and the Parquet file into
>> > > > > > memory (Arrow arrays), then compare the JSON data with the
>> > > > > > Arrow data.
>> > > > > >
>> > > > > > I think with this method we can guarantee the correctness of
>> > > > > > the Arrow reader, because the JSON format is ubiquitous and its
>> > > > > > implementations are more stable.
>> > > > > >
>> > > > > > Any comment is appreciated.

--
Renjie Liu
Software Engineer, MVAD
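[Editor's note: step 4 of the proposed approach can be sketched as follows. Both decoder functions are stdlib-only, hypothetical stand-ins: a real test would use the Rust Parquet/Arrow reader on one side and a JSON parser on the other, which this sketch does not attempt.]

```rust
/// Stand-in for "read the column from the JSON file". Since the JSON is
/// produced by the more mature Java implementation, it acts as the oracle.
fn column_from_json() -> Vec<Option<i64>> {
    vec![Some(1), None, Some(3)]
}

/// Stand-in for "read the same column from the Parquet file into an
/// Arrow array" with the Rust reader under test.
fn column_from_parquet() -> Vec<Option<i64>> {
    vec![Some(1), None, Some(3)]
}

fn main() {
    let expected = column_from_json();
    let actual = column_from_parquet();

    // Compare row counts first, then values element-wise, so a failure
    // pinpoints the first mismatching row.
    assert_eq!(expected.len(), actual.len(), "row count mismatch");
    for (i, (e, a)) in expected.iter().zip(actual.iter()).enumerate() {
        assert_eq!(e, a, "value mismatch at row {}", i);
    }
    println!("parquet column matches the JSON oracle");
}
```

The same per-column comparison extends to any type that both readers can decode; only the element type of the `Vec<Option<_>>` changes.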