I think the ideal scenario is a mix of "endogenous" unit testing and
functional testing against real files, to catch regressions and
cross-compatibility problems. To be critical of our own work in the C++
project: we have not done enough systematic integration testing, IMHO,
though we do test against some "bad files" that have accumulated over time.
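As a hedged sketch of the reproducibility point raised later in the thread (that any random test-data generation should use a fixed seed), here is a minimal, hypothetical corpus generator in Python using only the standard library; the schema and function name are illustrative, not part of any Arrow API:

```python
import random

def generate_rows(seed: int, num_rows: int):
    """Deterministically generate (int32-range, float64, utf8) test rows.

    Hypothetical sketch: a fixed seed makes the corpus reproducible,
    so a failure seen in one CI run can be replayed locally instead of
    chasing one-off random data.
    """
    rng = random.Random(seed)  # fixed seed => identical output every run
    return [
        (
            rng.randint(-2**31, 2**31 - 1),                     # int32-range value
            rng.uniform(-1e6, 1e6),                             # float64 value
            "".join(rng.choice("abcdefgh") for _ in range(8)),  # short utf8 string
        )
        for _ in range(num_rows)
    ]

# Same seed reproduces the same rows; a different seed diverges.
assert generate_rows(42, 100) == generate_rows(42, 100)
assert generate_rows(42, 100) != generate_rows(7, 100)
```

A generator like this can be rerun by any developer (e.g. inside a Dockerfile) to rebuild the exact corpus instead of checking binary files into the repository.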
In any case, I think it's bad practice for a file format reader to rely
exclusively on functional testing against static binary files. This could be
a good opportunity to devise a language-agnostic Parquet integration testing
strategy. Given that we're looking to add nested data support in C++
hopefully by the end of 2020, it would be good timing.

On Sat, Oct 12, 2019 at 11:12 AM Andy Grove <andygrov...@gmail.com> wrote:
>
> I also think that there are valid use cases for checking in binary files,
> but we have to be careful not to abuse this. For example, we might want to
> check in a Parquet file created by a particular version of Apache Spark to
> ensure that Arrow implementations can read it successfully (hypothetical
> example).
>
> It would also be good to have a small set of Parquet files using every
> possible data type that all implementations can use in their tests. I
> suppose we might want one set per Arrow format version as well.
>
> The problem we have now, in my opinion, is that we're proposing adding
> files on a pretty ad-hoc basis, driven by the needs of individual
> contributors in one language implementation, and this is perhaps happening
> because we don't already have a good set of standard test files.
>
> Renjie - perhaps you could comment on this. If we had these standard files
> covering all data types, would that have worked for you in this instance?
>
> Thanks,
>
> Andy.
>
> On Sat, Oct 12, 2019 at 12:03 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Wes,
> >
> > > I additionally would prefer generating the test corpus at test time
> > > rather than checking in binary files.
> >
> > Can you elaborate on this? I think both generated on the fly and example
> > files are useful.
> >
> > The checked-in files catch regressions even when readers/writers can read
> > their own data but have either incorrect or undefined behavior with
> > regard to the specification (for example, I would imagine checking in a
> > file as part of the fix for ARROW-6844
> > <https://issues.apache.org/jira/browse/ARROW-6844>).
> >
> > Thanks,
> > Micah
> >
> > On Thu, Oct 10, 2019 at 5:30 PM Renjie Liu <liurenjie2...@gmail.com>
> > wrote:
> >
> > > Thanks, Wes. Sure, I'll fix it.
> > >
> > > On Fri, Oct 11, 2019 at 6:10 AM, Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >
> > > > I just merged the PR https://github.com/apache/arrow-testing/pull/11
> > > >
> > > > Various aspects of this make me uncomfortable so I hope they can be
> > > > addressed in follow-up work
> > > >
> > > > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <liurenjie2...@gmail.com>
> > > > wrote:
> > > >
> > > > > I've created a ticket to track this here:
> > > > > https://issues.apache.org/jira/browse/ARROW-6845
> > > > >
> > > > > For the moment, can we check in those pregenerated data to unblock
> > > > > the Rust version's arrow reader?
> > > > >
> > > > > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <liurenjie2...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > It would be fine in that case.
> > > > > >
> > > > > > On Thu, Oct 10, 2019 at 12:58 PM, Wes McKinney
> > > > > > <wesmck...@gmail.com> wrote:
> > > > > >
> > > > > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu
> > > > > >> <liurenjie2...@gmail.com> wrote:
> > > > > >>
> > > > > >> > 1. There already exists a low-level parquet writer which can
> > > > > >> > produce parquet files, so unit tests should be fine. But a
> > > > > >> > writer from arrow to parquet doesn't exist yet, and it may
> > > > > >> > take some period of time to finish it.
> > > > > >> > 2. In fact my data are randomly generated and it's definitely
> > > > > >> > reproducible.
> > > > > >> > However, I don't think it would be a good idea to randomly
> > > > > >> > generate data every time we run CI, because it would be
> > > > > >> > difficult to debug. For example, if PR A introduced a bug
> > > > > >> > that is triggered in another PR's build, it would be
> > > > > >> > confusing for contributors.
> > > > > >>
> > > > > >> Presumably any random data generation would use a fixed seed
> > > > > >> precisely to be reproducible.
> > > > > >>
> > > > > >> > 3. I think it would be a good idea to spend effort on
> > > > > >> > integration tests with parquet because it's an important use
> > > > > >> > case of arrow. Also, a similar approach could be extended to
> > > > > >> > other languages and other file formats (avro, orc).
> > > > > >> >
> > > > > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney
> > > > > >> > <wesmck...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > > There are a number of issues worth discussion.
> > > > > >> > >
> > > > > >> > > 1. What is the timeline/plan for Rust implementing a Parquet
> > > > > >> > > _writer_? It's OK to be reliant on other libraries in the
> > > > > >> > > short term to produce files to test against, but that does
> > > > > >> > > not strike me as a sustainable long-term plan. Fixing bugs
> > > > > >> > > can be a lot more difficult than it needs to be if you can't
> > > > > >> > > write targeted "endogenous" unit tests
> > > > > >> > >
> > > > > >> > > 2. Reproducible data generation
> > > > > >> > >
> > > > > >> > > I think if you're going to test against a pre-generated
> > > > > >> > > corpus, you should make sure that generating the corpus is
> > > > > >> > > reproducible for other developers (i.e. with a Dockerfile),
> > > > > >> > > and can be extended by adding new files or random data
> > > > > >> > > generation.
> > > > > >> > >
> > > > > >> > > I additionally would prefer generating the test corpus at
> > > > > >> > > test time rather than checking in binary files. If this
> > > > > >> > > isn't viable right now we can create an "arrow-rust-crutch"
> > > > > >> > > git repository for you to stash binary files until some of
> > > > > >> > > these testing scalability issues are addressed.
> > > > > >> > >
> > > > > >> > > If we're going to spend energy on Parquet integration
> > > > > >> > > testing with Java, this would be a good opportunity to do
> > > > > >> > > the work in a way where the C++ Parquet library can also
> > > > > >> > > participate (since we ought to be doing integration tests
> > > > > >> > > with Java, and we can also read JSON files to Arrow).
> > > > > >> > >
> > > > > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu
> > > > > >> > > <liurenjie2...@gmail.com> wrote:
> > > > > >> > >
> > > > > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove
> > > > > >> > > > <andygrov...@gmail.com> wrote:
> > > > > >> > > >
> > > > > >> > > > > I'm very interested in helping to find a solution to
> > > > > >> > > > > this because we really do need integration tests for
> > > > > >> > > > > Rust to make sure we're compatible with other
> > > > > >> > > > > implementations... there is also the ongoing CI
> > > > > >> > > > > dockerization work that I feel is related.
> > > > > >> > > > >
> > > > > >> > > > > I haven't looked at the current integration tests yet
> > > > > >> > > > > and would appreciate some pointers on how all of this
> > > > > >> > > > > works (do we have docs?) or where to start looking.
> > > > > >> > > >
> > > > > >> > > > I have a test in my latest PR:
> > > > > >> > > > https://github.com/apache/arrow/pull/5523
> > > > > >> > > > And here is the generated data:
> > > > > >> > > > https://github.com/apache/arrow-testing/pull/11
> > > > > >> > > > As for the program to generate these data, it's just a
> > > > > >> > > > simple Java program. I'm not sure whether we need to
> > > > > >> > > > integrate it into arrow.
> > > > > >> > > >
> > > > > >> > > > > I imagine the integration test could follow the
> > > > > >> > > > > approach that Renjie is outlining where we call Java to
> > > > > >> > > > > generate some files and then call Rust to parse them?
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > >
> > > > > >> > > > > Andy.
> > > > > >> > > > >
> > > > > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu
> > > > > >> > > > > <liurenjie2...@gmail.com> wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hi:
> > > > > >> > > > > >
> > > > > >> > > > > > I'm developing the Rust version of the reader, which
> > > > > >> > > > > > reads parquet into arrow arrays. To verify the
> > > > > >> > > > > > correctness of this reader, I use the following
> > > > > >> > > > > > approach:
> > > > > >> > > > > >
> > > > > >> > > > > > 1. Define the schema with protobuf.
> > > > > >> > > > > > 2. Generate json data of this schema using another
> > > > > >> > > > > > language with a more sophisticated implementation
> > > > > >> > > > > > (e.g. java)
> > > > > >> > > > > > 3. Generate parquet data of this schema using another
> > > > > >> > > > > > language with a more sophisticated implementation
> > > > > >> > > > > > (e.g. java)
> > > > > >> > > > > > 4. Write tests to read the json file and the parquet
> > > > > >> > > > > > file into memory (arrow arrays), then compare the
> > > > > >> > > > > > json data with the arrow data.
> > > > > >> > > > > >
> > > > > >> > > > > > I think with this method we can guarantee the
> > > > > >> > > > > > correctness of the arrow reader, because the json
> > > > > >> > > > > > format is ubiquitous and its implementations are more
> > > > > >> > > > > > stable.
> > > > > >> > > > > >
> > > > > >> > > > > > Any comment is appreciated.
> > > > > >> > > > > >
> > > > > >> > > > > > --
> > > > > >> > > > > > Renjie Liu
> > > > > >> > > > > > Software Engineer, MVAD
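Step 4 of the approach quoted above (read both files and compare) can be sketched language-agnostically. The following is a minimal, hypothetical Python harness using only the standard library; `arrow_rows` stands in for the decoded output of whatever Parquet-to-Arrow reader is under test, and the JSON file, produced by a more mature implementation (e.g. Java), is treated as ground truth:

```python
import json
import tempfile

def compare_with_json_reference(json_path, arrow_rows):
    """Compare reader output against a JSON reference file.

    Hypothetical harness: `arrow_rows` is the list of row dicts decoded
    by the reader under test. Returns a list of human-readable mismatch
    descriptions; an empty list means the comparison passed.
    """
    with open(json_path) as f:
        expected = json.load(f)  # list of row dicts, one per record
    errors = []
    if len(expected) != len(arrow_rows):
        errors.append(f"row count: expected {len(expected)}, got {len(arrow_rows)}")
    for i, (exp, got) in enumerate(zip(expected, arrow_rows)):
        if exp != got:
            errors.append(f"row {i}: expected {exp!r}, got {got!r}")
    return errors

# Usage with an in-memory stand-in for the reader's output:
reference = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(reference, f)
    path = f.name

assert compare_with_json_reference(path, reference) == []
assert compare_with_json_reference(path, [{"id": 1, "name": "a"}]) != []
```

Reporting per-row mismatches rather than a single boolean is a deliberate choice here: when a checked-in or generated file regresses, the diff itself is usually the most useful debugging artifact.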