I've created a ticket to track this here: https://issues.apache.org/jira/browse/ARROW-6845
For the moment, can we check in the pregenerated data to unblock the Rust version's Arrow reader?

On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
> It would be fine in that case.
>
> Wes McKinney <wesmck...@gmail.com> wrote on Thu, Oct 10, 2019 at 12:58 PM:
>> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> >
>> > 1. There already exists a low-level Parquet writer which can produce
>> > Parquet files, so unit tests should be fine. But a writer from Arrow to
>> > Parquet doesn't exist yet, and it may take some time to finish it.
>> > 2. In fact my data is randomly generated and it's definitely
>> > reproducible. However, I don't think it would be a good idea to randomly
>> > generate data every time we run CI, because it would be difficult to
>> > debug. For example, if PR A introduced a bug which is triggered in
>> > another PR's build, it would be confusing for contributors.
>>
>> Presumably any random data generation would use a fixed seed precisely
>> to be reproducible.
>>
>> > 3. I think it would be a good idea to spend effort on integration tests
>> > with Parquet because it's an important use case of Arrow. A similar
>> > approach could also be extended to other languages and other file
>> > formats (Avro, ORC).
>> >
>> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> > > There are a number of issues worth discussing.
>> > >
>> > > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
>> > > It's OK to be reliant on other libraries in the short term to produce
>> > > files to test against, but that does not strike me as a sustainable
>> > > long-term plan. Fixing bugs can be a lot more difficult than it needs
>> > > to be if you can't write targeted "endogenous" unit tests.
>> > >
>> > > 2. Reproducible data generation
>> > >
>> > > I think if you're going to test against a pre-generated corpus, you
>> > > should make sure that generating the corpus is reproducible for other
>> > > developers (i.e. with a Dockerfile), and can be extended by adding new
>> > > files or random data generation.
>> > >
>> > > I additionally would prefer generating the test corpus at test time
>> > > rather than checking in binary files. If this isn't viable right now
>> > > we can create an "arrow-rust-crutch" git repository for you to stash
>> > > binary files until some of these testing scalability issues are
>> > > addressed.
>> > >
>> > > If we're going to spend energy on Parquet integration testing with
>> > > Java, this would be a good opportunity to do the work in a way where
>> > > the C++ Parquet library can also participate (since we ought to be
>> > > doing integration tests with Java, and we can also read JSON files to
>> > > Arrow).
>> > >
>> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> > > >
>> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <andygrov...@gmail.com> wrote:
>> > > >
>> > > > > I'm very interested in helping to find a solution to this because
>> > > > > we really do need integration tests for Rust to make sure we're
>> > > > > compatible with other implementations... there is also the ongoing
>> > > > > CI dockerization work that I feel is related.
>> > > > >
>> > > > > I haven't looked at the current integration tests yet and would
>> > > > > appreciate some pointers on how all of this works (do we have
>> > > > > docs?) or where to start looking.
>> > > >
>> > > > I have a test in my latest PR:
>> > > > https://github.com/apache/arrow/pull/5523
>> > > > And here is the generated data:
>> > > > https://github.com/apache/arrow-testing/pull/11
>> > > > As for the program to generate this data, it's just a simple Java
>> > > > program.
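[Editor's note: Wes's fixed-seed point above can be sketched in a few lines. This is a stdlib-only illustration; the `Lcg` generator and the nullable Int64 column are hypothetical stand-ins for a real seeded PRNG (e.g. a seedable RNG crate) and a real schema, not code from the thread.]

```rust
/// Minimal linear congruential generator: a stdlib-only stand-in for a
/// seedable PRNG. The constants are Knuth's MMIX LCG parameters.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

/// Generate a nullable Int64 column; every 7th value is null.
/// The same seed always yields the same column, so a bug triggered in
/// one PR's CI build is reproducible in any other checkout.
fn generate_column(seed: u64, len: usize) -> Vec<Option<i64>> {
    let mut rng = Lcg(seed);
    (0..len)
        .map(|i| {
            if i % 7 == 6 {
                None
            } else {
                Some((rng.next() % 1000) as i64)
            }
        })
        .collect()
}

fn main() {
    // Fixed seed -> identical data on every run.
    let a = generate_column(42, 100);
    let b = generate_column(42, 100);
    assert_eq!(a, b);
    println!("reproducible: {}", a == b);
}
```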
>> > > > I'm not sure whether we need to integrate it into Arrow.
>> > > >
>> > > > > I imagine the integration test could follow the approach that
>> > > > > Renjie is outlining, where we call Java to generate some files
>> > > > > and then call Rust to parse them?
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Andy.
>> > > > >
>> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi:
>> > > > > >
>> > > > > > I'm developing a Rust version of the reader which reads Parquet
>> > > > > > into Arrow arrays. To verify the correctness of this reader, I
>> > > > > > use the following approach:
>> > > > > >
>> > > > > > 1. Define the schema with protobuf.
>> > > > > > 2. Generate JSON data for this schema using another language
>> > > > > > with a more sophisticated implementation (e.g. Java).
>> > > > > > 3. Generate Parquet data for this schema using another language
>> > > > > > with a more sophisticated implementation (e.g. Java).
>> > > > > > 4. Write tests to read the JSON file and the Parquet file into
>> > > > > > memory (Arrow arrays), then compare the JSON data with the
>> > > > > > Arrow data.
>> > > > > >
>> > > > > > I think with this method we can guarantee the correctness of
>> > > > > > the Arrow reader, because the JSON format is ubiquitous and its
>> > > > > > implementations are more stable.
>> > > > > >
>> > > > > > Any comment is appreciated.

--
Renjie Liu
Software Engineer, MVAD
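[Editor's note: step 4 of the proposed approach can be sketched as follows. Both decoder functions are stdlib-only, hypothetical stand-ins: a real test would use the Rust Parquet/Arrow reader on one side and a JSON parser on the other, which this sketch does not attempt.]

```rust
/// Stand-in for "read the column from the JSON file". Since the JSON is
/// produced by the more mature Java implementation, it acts as the oracle.
fn column_from_json() -> Vec<Option<i64>> {
    vec![Some(1), None, Some(3)]
}

/// Stand-in for "read the same column from the Parquet file into an
/// Arrow array" with the Rust reader under test.
fn column_from_parquet() -> Vec<Option<i64>> {
    vec![Some(1), None, Some(3)]
}

fn main() {
    let expected = column_from_json();
    let actual = column_from_parquet();

    // Compare row counts first, then values element-wise, so a failure
    // pinpoints the first mismatching row.
    assert_eq!(expected.len(), actual.len(), "row count mismatch");
    for (i, (e, a)) in expected.iter().zip(actual.iter()).enumerate() {
        assert_eq!(e, a, "value mismatch at row {}", i);
    }
    println!("parquet column matches the JSON oracle");
}
```

The same per-column comparison extends to any type that both readers can decode; only the element type of the `Vec<Option<_>>` changes.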