Hi,

It can now read statistics, and it has experimental support for the delta
encodings (e.g. it can read delta-string-encoded V2 pages created by
Spark 3).
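For context, the delta string encoding stores, for each value, the length
of the prefix it shares with the previous value, followed by the remaining
suffix. A toy sketch of the decoding idea (conceptual only; this is not the
parquet2 decoder, and the names are made up):

// Decode incrementally-encoded strings: each entry is
// (length of prefix shared with the previous value, suffix).
fn decode_delta_strings(encoded: &[(usize, &str)]) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(encoded.len());
    for &(prefix_len, suffix) in encoded {
        // Take the shared prefix from the previously decoded value.
        let mut value = match out.last() {
            Some(prev) => prev[..prefix_len].to_string(),
            None => String::new(), // the first value has no prefix
        };
        value.push_str(suffix);
        out.push(value);
    }
    out
}

fn main() {
    // "apple", "applet", "apply" share prefixes with their predecessors.
    let encoded = [(0, "apple"), (5, "t"), (4, "y")];
    assert_eq!(decode_delta_strings(&encoded), ["apple", "applet", "apply"]);
}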
(answers inline)

> Note you can also find files generated by other implementations here:
> https://github.com/apache/parquet-testing

Thanks! Yes, I have been using those as well. I found myself generating
files on the fly when I need to test a specific (encoding, page version,
physical type) combination.

> I wonder what you hope to gain by bringing it to an ASF repo that you
> can't get in your own repo?

The Apache way of building software collaboratively, including its
governance model (e.g. a PMC). In OSS-driven foundational technologies
such as Arrow and Parquet, players often have competing interests, and
thus a governance model that manages them is imo a requirement for
success. The IP clearance is a nice add-on (legally speaking), but not
the main driver for me.

> In principle, I don't see an issue with having a network of
> apache/arrow-* git repositories for Rust projects, so if the desire is
> to have a new GitHub repository for "revolution" crates (rewrites of
> more stable crates) versus the "evolution" crates, I think we could
> certainly do that.

Great, me too :-) I created a separate thread on this.

Best,
Jorge

On Sat, Apr 17, 2021 at 12:14 PM Andrew Lamb <al...@influxdata.com> wrote:

> It sounds like exciting work, Jorge -- thank you for the update!
>
> I wonder what you hope to gain by bringing it to an ASF repo that you
> can't get in your own repo?
>
> Perhaps you are ready to bring in other collaborators and wish to ensure
> they have undergone the Apache IP clearance process?
>
> Andrew
>
> On Fri, Apr 16, 2021 at 12:22 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > As briefly discussed in a recent email thread, I have been
> > experimenting with rewriting the Rust parquet implementation. I have
> > not advertised this much, as I was very sceptical that this would
> > work. I am now confident that it can, and thus would like to share
> > more details.
> >
> > parquet2 [1] is a rewrite of the parquet crate that takes security,
> > performance, and parallelism as requirements.
> >
> > Here are the highlights so far:
> >
> > - Security: *no use of unsafe*. All invariants about memory and thread
> > safety are proven by the Rust compiler (an audit of its 3 mandatory +
> > 5 optional compressors is still required). (Compare e.g. ARROW-10920.)
> >
> > - Performance: to the best of my benchmarking capabilities, *3-15x
> > faster* than the parquet crate, both reading and writing to arrow. It
> > has about the same performance as pyarrow/C++. These numbers
> > correspond to a single plain page with 10% nulls, and they increase
> > with increasing slot number / page size (which imo is a relevant unit
> > of work). See [2] for plots, numbers, and references to exact commits.
> >
> > - Features: it reads parquet optional primitive types, V1 and V2,
> > dictionary- and non-dictionary pages, rep and def levels, and
> > metadata. It reads 1-level nullable lists. It writes non-dictionary V1
> > pages with PLAIN and RLE encoding. No delta encodings yet. No
> > statistics yet.
> >
> > - Integration: it is integration-tested against parquet generated by
> > pyarrow==3, and it has round-trip tests for the write path.
> >
> > The public API consists of functions and generic iterators. An
> > important design choice is that there is a strict separation between
> > IO-bound operations (read and seek) and CPU-bound operations
> > (decompress, decode, deserialize). This gives consumers (read:
> > DataFusion, Polars, etc.) the choice of how they want to parallelize
> > the work among threads.
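To make this split concrete, here is a toy sketch of a consumer that keeps
the IO on one thread and does the CPU-bound decoding on another (all types
and functions below are stand-ins, not the actual parquet2 API):

use std::sync::mpsc::channel;
use std::thread;

struct CompressedPage(Vec<u8>); // stand-in for a raw page read from disk
struct DecodedPage(Vec<i32>);   // stand-in for a decoded page

// IO-bound: read and seek only; no decompression happens here.
fn read_pages() -> Vec<CompressedPage> {
    vec![CompressedPage(vec![1, 2, 3]), CompressedPage(vec![4, 5])]
}

// CPU-bound: decompress and decode a page into values.
fn decode(page: CompressedPage) -> DecodedPage {
    DecodedPage(page.0.into_iter().map(i32::from).collect())
}

fn main() {
    let (tx, rx) = channel();
    // One thread performs the IO...
    let io = thread::spawn(move || {
        for page in read_pages() {
            tx.send(page).unwrap();
        }
    });
    // ...while this thread (or a pool) performs the CPU-bound work.
    for page in rx {
        let decoded = decode(page);
        println!("decoded {} values", decoded.0.len());
    }
    io.join().unwrap();
}

Because the two kinds of work are exposed separately, a consumer is free
to swap the channel above for a thread pool, async tasks, or anything
else.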
> >
> > I investigated async, and AFAIU we first need to add support for it to
> > the thrift crate [3], as it currently does not have an API based on
> > the futures::AsyncRead and futures::AsyncSeek traits.
> >
> > parquet2 is in-memory-model independent; it just exposes an API to
> > read the parquet format according to the spec. It delegates to
> > consumers how to deserialize pages into their model (I implemented
> > this for arrow2 and native Rust), offering a toolkit to help them. imo
> > this is important because it should be the in-memory representation
> > that decides how to best convert a decompressed page to memory.
> >
> > The development is happening on my own repo, but I was hoping to bring
> > it to the ASF (an experimental repo?), if you think that Apache Arrow
> > could be a place to host it (Apache Parquet is another option?).
> >
> > [1] https://github.com/jorgecarleitao/parquet2
> > [2] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=0
> > [3] https://issues.apache.org/jira/browse/THRIFT-4777
> >
> > Best,
> > Jorge
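PS: to illustrate the point above about being independent of the in-memory
model, here is a toy sketch of how a consumer could materialize a decoded
page into its own representation (the types and the trait are made up for
illustration, not the actual parquet2 API):

// A decoded page: values plus a validity vector for nulls.
struct Page {
    values: Vec<i32>,
    validity: Vec<bool>,
}

// Each in-memory model implements its own conversion from a page.
trait FromPage {
    fn from_page(page: &Page) -> Self;
}

// A "native Rust" consumer materializes Option<i32>; an arrow2 consumer
// would instead build its own arrays from the same page.
impl FromPage for Vec<Option<i32>> {
    fn from_page(page: &Page) -> Self {
        page.values
            .iter()
            .zip(page.validity.iter())
            .map(|(v, ok)| if *ok { Some(*v) } else { None })
            .collect()
    }
}

fn main() {
    let page = Page {
        values: vec![1, 0, 3],
        validity: vec![true, false, true],
    };
    let native: Vec<Option<i32>> = FromPage::from_page(&page);
    assert_eq!(native, vec![Some(1), None, Some(3)]);
}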