That sounds like a great way to frame that and solve that issue!
> On Apr 17, 2021, at 3:01 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> In principle, I don't see an issue with having a network of
> apache/arrow-* git repositories for Rust projects, so if the desire is
> to have a new GitHub repository for "revolution" crates (rewrites of
> more stable crates) versus the "evolution" crates, I think we could
> certainly do that.
>
>> On Sat, Apr 17, 2021 at 2:03 PM Evan Chan <e...@urbanlogiq.com> wrote:
>>
>> This sounds like really awesome work!
>>
>> If it is in its own repo, would that mean the current implementation in
>> Arrow would just be left there?
>> Good parquet support seems really important to have.
>>
>> Evan
>>
>>>> On Apr 17, 2021, at 3:14 AM, Andrew Lamb <al...@influxdata.com> wrote:
>>>
>>> It sounds like exciting work Jorge -- Thank you for the update!
>>>
>>> I wonder what you hope to gain by bringing it to an ASF repo that you can't
>>> get in your own repo?
>>>
>>> Perhaps you are ready to bring in other collaborators and wish to ensure
>>> they have undergone the Apache IP clearance process?
>>>
>>> Andrew
>>>
>>>
>>> On Fri, Apr 16, 2021 at 12:22 PM Jorge Cardoso Leitão <
>>> jorgecarlei...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> As briefly discussed in a recent email thread, I have been experimenting
>>>> with re-writing the Rust parquet implementation. I have not advertised this
>>>> much as I was very sceptical that this would work. I am now confident that
>>>> it can, and thus would like to share more details.
>>>>
>>>> parquet2 [1] is a rewrite of the parquet crate taking security,
>>>> performance, and parallelism as requirements.
>>>>
>>>> Here are the highlights so far:
>>>>
>>>> - Security: *no use of unsafe*. All invariants about memory and thread
>>>> safety are proven by the Rust compiler (an audit to its 3 mandatory + 5
>>>> optional compressors is still required). (compare e.g. ARROW-10920).
>>>>
>>>> - Performance: to the best of my benchmarking capabilities, *3-15x faster*
>>>> than the parquet crate, both reading and writing to arrow. It has about the
>>>> same performance as pyarrow/c++. These numbers correspond to a single plain
>>>> page with 10% nulls and increase with increasing slot number / page size
>>>> (which imo is a relevant unit of work). See [2] for plots, numbers and
>>>> references to exact commits.
>>>>
>>>> - Features: it reads parquet optional primitive types, V1 and V2,
>>>> dictionary- and non-dictionary pages, rep and def levels, and metadata. It
>>>> reads 1-level nullable lists. It writes non-dictionary V1 pages with PLAIN
>>>> and RLE encoding. No delta-encoding yet. No statistics yet.
>>>>
>>>> - Integration: it is integration-tested against parquet generated by
>>>> pyarrow==3, and round trip tests for the write.
>>>>
>>>> The public API is just functions and iterators generics. An important
>>>> design choice is that there is a strict separation between IO-bound
>>>> operations (read and seek) and CPU-bound operations (decompress, decode,
>>>> deserialize). This gives consumers (read datafusion, polars, etc.) the
>>>> choice of deciding how they want to parallelize the work among threads.
>>>>
>>>> I investigated async and AFAIU we first need to add support to it on the
>>>> thrift crate [3], as it currently does not have an API to use the
>>>> futures::AsyncRead and futures::AsyncSeek traits.
>>>>
>>>> parquet2 is in-memory model -independent; it just exposes an API to read
>>>> the parquet format according to the spec. It delegates to consumers how to
>>>> deserialize the pages to it (I implemented it for arrow2 and native rust),
>>>> offering a toolkit to help them. imo this is important because imo it
>>>> should be the in-memory representation to decide how to best convert a
>>>> decompressed page to memory.
>>>>
>>>> The development is happening on my own repo, but I was hoping to bring it
>>>> to ASF (experimental repo?). if you think that Apache Arrow could be a
>>>> place to host it (Apache Parquet is another option?).
>>>>
>>>> [1] https://github.com/jorgecarleitao/parquet2
>>>> [2] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=0
>>>> [3] https://issues.apache.org/jira/browse/THRIFT-4777
>>>>
>>>> Best,
>>>> Jorge
>>>>
>>
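
The split between IO-bound steps (read and seek) and CPU-bound steps (decompress, decode, deserialize) described in Jorge's message might look roughly like the sketch below. This is not parquet2's API: `RawPage`, `read_pages`, `decode_page`, and `read_then_decode_in_parallel` are hypothetical names made up for illustration, and the "decode" step is a toy that treats page bytes as little-endian `i32` values; it does not decompress or decode real parquet pages. Only the Rust standard library is assumed.

```rust
use std::io::{Read, Seek, SeekFrom};
use std::thread;

/// Hypothetical stand-in for a page: raw bytes copied out of the file, not yet decoded.
pub struct RawPage {
    pub bytes: Vec<u8>,
}

/// IO-bound step: seek to each (offset, length) location and read the bytes.
/// No decoding happens here, so the caller decides where this work runs.
pub fn read_pages<R: Read + Seek>(
    reader: &mut R,
    locations: &[(u64, usize)],
) -> std::io::Result<Vec<RawPage>> {
    let mut pages = Vec::with_capacity(locations.len());
    for &(offset, len) in locations {
        reader.seek(SeekFrom::Start(offset))?;
        let mut bytes = vec![0u8; len];
        reader.read_exact(&mut bytes)?;
        pages.push(RawPage { bytes });
    }
    Ok(pages)
}

/// CPU-bound step: a toy "decode" that reads the page as little-endian i32 values.
/// A real reader would decompress and decode according to the parquet spec.
pub fn decode_page(page: &RawPage) -> Vec<i32> {
    page.bytes
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// One possible consumer policy: read sequentially on the calling thread,
/// then decode every page on its own thread and collect results in page order.
pub fn read_then_decode_in_parallel<R: Read + Seek>(
    reader: &mut R,
    locations: &[(u64, usize)],
) -> std::io::Result<Vec<Vec<i32>>> {
    let pages = read_pages(reader, locations)?;
    let handles: Vec<_> = pages
        .into_iter()
        .map(|page| thread::spawn(move || decode_page(&page)))
        .collect();
    Ok(handles
        .into_iter()
        .map(|h| h.join().expect("decode thread panicked"))
        .collect())
}
```

A consumer could pass a `std::io::Cursor` or a `std::fs::File` as the reader. Because `decode_page` never touches the reader, a different consumer could equally call it from a thread pool or keep everything on one thread, which is the kind of choice the message says is left to consumers.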