Re: [RUST] parquet2 experiment

Benjamin Blodgett Sat, 17 Apr 2021 15:43:15 -0700

That sounds like a great way to frame that and solve that issue!


> On Apr 17, 2021, at 3:01 PM, Wes McKinney <[email protected]> wrote:
> 
> In principle, I don't see an issue with having a network of
> apache/arrow-* git repositories for Rust projects, so if the desire is
> to have a new GitHub repository for "revolution" crates (rewrites of
> more stable crates) versus the "evolution" crates, I think we could
> certainly do that.
> 
>> On Sat, Apr 17, 2021 at 2:03 PM Evan Chan <[email protected]> wrote:
>> 
>> This sounds like really awesome work!
>> 
>> If it is in its own repo, would that mean the current implementation in 
>> Arrow would just be left there?
>> Good parquet support seems really important to have.
>> 
>> Evan
>> 
>>>> On Apr 17, 2021, at 3:14 AM, Andrew Lamb <[email protected]> wrote:
>>> 
>>> It sounds like exciting work Jorge -- Thank you for the update!
>>> 
>>> I wonder what you hope to gain by bringing it to an ASF repo that you can't
>>> get in your own repo?
>>> 
>>> Perhaps you are ready to bring in other collaborators and wish to ensure
>>> they have undergone the Apache IP clearance process?
>>> 
>>> Andrew
>>> 
>>> 
>>> On Fri, Apr 16, 2021 at 12:22 PM Jorge Cardoso Leitão <
>>> [email protected]> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> As briefly discussed in a recent email thread, I have been experimenting
>>>> with re-writing the Rust parquet implementation. I have not advertised this
>>>> much as I was very sceptical that this would work. I am now confident that
>>>> it can, and thus would like to share more details.
>>>> 
>>>> parquet2 [1] is a rewrite of the parquet crate taking security,
>>>> performance, and parallelism as requirements.
>>>> 
>>>> Here are the highlights so far:
>>>> 
>>>> - Security: *no use of unsafe*. All invariants about memory and thread
>>>> safety are proven by the Rust compiler (an audit of its 3 mandatory + 5
>>>> optional compressors is still required). (compare e.g. ARROW-10920).
>>>> 
>>>> - Performance: to the best of my benchmarking capabilities, *3-15x faster*
>>>> than the parquet crate, both reading and writing to arrow. It has about the
>>>> same performance as pyarrow/c++. These numbers correspond to a single plain
>>>> page with 10% nulls and increase with increasing slot number / page size
>>>> (which imo is a relevant unit of work). See [2] for plots, numbers and
>>>> references to exact commits.
>>>> 
>>>> - Features: it reads parquet optional primitive types, V1 and V2,
>>>> dictionary- and non-dictionary pages, rep and def levels, and metadata. It
>>>> reads 1-level nullable lists. It writes non-dictionary V1 pages with PLAIN
>>>> and RLE encoding. No delta-encoding yet. No statistics yet.
>>>> 
>>>> - Integration: it is integration-tested against parquet files generated by
>>>> pyarrow==3, and has round-trip tests for the write path.
>>>> 
>>>> The public API is just functions and generic iterators. An important
>>>> design choice is that there is a strict separation between IO-bound
>>>> operations (read and seek) and CPU-bound operations (decompress, decode,
>>>> deserialize). This lets consumers (e.g. datafusion, polars) decide how
>>>> they want to parallelize the work among threads.
>>>> 
>>>> I investigated async and AFAIU we first need to add support for it to the
>>>> thrift crate [3], as it currently does not have an API to use the
>>>> futures::AsyncRead and futures::AsyncSeek traits.
>>>> 
>>>> parquet2 is independent of the in-memory model; it just exposes an API to
>>>> read the parquet format according to the spec. It leaves it to consumers
>>>> to decide how to deserialize the pages into their in-memory format (I
>>>> implemented this for arrow2 and native Rust), offering a toolkit to help
>>>> them. imo this is important because it should be the in-memory
>>>> representation that decides how to best convert a decompressed page to memory.
>>>> 
>>>> The development is happening on my own repo, but I was hoping to bring it
>>>> to ASF (experimental repo?), if you think that Apache Arrow could be a
>>>> place to host it (Apache Parquet is another option?).
>>>> 
>>>> [1] https://github.com/jorgecarleitao/parquet2
>>>> [2] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=0
>>>> [3] https://issues.apache.org/jira/browse/THRIFT-4777
>>>> 
>>>> Best,
>>>> Jorge
>>>> 
>>
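
As a rough illustration of the split between IO-bound and CPU-bound operations described in Jorge's message, here is a minimal sketch in Rust using only the standard library. The names `RawPage`, `read_pages`, `decode_page` and `decode_parallel`, as well as the toy byte layout and toy run-length encoding, are assumptions made up for this example; they are not parquet2's API and not the parquet format.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};
use std::thread;

/// Hypothetical stand-in for a page that has been read but not yet decoded.
struct RawPage {
    bytes: Vec<u8>,
}

/// IO-bound stage: seek to `offset`, then read `count` pages.
/// Toy layout (an assumption, not parquet): 1-byte length, then the payload.
fn read_pages<R: Read + Seek>(
    reader: &mut R,
    offset: u64,
    count: usize,
) -> std::io::Result<Vec<RawPage>> {
    reader.seek(SeekFrom::Start(offset))?;
    let mut pages = Vec::with_capacity(count);
    for _ in 0..count {
        let mut len = [0u8; 1];
        reader.read_exact(&mut len)?;
        let mut bytes = vec![0u8; len[0] as usize];
        reader.read_exact(&mut bytes)?;
        pages.push(RawPage { bytes });
    }
    Ok(pages)
}

/// CPU-bound stage: decode one page, with no IO involved.
/// Toy encoding (an assumption): (run length, value) byte pairs.
fn decode_page(page: &RawPage) -> Vec<u8> {
    page.bytes
        .chunks_exact(2)
        .flat_map(|pair| std::iter::repeat(pair[1]).take(pair[0] as usize))
        .collect()
}

/// The consumer chooses the parallelism strategy. Here: one scoped thread per page.
fn decode_parallel(pages: &[RawPage]) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        let handles: Vec<_> = pages
            .iter()
            .map(|p| s.spawn(move || decode_page(p)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("decode thread panicked"))
            .collect()
    })
}

fn main() -> std::io::Result<()> {
    // Two header bytes to skip, followed by two length-prefixed pages.
    let file = vec![0xAA, 0xBB, 4, 2, 7, 1, 9, 2, 3, 5];
    let mut reader = Cursor::new(file);
    let pages = read_pages(&mut reader, 2, 2)?; // IO-bound
    let columns = decode_parallel(&pages); // CPU-bound
    println!("{:?}", columns);
    Ok(())
}
```

Because `decode_page` takes only bytes that are already in memory, a consumer could replace `decode_parallel` with a thread pool or a single-threaded loop without touching the reading code.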
