Hi,

It now supports reading statistics and has experimental support for
delta encoding (e.g. it can read delta-string-encoded V2 pages created by
Spark 3).
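
For reference, delta-string encoding (DELTA_BYTE_ARRAY) stores, for each
value, the length of the prefix it shares with the previous value plus the
new suffix. Once those two streams are unpacked, the reconstruction step is
essentially the following (a minimal sketch, not parquet2's actual code):

// rebuild DELTA_BYTE_ARRAY values from already-unpacked prefix lengths
// and suffixes: each value = previous[..prefix_len] + suffix
fn decode_delta_byte_array(
    prefix_lengths: &[usize],
    suffixes: &[&[u8]],
) -> Vec<Vec<u8>> {
    let mut previous: Vec<u8> = Vec::new();
    let mut values = Vec::with_capacity(prefix_lengths.len());
    for (&prefix_len, suffix) in prefix_lengths.iter().zip(suffixes) {
        let mut value = previous[..prefix_len].to_vec();
        value.extend_from_slice(suffix);
        previous = value.clone();
        values.push(value);
    }
    values
}

// e.g. prefix_lengths [0, 5, 6] and suffixes ["spark", "-sql", "core"]
// decode to ["spark", "spark-sql", "spark-core"]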

(answers inline)

> Note you can also find files generated by other implementations here:
> https://github.com/apache/parquet-testing

Thanks! Yes, I have been using those as well. I still find myself
generating files on the fly whenever I need to test a specific (encoding,
page version, physical type) combination.

> I wonder what you hope to gain by bringing it to an ASF repo that you
can't get in your own repo?

The Apache way of building software collaboratively, including its
governance model (e.g. a PMC). In OSS-driven foundational technologies such
as Arrow and Parquet, players often have competing interests, and thus a
governance model that manages this is imo a requirement for success. The IP
clearance is a nice add-on (legally speaking), but not the main driver for
me.

> In principle, I don't see an issue with having a network of
apache/arrow-* git repositories for Rust projects, so if the desire is to
have a new GitHub repository for "revolution" crates (rewrites of more
stable crates) versus the "evolution" crates, I think we could certainly do
that.

Great, me too :-) I created a separate thread on this.

Best,
Jorge


On Sat, Apr 17, 2021 at 12:14 PM Andrew Lamb <al...@influxdata.com> wrote:

> It sounds like exciting work Jorge -- Thank you for the update!
>
> I wonder what you hope to gain by bringing it to an ASF repo that you can't
> get in your own repo?
>
> Perhaps you are ready to bring in other collaborators and wish to ensure
> they have undergone the Apache IP clearance process?
>
> Andrew
>
>
> On Fri, Apr 16, 2021 at 12:22 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > As briefly discussed in a recent email thread, I have been experimenting
> > with re-writing the Rust parquet implementation. I have not advertised
> > this much, as I was very sceptical that this would work. I am now
> > confident that it can, and thus would like to share more details.
> >
> > parquet2 [1] is a rewrite of the parquet crate taking security,
> > performance, and parallelism as requirements.
> >
> > Here are the highlights so far:
> >
> > - Security: *no use of unsafe*. All invariants about memory and thread
> > safety are proven by the Rust compiler (an audit of its 3 mandatory + 5
> > optional compressors is still required). (compare e.g. ARROW-10920).
> >
> > - Performance: to the best of my benchmarking capabilities, *3-15x
> > faster* than the parquet crate, both reading and writing to arrow. It
> > has about the same performance as pyarrow/c++. These numbers correspond
> > to a single plain page with 10% nulls and increase with increasing slot
> > number / page size (which imo is a relevant unit of work). See [2] for
> > plots, numbers and references to exact commits.
> >
> > - Features: it reads parquet optional primitive types, V1 and V2,
> > dictionary- and non-dictionary pages, rep and def levels, and metadata.
> > It reads 1-level nullable lists. It writes non-dictionary V1 pages with
> > PLAIN and RLE encoding. No delta-encoding yet. No statistics yet.
> >
> > - Integration: it is integration-tested against parquet files generated
> > by pyarrow==3, and has round-trip tests for the write path.
> >
> > The public API is just functions and iterator generics. An important
> > design choice is that there is a strict separation between IO-bound
> > operations (read and seek) and CPU-bound operations (decompress, decode,
> > deserialize). This gives consumers (read: datafusion, polars, etc.) the
> > choice of deciding how they want to parallelize the work among threads.
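> >
> > A minimal, self-contained sketch of the pattern this enables (the types
> > and the decoding below are stand-ins, not the actual parquet2 API):
> >
> > use std::sync::mpsc::channel;
> > use std::thread;
> >
> > // stand-ins: in parquet2 these would be compressed pages and decoded
> > // values of whatever in-memory model the consumer chose
> > type CompressedPage = Vec<u8>;
> > type DecodedPage = Vec<i32>;
> >
> > fn main() {
> >     let (tx, rx) = channel::<CompressedPage>();
> >
> >     // IO-bound half: a single thread reads pages sequentially
> >     // (here it fabricates them instead of read+seek on a file)
> >     let reader = thread::spawn(move || {
> >         for i in 0u8..4 {
> >             tx.send(vec![i; 1024]).unwrap();
> >         }
> >     });
> >
> >     // CPU-bound half: the consumer decides where decompress/decode/
> >     // deserialize run; a single worker here, but it could be a pool
> >     let worker = thread::spawn(move || {
> >         rx.into_iter()
> >             .map(|page| page.into_iter().map(i32::from).collect::<DecodedPage>())
> >             .collect::<Vec<_>>()
> >     });
> >
> >     reader.join().unwrap();
> >     let _decoded = worker.join().unwrap();
> > }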
> >
> > I investigated async, and AFAIU we first need support for it in the
> > thrift crate [3], as it currently does not have an API based on the
> > futures::AsyncRead and futures::AsyncSeek traits.
> >
> > parquet2 is independent of the in-memory model; it just exposes an API
> > to read the parquet format according to the spec. It delegates to
> > consumers how to deserialize the pages (I implemented it for arrow2 and
> > native Rust), offering a toolkit to help them. imo this is important
> > because it should be up to the in-memory representation to decide how to
> > best convert a decompressed page to memory.
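> >
> > Concretely, a consumer only has to provide the page -> array conversion.
> > An illustrative (hypothetical, not the real API) version of the contract:
> >
> > // what the crate hands over after the CPU-bound decompression step
> > struct DecompressedPage {
> >     values: Vec<u8>, // PLAIN-encoded values, for simplicity
> >     num_values: usize,
> > }
> >
> > // each in-memory model implements its own conversion
> > trait FromPage: Sized {
> >     fn from_page(page: &DecompressedPage) -> Self;
> > }
> >
> > // a native-Rust consumer: PLAIN-encoded i32 page -> Vec<i32>
> > impl FromPage for Vec<i32> {
> >     fn from_page(page: &DecompressedPage) -> Self {
> >         page.values
> >             .chunks_exact(4)
> >             .take(page.num_values)
> >             .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
> >             .collect()
> >     }
> > }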
> >
> > The development is happening on my own repo, but I was hoping to bring
> > it to the ASF (an experimental repo?), if you think that Apache Arrow
> > could be a place to host it (Apache Parquet is another option?).
> >
> > [1] https://github.com/jorgecarleitao/parquet2
> > [2] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=0
> > [3] https://issues.apache.org/jira/browse/THRIFT-4777
> >
> > Best,
> > Jorge
> >
>
