Re: Understanding possible synergies between arrow & zarr communities?

Andrew Lamb Wed, 17 Jul 2024 03:12:28 -0700

> Has there been any discussion about rewriting parts of zarr in Rust (for
> example, the IO management stack would be a prime candidate for this type
> of treatment)?


One project from the DataFusion community that might be interesting is [1],
a native Rust implementation of reading/writing the Zarr format into Arrow
(from which you could make a Pandas dataframe, for example). I haven't used
it myself, but it might be worth evaluating.

Andrew

[1]: https://github.com/datafusion-contrib/arrow-zarr

On Tue, Jul 16, 2024 at 11:57 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi Carl,
>
> On 08/07/2024 at 18:43, Carl Boettiger wrote:
> >
> > As an observer to both communities, I'm interested in if there is or
> > might be more communication between the Pangeo community's focus on Zarr
> > serialization with what the Arrow team has done with Parquet.  I
> > recognize that these are different formats that serve different purposes,
> > but frankly it seems there are a lot of reasonably low-level
> > optimizations in how Arrow handles range request parsing (data type
> > conversion, compression/decompression, streaming, much else) on Parquet
> > that I was wondering might be useful in the Zarr context.
>
> Well, Parquet is a rather sophisticated format and the C++ Parquet
> implementation inside PyArrow is not meant for anything else than
> reading Parquet files :-) In other words, I'm afraid there's not much to
> reuse for other purposes there.
>
> > This discussion on the Pangeo forum may be illustrative:
> >
> > https://discourse.pangeo.io/t/why-is-parquet-pandas-faster-than-zarr-xarray-here/2513
> > .  I don't want to get too caught up in that particular example since I
> > know the use cases will usually differ, but I think it illustrates only a
> > slice of the potential differences, mostly at a high level (i.e. overhead
> > from dask and maybe fsspec use in zarr).
>
> PyArrow uses the Arrow C++ filesystems under the hood (*), which might
> be faster than fsspec in some cases (by virtue of being implemented in
> C++). However, it's also possible that being implemented in Python would
> allow fsspec to implement more sophisticated optimizations, so this is
> worth measuring on a case-by-case basis.
>
> (*) https://arrow.apache.org/docs/dev/python/filesystems.html
>
> Feel free to ask any other questions, we're ready to help. If there's
> enough interest, perhaps we could even schedule a call with various
> parties at some time.
>
> Regards
>
> Antoine.
>
