Hi Carl,

On 08/07/2024 at 18:43, Carl Boettiger wrote:

As an observer of both communities, I'm interested in whether there is, or
might be, more communication between the Pangeo community's focus on Zarr
serialization and what the Arrow team has done with Parquet.  I recognize
that these are different formats serving different purposes, but frankly
it seems there are a lot of reasonably low-level optimizations in how Arrow
handles range requests and parsing for Parquet (data type conversion,
compression/decompression, streaming, much else) that I suspect might be
useful in the Zarr context.

Well, Parquet is a rather sophisticated format, and the C++ Parquet implementation inside PyArrow is not meant for anything other than reading Parquet files :-) In other words, I'm afraid there's not much there to reuse for other purposes.
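To be concrete, the public surface sits at the level of whole Parquet files; a minimal example (file and column names made up):

    import pyarrow.parquet as pq

    # read an entire Parquet file (optionally a column subset) into an Arrow Table
    table = pq.read_table("data.parquet", columns=["temperature", "pressure"])
    print(table.schema)

The lower-level machinery (page decoding, decompression, and so on) lives in C++ behind that API and isn't exposed for standalone use.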

This discussion on the Pangeo forum may be illustrative:
https://discourse.pangeo.io/t/why-is-parquet-pandas-faster-than-zarr-xarray-here/2513
I don't want to get too caught up in that particular example, since I
know the use cases will usually differ, but I think it illustrates only a
slice of the potential differences, mostly at a high level (e.g. overhead
from dask and perhaps fsspec use in zarr).

PyArrow uses the Arrow C++ filesystems under the hood (*), which might be faster than fsspec in some cases (by virtue of being implemented in C++). However, it's also possible that being implemented in Python allows fsspec to apply more sophisticated optimizations, so this is worth measuring on a case-by-case basis.

(*) https://arrow.apache.org/docs/dev/python/filesystems.html
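If someone wants to measure this, the Parquet reader accepts either kind of filesystem object, so a comparison can be as simple as the sketch below (the bucket and object name are made up; s3fs is fsspec's S3 implementation):

    import time

    import pyarrow.parquet as pq
    from pyarrow import fs
    import s3fs

    PATH = "some-bucket/data.parquet"  # hypothetical object

    # Arrow's native C++ S3 filesystem
    arrow_s3 = fs.S3FileSystem(region="us-east-1")

    # fsspec's pure-Python S3 filesystem
    fsspec_s3 = s3fs.S3FileSystem()

    for name, filesystem in [("arrow", arrow_s3), ("fsspec", fsspec_s3)]:
        start = time.perf_counter()
        table = pq.read_table(PATH, filesystem=filesystem)
        print(f"{name}: {time.perf_counter() - start:.2f}s, {table.num_rows} rows")

Of course, the real numbers will depend on network latency, file layout and read patterns, which is why I'd measure on the actual workload.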

Feel free to ask any other questions; we're ready to help. If there's enough interest, perhaps we could even schedule a call with the various parties at some point.

Regards

Antoine.
