Hi Carl,

On 08/07/2024 at 18:43, Carl Boettiger wrote:

As an observer of both communities, I'm interested in whether there is, or
might be, more communication between the Pangeo community's focus on Zarr
serialization and what the Arrow team has done with Parquet.  I recognize
that these are different formats serving different purposes, but frankly
it seems there are a lot of reasonably low-level optimizations in how Arrow
handles range requests and parsing for Parquet (data type conversion,
compression/decompression, streaming, much else) that I suspect might be
useful in the Zarr context.

Well, Parquet is a rather sophisticated format, and the C++ Parquet implementation inside PyArrow is not meant for anything other than reading Parquet files :-) In other words, I'm afraid there's not much there to reuse for other purposes.
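To be concrete, the public surface sits at the level of whole Parquet files; a minimal example (file and column names made up):

    import pyarrow.parquet as pq

    # read an entire Parquet file (optionally a column subset) into an Arrow Table
    table = pq.read_table("data.parquet", columns=["temperature", "pressure"])
    print(table.schema)

The lower-level machinery (page decoding, decompression, and so on) lives in C++ behind that API and isn't exposed for standalone use.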

This discussion on the Pangeo forum may be illustrative:
https://discourse.pangeo.io/t/why-is-parquet-pandas-faster-than-zarr-xarray-here/2513
I don't want to get too caught up in that particular example, since I
know the use cases will usually differ, but I think it illustrates only a
slice of the potential differences, mostly at a high level (e.g. overhead
from dask and perhaps fsspec use in zarr).

PyArrow uses the Arrow C++ filesystems under the hood (*), which might be faster than fsspec in some cases (by virtue of being implemented in C++). However, it's also possible that being implemented in Python allows fsspec to apply more sophisticated optimizations, so this is worth measuring on a case-by-case basis.

(*) https://arrow.apache.org/docs/dev/python/filesystems.html
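If someone wants to measure this, the Parquet reader accepts either kind of filesystem object, so a comparison can be as simple as the sketch below (the bucket and object name are made up; s3fs is fsspec's S3 implementation):

    import time

    import pyarrow.parquet as pq
    from pyarrow import fs
    import s3fs

    PATH = "some-bucket/data.parquet"  # hypothetical object

    # Arrow's native C++ S3 filesystem
    arrow_s3 = fs.S3FileSystem(region="us-east-1")

    # fsspec's pure-Python S3 filesystem
    fsspec_s3 = s3fs.S3FileSystem()

    for name, filesystem in [("arrow", arrow_s3), ("fsspec", fsspec_s3)]:
        start = time.perf_counter()
        table = pq.read_table(PATH, filesystem=filesystem)
        print(f"{name}: {time.perf_counter() - start:.2f}s, {table.num_rows} rows")

Of course, the real numbers will depend on network latency, file layout and read patterns, which is why I'd measure on the actual workload.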

Feel free to ask any other questions; we're ready to help. If there's enough interest, perhaps we could even schedule a call with the various parties at some point.

Regards

Antoine.
