Hi folks,

Neal Richardson suggested on the rOpenSci slack I might pose this question
to this list.

As an observer of both communities, I'm interested in whether there is, or might be, more communication between the Pangeo community's work on Zarr serialization and what the Arrow team has done with Parquet.  I recognize that these are different formats serving different purposes, but it seems there are a lot of reasonably low-level optimizations in how Arrow handles range requests on Parquet (data type conversion, compression/decompression, streaming, and much else) that might be useful in the Zarr context.

This discussion on the Pangeo forum may be illustrative:
https://discourse.pangeo.io/t/why-is-parquet-pandas-faster-than-zarr-xarray-here/2513
I don't want to get too caught up in that particular example, since I know the use cases will usually differ, but I think it illustrates only a slice of the potential differences, mostly at a high level (e.g. overhead from dask, and perhaps fsspec use in zarr).

Thanks for considering this; I think y'all probably know some of the Zarr devs
already.  I don't mean to meddle; I just know how easy it is for expertise
to become siloed, and I'm always amazed at how much one community can learn
from another!

Best regards to you all, and huge appreciation for your contributions to the
open source and data science communities,

Carl

---
http://carlboettiger.info
