> Has there been any discussion about rewriting parts of zarr in Rust (for
> example, the IO management stack would be a prime candidate for this type
> of treatment)?

One project from the DataFusion community that might be interesting is [1],
a native Rust implementation for reading/writing the Zarr format into Arrow
(from which you could make a pandas DataFrame, for example). I haven't used
it myself, but it might be worth evaluating.

Andrew

[1]: https://github.com/datafusion-contrib/arrow-zarr

On Tue, Jul 16, 2024 at 11:57 AM Antoine Pitrou <anto...@python.org> wrote:

> Hi Carl,
>
> On 08/07/2024 at 18:43, Carl Boettiger wrote:
> >
> > As an observer to both communities, I'm interested in whether there is
> > or might be more communication between the Pangeo community's focus on
> > Zarr serialization and what the Arrow team has done with Parquet. I
> > recognize that these are different formats that serve different
> > purposes, but frankly it seems there are a lot of reasonably low-level
> > optimizations in how Arrow handles range request parsing (data type
> > conversion, compression/decompression, streaming, much else) on Parquet
> > that I was wondering might be useful in the Zarr context.
>
> Well, Parquet is a rather sophisticated format and the C++ Parquet
> implementation inside PyArrow is not meant for anything other than
> reading Parquet files :-) In other words, I'm afraid there's not much to
> reuse for other purposes there.
>
> > This discussion on the Pangeo forum may be illustrative:
> > https://discourse.pangeo.io/t/why-is-parquet-pandas-faster-than-zarr-xarray-here/2513
> > I don't want to get too caught up in that particular example since I
> > know the use cases will usually differ, but I think it illustrates only
> > a slice of the potential differences, mostly at a high level (i.e.
> > overhead from dask and maybe fsspec use in zarr).
>
> PyArrow uses the Arrow C++ filesystems under the hood (*), which might
> be faster than fsspec in some cases (by virtue of being implemented in
> C++). However, it's also possible that being implemented in Python would
> allow fsspec to implement more sophisticated optimizations, so this is
> worth measuring on a case-by-case basis.
>
> (*) https://arrow.apache.org/docs/dev/python/filesystems.html
>
> Feel free to ask any other questions, we're ready to help. If there's
> enough interest, perhaps we could even schedule a call with various
> parties at some time.
>
> Regards
>
> Antoine.