Hi folks,

Neal Richardson suggested on the rOpenSci Slack that I pose this question to this list.

As an observer of both communities, I'm interested in whether there is, or could be, more communication between the Pangeo community's work on Zarr serialization and what the Arrow team has done with Parquet. I recognize that these are different formats serving different purposes, but there seem to be a number of reasonably low-level optimizations in how Arrow handles Parquet (range request parsing, data type conversion, compression/decompression, streaming, and much else) that might be useful in the Zarr context.

This discussion on the Pangeo forum may be illustrative: https://discourse.pangeo.io/t/why-is-parquet-pandas-faster-than-zarr-xarray-here/2513. I don't want to get too caught up in that particular example, since I know the use cases will usually differ, but I think it illustrates only a slice of the potential differences, mostly at a high level (i.e., overhead from dask, and perhaps fsspec, in the zarr stack).
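To make the comparison concrete, a stripped-down version of that kind of benchmark, with dask and xarray taken out of the picture, might look something like the sketch below. The file paths are placeholders, and I'm assuming a Parquet file and a single-array Zarr store holding comparable data:

    import time
    import pyarrow.parquet as pq
    import zarr

    # Time Arrow's Parquet reader on a local file (placeholder path).
    t0 = time.perf_counter()
    table = pq.read_table("data.parquet")
    t1 = time.perf_counter()

    # Time plain zarr (no dask, no xarray) reading the whole array;
    # assumes data.zarr is a store containing a single array.
    arr = zarr.open("data.zarr", mode="r")[:]
    t2 = time.perf_counter()

    print(f"parquet: {t1 - t0:.3f}s  zarr: {t2 - t1:.3f}s")

That isolates the two readers themselves, which is where I'd expect Arrow's low-level work to show up, if anywhere.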
Thanks for considering this; I think y'all probably know some of the Zarr devs already. I don't mean to meddle. I just know how easy it is for expertise to become siloed, and I'm always amazed at how much one community can learn from another!

Best regards to you all, and huge appreciation for your contributions to the open source and data science communities,

Carl

---
http://carlboettiger.info