Hi Carl,

I agree that cross-collaboration and knowledge/tools sharing could be very
helpful. Even though we've done a lot of engineering on low-level IO and
memory management, there are probably still many aspects of the Parquet C++
reader (what powers pyarrow.parquet) that could be improved: better IO and
CPU thread scheduling, as well as faster single-threaded Parquet file
parsing and materialization.

I'm not familiar with the internals of how zarr works. Since it is not a
pure columnar / tabular file format, I would guess that some of the overhead
comes from the workloads that zarr is designed for and may be hard to make
go away. That said, it looks like zarr is written in pure Python. Has there
been any discussion about rewriting parts of zarr in Rust (for example, the
IO management stack would be a prime candidate for this type of
treatment)?

Polars, DuckDB, and DataFusion also have their own wholly independent IO
and Parquet reading stacks, so these could definitely be pulled in as
comparison points to get other illustrative performance numbers.
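
As a rough sketch of how such a comparison could be set up, a small timing
harness like the one below might be enough to start with. The helper name
time_readers and the path "data.parquet" are placeholders made up for
illustration, and the commented-out reader calls assume each library is
installed; DataFusion could be plugged in the same way.

```python
import time
from statistics import median
from typing import Callable, Dict


def time_readers(
    readers: Dict[str, Callable[[], object]], repeats: int = 5
) -> Dict[str, float]:
    """Return the median wall-clock seconds per reader over `repeats` runs."""
    results = {}
    for name, read in readers.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            read()  # the result is discarded; only the elapsed time is kept
            samples.append(time.perf_counter() - start)
        results[name] = median(samples)
    return results


# Hypothetical usage ("data.parquet" is a placeholder path, and each
# library has to be installed separately):
#
#   import pyarrow.parquet as pq
#   import polars as pl
#   import duckdb
#
#   timings = time_readers({
#       "pyarrow": lambda: pq.read_table("data.parquet"),
#       "polars": lambda: pl.read_parquet("data.parquet"),
#       "duckdb": lambda: duckdb.sql(
#           "SELECT * FROM 'data.parquet'"
#       ).fetch_arrow_table(),
#   })
```

Numbers from something this simple would only be illustrative: they ignore
cold versus warm caches, remote versus local storage, and differences in
what each library materializes.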

Thanks
Wes

On Mon, Jul 8, 2024 at 10:11 PM Carl Boettiger
<cboet...@berkeley.edu.invalid> wrote:

> Hi folks,
>
> Neal Richardson suggested on the rOpenSci slack I might pose this question
> to this list.
>
> As an observer of both communities, I'm interested in whether there is or
> might be more communication between the Pangeo community's focus on Zarr
> serialization and what the Arrow team has done with Parquet. I recognize
> that these are different formats that serve different purposes, but frankly
> it seems there are a lot of reasonably low-level optimizations in how Arrow
> handles range request parsing (data type conversion,
> compression/decompression, streaming, much else) on Parquet that I was
> wondering might be useful in the Zarr context.
>
> This discussion on the Pangeo forum may be illustrative:
>
> https://discourse.pangeo.io/t/why-is-parquet-pandas-faster-than-zarr-xarray-here/2513
>
> I don't want to get too caught up in that particular example since I
> know the use cases will usually differ, but I think it illustrates only a
> slice of the potential differences, mostly at a high level (i.e. overhead
> from dask and maybe fsspec use in zarr).
>
> Thanks for considering, I think y'all probably know some of the zarr devs
> already.  I don't mean to meddle, I just know how easy it is for expertise
> to become siloed and am always amazed at how much one community can learn
> from another!
>
> Best regards to you all, and I hugely appreciate your contributions to the
> open source and data science communities,
>
> Carl
>
> ---
> http://carlboettiger.info
>
