On Thu, 22 Aug 2024 10:08:00 +0100
Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID> wrote:

> Right, but in a world dominated by networked, commodity object storage,
> perhaps optimising for efficient transport and decode might be right,
> and storage efficiency perhaps less important... I think it was the
> btrblocks paper that came to much the same conclusion.
"Efficient transport" means different things in different situations. On a very high-speed local network, Arrow IPC may shine. When transferring data between a datacenter (say over S3) and an analyst or scientist's workstation or laptop, Parquet should really help make the transmission faster (and perhaps less expensive in $ terms). > Anyway my broader point was that the underlying motivation for a lot of > this discussion appears to be a desire to make parquet better at > workloads it currently struggles with. Metadata is but a part of this, > but insufficient for many of the workloads discussed. When one also > considers the additional functionality required, e.g. fixed size lists, > efficient random-access reads, SIMD-friendly encodings, more > expressive/extension types, etc... I start to wonder... If feather added > one of the modern delta encodings, I think it would be almost perfect... There are a lot of dimensions in the domain space, and I'm skeptical that simply adding delta encodings to the Arrow in-memory format would really close the gap with Parquet, given that the latter has other sophisticated encodings, and richer metadata including statistics, page indices and optional bloom filters. Also, my intuition is that the investment to migrate Parquet to Flatbuffers should actually be smaller than the investment to fully support something like delta encodings in Arrow (without on-the-fly decoding to the equivalent of PLAIN). Regards Antoine.