> Also, my intuition is that the investment to migrate Parquet to
> Flatbuffers should actually be smaller than the investment to fully
> support something like delta encodings in Arrow (without on-the-fly
> decoding to the equivalent of PLAIN).

Fully agree on this.
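To make concrete why the "without on-the-fly decoding" part is the hard
bit, here is a minimal, hypothetical Python sketch of a delta encoding.
It is not Parquet's actual DELTA_BINARY_PACKED, which additionally
bit-packs the deltas in blocks; all names below are made up for
illustration:

from itertools import accumulate

def delta_encode(values):
    """Store the first value, then successive differences."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Prefix-sum the deltas to recover the original values."""
    return list(accumulate(deltas))

def value_at(deltas, i):
    """Random access costs O(i): every delta up to index i must be
    summed. A PLAIN-style flat buffer answers this with one load."""
    return sum(deltas[: i + 1])

timestamps = [1724300000, 1724300005, 1724300005, 1724300012]
deltas = delta_encode(timestamps)  # [1724300000, 5, 0, 7] -- small, compressible
assert delta_decode(deltas) == timestamps
assert value_at(deltas, 2) == timestamps[2]

The deltas compress nicely, but Arrow's in-memory contract is O(1)
random access into buffers, so every compute kernel touching such a
column would need a second code path (or a decode to PLAIN first),
which is where the investment goes.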
> > Right, but in a world dominated by networked, commodity object storage,
> > perhaps optimising for efficient transport and decode might be right,
> > and storage efficiency perhaps less important... I think it was the
> > btrblocks paper that came to much the same conclusion.

Yes, fast decompression speed is important too, but high compression
ratios still matter. In a perfect world, you would want both at the
same time. BtrBlocks also compresses heavily; in that respect it is far
away from Arrow.

Also note that storage cost is still a thing. With immutable table
formats like Iceberg and Delta Lake, which keep already-deleted data
around in older snapshots, you rack up a lot of data. We probably have
exabytes of data lying around. It does make a difference to your object
store bill whether you store 1 or 3 exabytes: at a ballpark object
store price of $0.02 per GB-month, those two extra exabytes come to
roughly $40 million per month.

Cheers
Jan

On Thu, 22 Aug 2024 at 14:54, Antoine Pitrou <anto...@python.org> wrote:

> On Thu, 22 Aug 2024 10:08:00 +0100
> Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID>
> wrote:
> > Right, but in a world dominated by networked, commodity object storage,
> > perhaps optimising for efficient transport and decode might be right,
> > and storage efficiency perhaps less important... I think it was the
> > btrblocks paper that came to much the same conclusion.
>
> "Efficient transport" means different things in different situations.
> On a very high-speed local network, Arrow IPC may shine. When
> transferring data between a datacenter (say over S3) and an analyst or
> scientist's workstation or laptop, Parquet should really help make the
> transmission faster (and perhaps less expensive in $ terms).
>
> > Anyway my broader point was that the underlying motivation for a lot of
> > this discussion appears to be a desire to make parquet better at
> > workloads it currently struggles with. Metadata is but a part of this,
> > but insufficient for many of the workloads discussed. When one also
> > considers the additional functionality required, e.g. fixed size lists,
> > efficient random-access reads, SIMD-friendly encodings, more
> > expressive/extension types, etc... I start to wonder... If feather added
> > one of the modern delta encodings, I think it would be almost perfect...
>
> There are a lot of dimensions in the domain space, and I'm skeptical
> that simply adding delta encodings to the Arrow in-memory format would
> really close the gap with Parquet, given that the latter has other
> sophisticated encodings, and richer metadata including statistics, page
> indices and optional bloom filters.
>
> Also, my intuition is that the investment to migrate Parquet to
> Flatbuffers should actually be smaller than the investment to fully
> support something like delta encodings in Arrow (without on-the-fly
> decoding to the equivalent of PLAIN).
>
> Regards
>
> Antoine.