On Thu, 22 Aug 2024 10:08:00 +0100
Raphael Taylor-Davies
<r.taylordav...@googlemail.com.INVALID>
wrote:
> Right, but in a world dominated by networked, commodity object storage, 
> perhaps optimising for efficient transport and decode might be right, 
> and storage efficiency perhaps less important... I think it was the 
> btrblocks paper that came to much the same conclusion.

"Efficient transport" means different things in different situations.
On a very high-speed local network, Arrow IPC may shine. When
transferring data between a datacenter (say over S3) and an analyst or
scientist's workstation or laptop, Parquet should really help make the
transmission faster (and perhaps less expensive in $ terms).
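
For instance, a rough, back-of-the-envelope way to see the difference
(a sketch in Python assuming pyarrow is installed; the table contents
are invented for illustration):

    import pyarrow as pa
    import pyarrow.ipc as ipc
    import pyarrow.parquet as pq

    # A repetitive, compressible table to mimic typical analytical data.
    table = pa.table({"city": ["Paris", "Lyon"] * 500_000,
                      "value": list(range(1_000_000))})

    # Arrow IPC stream: cheap to encode/decode, but larger on the wire.
    ipc_buf = pa.BufferOutputStream()
    with ipc.new_stream(ipc_buf, table.schema) as writer:
        writer.write_table(table)

    # Parquet: more work to decode, but much smaller to transfer.
    pq_buf = pa.BufferOutputStream()
    pq.write_table(table, pq_buf)

    print(f"Arrow IPC: {ipc_buf.getvalue().size:>12,} bytes")
    print(f"Parquet:   {pq_buf.getvalue().size:>12,} bytes")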

> Anyway my broader point was that the underlying motivation for a lot of 
> this discussion appears to be a desire to make parquet better at 
> workloads it currently struggles with. Metadata is but a part of this, 
> but insufficient for many of the workloads discussed. When one also 
> considers the additional functionality required, e.g. fixed size lists, 
> efficient random-access reads, SIMD-friendly encodings, more 
> expressive/extension types, etc... I start to wonder... If feather added 
> one of the modern delta encodings, I think it would be almost perfect...

There are a lot of dimensions in the domain space, and I'm skeptical
that simply adding delta encodings to the Arrow in-memory format would
really close the gap with Parquet, given that the latter has other
sophisticated encodings and richer metadata, including statistics,
page indexes and optional Bloom filters.
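
To make the metadata point concrete, here is a small sketch (again
assuming pyarrow; "example.parquet" is a made-up path) of the
per-column statistics a Parquet reader can consult to skip entire
row groups without touching the data pages:

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("example.parquet").metadata
    for rg in range(meta.num_row_groups):
        col = meta.row_group(rg).column(0)
        stats = col.statistics
        if stats is not None:
            # Min/max bounds let a reader prune row groups that
            # cannot match a predicate.
            print(rg, col.path_in_schema,
                  stats.min, stats.max, stats.null_count)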

Also, my intuition is that the investment needed to migrate Parquet's
metadata to Flatbuffers would actually be smaller than the investment
needed to fully support something like delta encodings in Arrow (that
is, to operate on them directly, without on-the-fly decoding to the
equivalent of PLAIN).
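
As a toy illustration of that last point (not the actual Parquet
DELTA_BINARY_PACKED format, just the principle): delta-encoded values
are not randomly accessible, because materializing values[i] requires
a prefix sum over the earlier deltas -- the on-the-fly decode to
PLAIN mentioned above:

    import numpy as np

    values = np.array([100, 102, 103, 107, 110], dtype=np.int64)

    # Encode: store the starting value followed by the successive
    # differences.
    deltas = np.diff(values, prepend=0)

    # Decode back to the PLAIN-equivalent representation via a
    # cumulative sum; an Arrow kernel operating on delta-encoded data
    # would have to do this (or something like it) on the fly.
    decoded = np.cumsum(deltas)
    assert np.array_equal(decoded, values)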

Regards

Antoine.

