>
> Also, my intuition is that the investment to migrate Parquet to
> Flatbuffers should actually be smaller than the investment to fully
> support something like delta encodings in Arrow (without on-the-fly
> decoding to the equivalent of PLAIN).
>

Fully agree on this.
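
For concreteness, a minimal sketch (plain Python with numpy, not
Arrow's or Parquet's actual implementation) of why a delta-encoded
column is compact but needs a decode pass before it looks like PLAIN
values:

    import numpy as np

    # Nearly-sorted int64 data: deltas are small and bit-pack well,
    # which is where the compression win comes from.
    values = np.array([1000, 1003, 1004, 1010, 1011], dtype=np.int64)

    def delta_encode(v):
        # Store the first value plus successive differences.
        return v[0], np.diff(v)

    def delta_decode(first, deltas):
        # A cumulative sum reconstructs the PLAIN representation;
        # this is the "on-the-fly decoding" step mentioned above.
        return np.concatenate(([first], first + np.cumsum(deltas)))

    first, deltas = delta_encode(values)
    assert np.array_equal(delta_decode(first, deltas), values)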

> Right, but in a world dominated by networked, commodity object storage,
> perhaps optimising for efficient transport and decode might be right,
> and storage efficiency perhaps less important... I think it was the
> btrblocks paper that came to much the same conclusion.
>

Yes, fast decompression speed is important too, but high compression ratios
still matter; in a perfect world you would want both at the same time.
BtrBlocks also compresses heavily, so it is far from Arrow in that regard.
Also note that storage cost is still a real concern: with immutable table
formats like Iceberg and Delta Lake, which retain even already-deleted data
from older versions, you rack up a lot of data.
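
If you want to see the gap for yourself, a quick sketch (not a proper
benchmark; the exact ratio depends heavily on the data and codec) is to
write the same table as Parquet and as Arrow IPC with pyarrow and
compare file sizes:

    import os
    import numpy as np
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # Sorted integers: a friendly case for Parquet's dictionary/delta
    # encodings, much less so for Arrow's plain in-memory layout.
    table = pa.table(
        {"x": np.sort(np.random.randint(0, 1_000_000, 10_000_000))}
    )

    pq.write_table(table, "data.parquet", compression="zstd")
    feather.write_feather(table, "data.arrow", compression="zstd")

    for path in ("data.parquet", "data.arrow"):
        print(path, os.path.getsize(path), "bytes")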

We probably have exabytes of data lying around, and it does make a
difference to your object store bill whether you store 1 or 3 exabytes...
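
As a back-of-envelope illustration (the $0.023/GB-month figure is an
assumed S3 Standard list price, not a number from this thread):

    # Assumed list price; actual pricing varies by region, tier,
    # and negotiated discounts.
    PRICE_PER_GB_MONTH = 0.023  # USD

    def monthly_bill_usd(exabytes):
        gigabytes = exabytes * 1_000_000_000  # 1 EB = 10^9 GB (decimal)
        return gigabytes * PRICE_PER_GB_MONTH

    # Storing 3 EB instead of 1 EB costs roughly an extra $46M/month.
    print(monthly_bill_usd(3) - monthly_bill_usd(1))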

Cheers
Jan

On Thu, 22 Aug 2024 at 14:54, Antoine Pitrou <anto...@python.org> wrote:

> On Thu, 22 Aug 2024 10:08:00 +0100
> Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.INVALID>
> wrote:
> > Right, but in a world dominated by networked, commodity object storage,
> > perhaps optimising for efficient transport and decode might be right,
> > and storage efficiency perhaps less important... I think it was the
> > btrblocks paper that came to much the same conclusion.
>
> "Efficient transport" means different things in different situations.
> On a very high-speed local network, Arrow IPC may shine. When
> transferring data between a datacenter (say over S3) and an analyst or
> scientist's workstation or laptop, Parquet should really help make the
> transmission faster (and perhaps less expensive in $ terms).
>
> > Anyway my broader point was that the underlying motivation for a lot of
> > this discussion appears to be a desire to make parquet better at
> > workloads it currently struggles with. Metadata is but a part of this,
> > but insufficient for many of the workloads discussed. When one also
> > considers the additional functionality required, e.g. fixed size lists,
> > efficient random-access reads, SIMD-friendly encodings, more
> > expressive/extension types, etc... I start to wonder... If feather added
> > one of the modern delta encodings, I think it would be almost perfect...
>
> There are a lot of dimensions in the domain space, and I'm skeptical
> that simply adding delta encodings to the Arrow in-memory format would
> really close the gap with Parquet, given that the latter has other
> sophisticated encodings, and richer metadata including statistics, page
> indices and optional bloom filters.
>
> Also, my intuition is that the investment to migrate Parquet to
> Flatbuffers should actually be smaller than the investment to fully
> support something like delta encodings in Arrow (without on-the-fly
> decoding to the equivalent of PLAIN).
>
> Regards
>
> Antoine.
>
