Please don't interpret this as a harsh comment; I mean well and don't speak
for the whole community.

>  I think we would need a "semantic schema" or "logical schema" which
indicates the logical type but not the physical representation.

This separation between logical and physical gets mentioned a lot, but even
if the type-representation problem is solved, the real challenge remains:
compute kernels have to be complete enough to handle every physical form of
a logical type like "string", or the function dispatcher has to be smart
enough to cast on the fly as a non-ideal fallback (e.g. expanding
FSST-encoded strings to a more basic, widely supported format when the
compute kernel is more naive).
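
To make that concrete, here is a rough sketch of the cast-on-the-fly
fallback in pyarrow; the kernel and wrapper names are made up for
illustration, not existing Arrow or Datafusion APIs.

import pyarrow as pa
import pyarrow.compute as pc

def naive_upper(arr: pa.Array) -> pa.Array:
    # A kernel that only understands the plain utf8 layout.
    assert pa.types.is_string(arr.type)
    return pc.utf8_upper(arr)

def dispatch_upper(arr: pa.Array) -> pa.Array:
    # Non-ideal fallback: expand other physical "string" forms
    # (dictionary-encoded, run-end-encoded, string_view, ...) to plain
    # utf8 before calling the naive kernel, paying the materialization
    # cost up front.
    if not pa.types.is_string(arr.type):
        arr = arr.cast(pa.string())
    return naive_upper(arr)

dict_arr = pa.array(["foo", "foo", "bar"]).dictionary_encode()
print(dispatch_upper(dict_arr))

A real dispatcher would prefer a specialized kernel when one exists and
only cast like this as a last resort.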

That is what makes introducing these ideas to Arrow so tricky. Arrow,
being the data format that shines at interoperability, can't have unbounded
complexity. The definition of any logical type ends up coupled to a
particular compute system: Datafusion could define all the encodings that
form a logical type, but it's really hard to specify what a logical type is
meant to be on every compute system that uses Arrow. Even saying that the
"string" type should include REE-encoded and dictionary-encoded strings is
a stretch today; things start to break when you export arrays to more naive
compute systems that don't support those encodings yet. If you care less
about multiple implementations of the format and about interoperability,
and you can provide all compute functions in your closed system, then you
can greatly expand the formats and encodings you support. But it would be
unrealistic to expect more than one implementation of such a system.

> Arrow users typically expect to be able to perform operations like
"slice" and "take" which require some knowledge of the underlying type.

Exactly. There are many simplifying assumptions that can be made when one
gets an Arrow RecordBatch and wants to do something with it directly (i.e.
without using a library of compute kernels). It's already challenging
enough to get people to stop converting columnar data to row-based arrays.

My recommendation is that we start thinking about proposing formats and
logical types to "compute systems" and not to the "Arrow data format". IMO
"late materialization" doesn't make sense as an Arrow specification, unless
it's a series of widely useful canonical extension types expressible
through other storage types like binary or fixed-size-binary.
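
To illustrate the extension-type route with pyarrow's public API: a
hypothetical FSST string type could keep the compressed bytes in a plain
binary storage column and carry its symbol table in the extension
metadata. The name "example.fsst_string" and the metadata layout are
invented here; this is not an existing canonical type.

import pyarrow as pa

class FsstStringType(pa.ExtensionType):
    # Hypothetical type: each value holds FSST-compressed bytes; the
    # symbol table travels as extension metadata.
    def __init__(self, symbol_table: bytes):
        self._symbol_table = symbol_table
        super().__init__(pa.binary(), "example.fsst_string")

    def __arrow_ext_serialize__(self) -> bytes:
        return self._symbol_table

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls(serialized)

pa.register_extension_type(FsstStringType(symbol_table=b""))

A consumer that doesn't recognize the type still sees valid binary data
and can canonicalize (decompress) at the boundary.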

A compute system (e.g. Datafusion) that intends to implement late
materialization would have to expand its operand representation a bit to
take in non-materialized data handles, in ways that aren't expressible in
the open Arrow format.
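
Purely as a sketch of what I mean by expanding the operand representation
(none of these names exist in Datafusion), an engine-internal operand
could be either a materialized Arrow array or an opaque encoded handle
that only that engine's kernels understand:

from dataclasses import dataclass
from typing import List, Union

import pyarrow as pa

@dataclass
class EncodedHandle:
    # Internal to one engine; not expressible in the open Arrow format.
    encoding: str        # e.g. "fsst" or "alp"
    metadata: bytes      # per-batch parameters (symbol table, exponents, ...)
    buffers: List[pa.Buffer]

Operand = Union[pa.Array, EncodedHandle]

def materialize(op: Operand) -> pa.Array:
    # Canonicalization at the boundary: kernels that don't understand the
    # encoding force a decode back to a standard Arrow array.
    if isinstance(op, pa.Array):
        return op
    raise NotImplementedError(f"decoder for {op.encoding} goes here")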

> Do you think we would come up with a semi-opaque array that could be
sliced?  Or that we would introduce the concept of an unsliceable array?

The String/BinaryView format, when sliced, doesn't necessarily stop
carrying the data buffers, so a semi-opaque format could be designed while
still allowing slicing. Things get trickier when concatenating arrays
(an operation that is also meant to discard unnecessary data when it
reallocates buffers) and when exporting arrays through the C Data Interface
without leaking data that is not necessary in a slice.
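
A quick way to see the StringView point with pyarrow (buffer counts and
inlining thresholds are implementation details, so treat this only as an
illustration):

import pyarrow as pa

arr = pa.array(["short", "a value longer than twelve bytes", "x"]).cast(
    pa.string_view())
sliced = arr.slice(2, 1)  # logically just ["x"]

# The slice still references the parent's variadic data buffers, so
# exporting it as-is (e.g. over the C Data Interface) can carry bytes the
# slice itself never needs; a compacting copy would be required to drop
# them.
print([b.size if b is not None else 0 for b in arr.buffers()])
print([b.size if b is not None else 0 for b in sliced.buffers()])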

--
Felipe




On Thu, Dec 11, 2025 at 10:33 AM Weston Pace <[email protected]> wrote:

> I think this is a very interesting idea.  This could potentially open up
> the door for things like adding compute kernels for these compressed
> representations to Arrow or Datafusion.  Though it isn't without some
> challenges.
>
> > It seems FSSTStringVector/Array could potentially be modelled
> > as an extension type
> > ...
> > This would however require a fixed dictionary, so might not
> > be desirable.
> > ...
> > ALPFloatingPointVector and bit-packed vectors/arrays are more challenging
> > to represent as extension types.
> > ...
> > Each batch of values has a different metadata parameter set.
>
> I think these are basically the same problem.  From what I've seen in
> implementations a format will typically introduce some kind of small batch
> concept (I think every 1024 values in the Fast Lanes paper IIRC).  So
> either we need individual record batches for each small batch (in which
> case the Arrow representation is more straightforward but batches are quite
> small) or we need some concept of a batched array in Arrow.  If we want
> individual record batches for each small batch that requires the batch
> sizes to be consistent (in number of rows) between columns and I don't know
> if that's always true.
>
> > One of the discussion items is to allow late materialization: to allow
> > keeping data in encoded format beyond the filter stage (for example in
> > Datafusion).
>
> > Vortex seems to show that it is possible to support advanced
> > encodings (like ALP, FSST, or others) by separating the logical type
> > from the physical encoding.
>
> Pierre brings up another challenge in achieving this goal, which may be
> more significant.  The compression and encoding techniques typically vary
> from page to page within Parquet (this is even more true in formats like
> fast lanes and vortex).  A column might use ALP for one page and then use
> PLAIN encoding for the next page.  This makes it difficult to represent a
> stream of data with the typical Arrow schema we have today.  I think we
> would need a "semantic schema" or "logical schema" which indicates the
> logical type but not the physical representation.  Still, that can be an
> orthogonal discussion to FSST and ALP representation.
>
> >  We could also experiment with Opaque vectors.
>
> This could be an interesting approach too.  I don't know if they could be
> entirely opaque though.  Arrow users typically expect to be able to perform
> operations like "slice" and "take" which require some knowledge of the
> underlying type.  Do you think we would come up with a semi-opaque array
> that could be sliced?  Or that we would introduce the concept of an
> unsliceable array?
>
>
> On Thu, Dec 11, 2025 at 5:27 AM Pierre Lacave <[email protected]> wrote:
>
> > Hi all,
> >
> > I am relatively new to this space, so I apologize if I am missing some
> > context or history here. I wanted to share some observations from what I
> > see happening with projects like Vortex.
> >
> > Vortex seems to show that it is possible to support advanced encodings
> > (like ALP, FSST, or others) by separating the logical type from the
> > physical encoding. If the consumer engine supports the advanced encoding,
> > it stays compressed and fast. If not, the data is "canonicalized" to
> > standard Arrow arrays at the edge.
> >
> > As Parquet adopts these novel encodings, the current Arrow approach forces
> > us to "densify" or decompress data immediately, even if the engine could
> > have operated on the encoded data.
> >
> > Is there a world where Arrow could offer some sort of negotiation
> > mechanism? The goal would be to guarantee the data can always be read as
> > standard "safe" physical types (paying a cost only at the boundary),
> > while allowing systems that understand the advanced encoding to let the
> > data flow through efficiently.
> >
> > This sounds like it keeps the safety of the interoperability - Arrow
> > making sure new encodings have a canonical representation - and it
> > leaves the onus of implementing the efficient flow to the consumer -
> > decoupling efficiency from interoperability.
> >
> > Thanks !
> >
> > Pierre
> >
> > On 2025/12/11 06:49:30 Micah Kornfield wrote:
> > > I think this is an interesting idea.  Julien, do you have a proposal
> > > for scope?  Is the intent to be 1:1 with any new encoding that is added
> > > to Parquet?  For instance would the intent be to also put cascading
> > > encodings in Arrow?
> > >
> > > We could also experiment with Opaque vectors.
> > >
> > >
> > > Did you mean this as a new type? I think this would be necessary for ALP.
> > >
> > > It seems FSSTStringVector/Array could potentially be modelled as an
> > > extension type (dictionary stored as part of the type metadata?) on top
> > > of a byte array. This would however require a fixed dictionary, so might
> > > not be desirable.
> > >
> > > ALPFloatingPointVector and bit-packed vectors/arrays are more
> > > challenging to represent as extension types.
> > >
> > > 1.  There is no natural alignment with any of the existing types (and
> > > the bit-packing width can effectively vary by batch).
> > > 2.  Each batch of values has a different metadata parameter set.
> > >
> > > So it seems there is no easy way out for the ALP encoding and we either
> > > need to pay the cost of adding a new type (which is not necessarily
> > > trivial) or we would have to do some work to literally make a new opaque
> > > "Custom" Type, which would have a buffer that is only interpretable
> > > based on its extension type.  An easy way of shoe-horning this in would
> > > be to add a ParquetScalar extension type, which simply contains the
> > > decompressed but encoded Parquet page with repetition and definition
> > > levels stripped out.
> > > The latter also has its obvious down-sides.
> > >
> > > Cheers,
> > > Micah
> > >
> > > [1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
> > > [2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> > >
> > > On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:
> > >
> > > > I forgot to mention that those encodings have the particularity of
> > > > allowing random access without decoding previous values.
> > > >
> > > > On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
> > > >
> > > > > Hello,
> > > > > Parquet is in the process of adopting new encodings [1] (currently
> > > > > in POC stage), specifically ALP [2] and FSST [3].
> > > > > One of the discussion items is to allow late materialization: to
> > > > > allow keeping data in encoded format beyond the filter stage (for
> > > > > example in Datafusion).
> > > > > There are several advantages to this:
> > > > > - For example, if I summarize FSST as a variation of dictionary
> > > > > encoding on substrings in the values, one can evaluate some
> > > > > operations on encoded values without decoding them, saving memory
> > > > > and CPU.
> > > > > - Similarly, simplifying for brevity, ALP converts floating point
> > > > > values to small integers that are then bitpacked.
> > > > > The Vortex project argues that keeping encoded values in in-memory
> > > > > vectors opens up opportunities for performance improvements [4]; a
> > > > > third-party blog argues it's a problem as well [5].
> > > > >
> > > > > So I wanted to start a discussion to suggest we might consider
> > > > > adding some additional vectors to support such encoded values, like
> > > > > an FSSTStringVector for example. This would not be too different
> > > > > from the dictionary encoding, or an ALPFloatingPointVector with a
> > > > > bit-packed scheme not too different from what we use for
> > > > > nullability.
> > > > > We could also experiment with Opaque vectors.
> > > > >
> > > > > For reference, similarly motivated improvements have been done in
> > > > > the past [6]
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > See:
> > > > > [1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> > > > > [2] https://github.com/apache/arrow/pull/48345
> > > > > [3] https://github.com/apache/arrow/pull/48232
> > > > > [4] https://docs.vortex.dev/#in-memory
> > > > > [5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> > > > > [6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
> > > > >
> > > >
> > >
> >
>
