Hi Pierre,
I'm somewhat expanding on Weston's and Felipe's points above, but wanted
to make it explicit.
There are really two problems being discussed.
1. Arrow's representation of the data (either in memory or on disk).
I think the main complication brought up here is that Arrow's data stream
model couples the physical and logical type, making it hard to switch to a
different encoding between batches of data (see the sketch after this list).
I think there are two main parts to resolving this:
a. Propose a modelling in the Flatbuffers schema (e.g.
https://github.com/apache/arrow/blob/main/format/Schema.fbs) and the C Data
Interface (https://arrow.apache.org/docs/format/CDataInterface.html) that
disentangles the two.
b. Update libraries to accommodate the decoupling, so data can be
exchanged with the new modelling.
2. How implementations carry this through without breaking existing
functionality around Arrow structures.
a. As Felipe pointed out, I think at least a few implementations of
operations don't scale well to adding multiple different encodings.
Rethinking the architecture here might be required.
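To make the coupling in point 1 concrete, here is a minimal pyarrow sketch of
the status quo (just an illustration, not a proposal): the stream schema pins
one physical layout per column, so a producer cannot hand the same logical
string column to a consumer dictionary-encoded in one batch and plain in the
next.

import pyarrow as pa

# The stream schema fixes a single physical layout per column up front...
plain = pa.schema([("s", pa.string())])

# ...so a batch carrying the same logical strings, but dictionary-encoded,
# cannot be written into a stream declared with the plain schema.
batch = pa.record_batch(
    [pa.array(["a", "b", "a"]).dictionary_encode()], names=["s"]
)

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, plain) as writer:
    try:
        writer.write_batch(batch)
    except pa.ArrowInvalid as e:
        print(e)  # schema mismatch: the encoding is baked into the schema

Decoupling the two would mean the schema declares only the logical type
("string") while each batch is free to pick an encoding the consumer has
agreed to accept.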
My take is that "1a" is probably the easy part. 1b and 2a are the hard
parts because there have been a lot of core assumptions baked into
libraries about the structure of Arrow array. If we can convince
ourselves that there is a tractable path forward for hard parts I would
guess it would go a long way to convincing skeptics. The easiest way of
proving or disproving something like this would be trying to prototype
something in at least one of the libraries to see what the complexity looks
like (e.g. in arrow-rs or arrow-cpp which have sizable compute
implementations).
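To give a feel for what such a prototype would have to grapple with, here is
a deliberately tiny, hypothetical sketch in pyarrow (not a proposal for
either library's API): a single logical "string" column that tolerates two
physical encodings, with an encoding-aware fast path and a densify fallback.

import pyarrow as pa
import pyarrow.compute as pc

# Hypothetical wrapper: one logical string column over several physical
# encodings. Kernels that understand an encoding use it directly; everything
# else falls back to canonicalizing into a plain string array.
class LogicalStringColumn:
    def __init__(self, arr: pa.Array):
        if not (pa.types.is_string(arr.type) or pa.types.is_dictionary(arr.type)):
            raise TypeError("expected a plain or dictionary-encoded string array")
        self.arr = arr

    def densify(self) -> pa.Array:
        # Fallback path: canonicalize to a plain string array.
        return self.arr.cast(pa.string())

    def unique(self) -> pa.Array:
        # Encoding-aware path: for dictionary data, operate on the (usually
        # much smaller) dictionary. Assumes all dictionary values are referenced.
        if pa.types.is_dictionary(self.arr.type):
            return pc.unique(self.arr.dictionary)
        return pc.unique(self.densify())

col = LogicalStringColumn(pa.array(["a", "b", "a"]).dictionary_encode())
print(col.unique())  # ["a", "b"], computed from the dictionary, not the expanded column

Multiply this by every kernel, every encoding, and every combination of input
encodings, and the scale of 1b/2a becomes apparent.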
In short, the spec changes themselves are probably not that large, but the
downstream implications of them are. As you and others have pointed out, I
think this issue is worth tackling for the long-term viability of the
project, but I'm concerned there might not be enough developer bandwidth to
make it happen.
Cheers,
Micah
On Thu, Dec 18, 2025 at 11:54 AM Pierre Lacave <[email protected]> wrote:
> I've been following this thread with interest and wanted to share a more
> specific concern about the overhead of data expansion, in the cases where I
> think it really matters for us.
>
>
> It seems that as Parquet adopts more sophisticated, non-data-dependent
> encodings like ALP and FSST, there might be a risk of Arrow becoming a
> bottleneck if the format requires immediate "densification" into standard
> arrays.
>
> I am particularly thinking about use cases like compaction or merging of
> multiple files. In these scenarios, it feels like we might be paying a
> significant CPU and memory tax to expand encoded Parquet data into a flat
> Arrow representation, even if the operation itself is essentially a
> pass-through or a reorganization and might not strictly require that
> expansion.
>
> I (think) I understand the concerns with adding more types and late
> materialisation concepts.
>
> However, I'm curious if there is a path for Arrow to evolve that would
> allow for this kind of efficiency while avoiding said complexity, perhaps by:
> * Potentially exploring a negotiation mechanism where the consumer can
> opt out of full "densification" if it understands the underlying encoding -
> leaving the onus of compute to the user.
> * Considering if "semi-opaque" or encoded vectors could be supported for
> operations that don't require full decoding, such as simple slices or
> merges.
>
>
> If the goal of Arrow is to remain the gold standard for interoperability,
> I wonder if it might eventually need to account for these "on-encoded data"
> patterns to keep pace with the efficiencies found in modern storage formats.
>
> I'd be very interested to hear if this is a direction the community feels
> is viable or if these optimizations are viewed as being better handled
> strictly within the compute engines themselves.
>
> I am (still) quite new to Arrow and still catching up on the nuances of
> the specification, so please excuse any naive assumptions in my reasoning.
>
> Best!
>
> On 2025/12/18 17:55:58 Weston Pace wrote:
> > > the real challenge is having compute kernels that are complete
> > > so they can be flexible enough to handle all the physical forms
> > > of a logical type
> >
> > Agreed. My experience has also been that even when compute systems claim
> > to solve this problem (e.g. DuckDB) the implementations are still rather
> > limited. Part of the problem is also, I think, that only a small fraction
> > of cases are going to be compute bound in a way that this sort of
> > optimization will help. Beyond major vendors (e.g. Databricks, Snowflake,
> > Meta, ...) the scale just isn't there to justify the cost of development
> > and maintenance.
> >
> > > The definition of any logical type can only be coupled to a compute
> > system.
> >
> > I guess I disagree here. We did this once already in Substrait. I think
> > there is value in an interoperable agreement of what constitutes a logical
> > type. Maybe it can live in Substrait then since there is also a growing
> > collection there of compute function definitions and interpretations.
> >
> > > Even saying that the "string" type should include REE-encoded and
> > dictionary-encoded strings is a stretch today.
> >
> > Why? My semi-formal internal definition has been "We say types T1 and T2
> > are the same logical type if there exists a bidirectional 1:1 mapping
> > function M from T1 to T2 such that given every function F and every value x
> > we have F(x) = M⁻¹(F(M(x)))". Under this definition dictionary and REE are
> > just encodings while uint8 and uint16 are different types (since they have
> > different results for functions like 100+200).
> >
> > > Things start to break when you start exporting arrays to more naive
> > compute systems that don't support these encodings yet.
> >
> > This is true regardless.
> >
> > > it will be unrealistic to expect more than one implementation of such
> > system.
> >
> > Agreed, implementing all the compute kernels is not an Arrow problem.
> > Though maybe those waters have been muddied in the past since Arrow
> > implementations like arrow-cpp and arrow-rs have historically come with a
> > collection of compute functions.
> >
> > > A semi-opaque format could be designed while allowing slicing
> >
> > Ah, by leaving the buffers untouched but setting the offset? That makes
> > sense.
> >
> > > Things get trickier when concatenating arrays ... and exporting arrays
> > ... without leaking data
> >
> > I think we can also add operations like filter and take to that list.
> >
> > On Tue, Dec 16, 2025 at 6:59 PM Felipe Oliveira Carvalho <
> > [email protected]> wrote:
> >
> > > Please don't interpret this as a harsh comment; I mean well and don't
> > > speak for the whole community.
> > >
> > > > I think we would need a "semantic schema" or "logical schema" which
> > > indicates the logical type but not the physical representation.
> > >
> > > This separation between logical and physical gets mentioned a lot, but
> even
> > > if the type-representation problem is solved, the real challenge is
> having
> > > compute kernels that are complete so they can be flexible enough to
> handle
> > > all the physical forms of a logical type like "string" or have a smart
> > > enough function dispatcher that performs casts on-the-fly as a
> non-ideal
> > > fallback (e.g. expanding FSST-encoded strings to a more basic and
> widely
> > > supported format when the compute kernel is more naive).
> > >
> > > That is what makes introducing these ideas to Arrow so tricky. Arrow
> being
> > > the data format that shines in interoperability can't have unbounded
> > > complexity. The definition of any logical type can only be coupled to a
> > > compute system. Datafusion could define all the encodings that form a
> > > logical type, but it's really hard to specify what a logical type is
> meant
> > > to be on every compute system using Arrow. Even saying that the
> "string"
> > > type should include REE-encoded and dictionary-encoded strings is a
> stretch
> > > today. Things start to break when you start exporting arrays to more
> naive
> > > compute systems that don't support these encodings yet. If you care
> less
> > > about multiple implementations of a format, interoperability, and can
> > > provide all compute functions in your closed system, then you can
> greatly
> > > expand the formats and encodings you support. But it will be
> unrealistic to
> > > expect more than one implementation of such system.
> > >
> > > > Arrow users typically expect to be able to perform operations like
> > > "slice" and "take" which require some knowledge of the underlying type.
> > >
> > > Exactly. There are many simplifying assumptions that can be made when
> one
> > > gets an Arrow RecordBatch and wants to do something with it directly
> (i.e.
> > > without using a library of compute kernels). It's already challenging
> > > enough to get people to stop converting columnar data to row-based
> arrays.
> > >
> > > My recommendation is that we start thinking about proposing formats and
> > > logical types to "compute systems" and not to the "Arrow data format".
> > > IMO "late materialization" doesn't make sense as an Arrow specification,
> > > unless it's a series of widely useful canonical extension types
> > > expressible through other storage types like binary or fixed-size-binary.
> > >
> > > A compute system (e.g. Datafusion) that intends to implement late
> > > materialization would have to expand operand representation a bit to
> take
> > > in non-materialized data handles in ways that aren't expressible in the
> > > open Arrow format.
> > >
> > > > Do you think we would come up with a semi-opaque array that could be
> > > sliced? Or that we would introduce the concept of an unsliceable
> array?
> > >
> > > The String/BinaryView format, when sliced, doesn't necessarily stop
> > > carrying the data buffers. A semi-opaque format could be designed while
> > > allowing slicing. Things get trickier when concatenating arrays (which is
> > > also an operation that is meant to discard unnecessary data when it
> > > reallocates buffers) and exporting arrays through the C Data Interface
> > > without leaking data that is not necessary in a slice.
> > >
> > > --
> > > Felipe
> > >
> > >
> > >
> > >
> > > On Thu, Dec 11, 2025 at 10:33 AM Weston Pace <[email protected]>
> > > wrote:
> > >
> > > > I think this is a very interesting idea. This could potentially
> open up
> > > > the door for things like adding compute kernels for these compressed
> > > > representations to Arrow or Datafusion. Though it isn't without some
> > > > challenges.
> > > >
> > > > > It seems FSSTStringVector/Array could potentially be modelled
> > > > > as an extension type
> > > > > ...
> > > > > This would however require a fixed dictionary, so might not
> > > > > be desirable.
> > > > > ...
> > > > > ALPFloatingPointVector and bit-packed vectors/arrays are more
> > > challenging
> > > > > to represent as extension types.
> > > > > ...
> > > > > Each batch of values has a different metadata parameter set.
> > > >
> > > > I think these are basically the same problem. From what I've seen in
> > > > implementations a format will typically introduce some kind of small
> > > batch
> > > > concept (I think every 1024 values in the Fast Lanes paper IIRC). So
> > > > either we need individual record batches for each small batch (in
> which
> > > > case the Arrow representation is more straightforward but batches are
> > > quite
> > > > small) or we need some concept of a batched array in Arrow. If we
> want
> > > > individual record batches for each small batch that requires the
> batch
> > > > sizes to be consistent (in number of rows) between columns and I
> don't
> > > know
> > > > if that's always true.
> > > >
> > > > > One of the discussion items is to allow late materialization: to
> allow
> > > > > keeping data in encoded format beyond the filter stage (for
> example in
> > > > > Datafusion).
> > > >
> > > > > Vortex seems to show that it is possible to support advanced
> > > > > encodings (like ALP, FSST, or others) by separating the logical
> type
> > > > > from the physical encoding.
> > > >
> > > > Pierre brings up another challenge in achieving this goal, which may
> be
> > > > more significant. The compression and encoding techniques typically
> vary
> > > > from page to page within Parquet (this is even more true in formats
> like
> > > > fast lanes and vortex). A column might use ALP for one page and
> then use
> > > > PLAIN encoding for the next page. This makes it difficult to
> represent a
> > > > stream of data with the typical Arrow schema we have today. I think
> we
> > > > would need a "semantic schema" or "logical schema" which indicates
> the
> > > > logical type but not the physical representation. Still, that can
> be an
> > > > orthogonal discussion to FSST and ALP representation.
> > > >
> > > > > We could also experiment with Opaque vectors.
> > > >
> > > > This could be an interesting approach too. I don't know if they
> could be
> > > > entirely opaque though. Arrow users typically expect to be able to
> > > perform
> > > > operations like "slice" and "take" which require some knowledge of
> the
> > > > underlying type. Do you think we would come up with a semi-opaque
> array
> > > > that could be sliced? Or that we would introduce the concept of an
> > > > unsliceable array?
> > > >
> > > >
> > > > On Thu, Dec 11, 2025 at 5:27 AM Pierre Lacave <[email protected]>
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I am relatively new to this space, so I apologize if I am missing
> some
> > > > > context or history here. I wanted to share some observations from
> what
> > > I
> > > > > see happening with projects like Vortex.
> > > > >
> > > > > Vortex seems to show that it is possible to support advanced
> encodings
> > > > > (like ALP, FSST, or others) by separating the logical type from the
> > > > > physical encoding. If the consumer engine supports the advanced
> > > encoding,
> > > > > it stays compressed and fast. If not, the data is "canonicalized"
> to
> > > > > standard Arrow arrays at the edge.
> > > > >
> > > > > As Parquet adopts these novel encodings, the current Arrow approach
> > > > forces
> > > > > us to "densify" or decompress data immediately, even if the engine
> > > could
> > > > > have operated on the encoded data.
> > > > >
> > > > > Is there a world where Arrow could offer some sort of negotiation
> > > > > mechanism? The goal would be to guarantee the data can always be
> read
> > > as
> > > > > standard "safe" physical types (paying a cost only at the
> boundary),
> > > > while
> > > > > allowing systems that understand the advanced encoding to let the
> data
> > > > flow
> > > > > through efficiently.
> > > > >
> > > > > This sounds like it keeps the safety of the interoperability - Arrow
> > > > > making sure new encodings have a canonical representation - and it
> > > > > leaves the onus of implementing the efficient flow to the consumer -
> > > > > decoupling efficiency from interoperability.
> > > > >
> > > > > Thanks !
> > > > >
> > > > > Pierre
> > > > >
> > > > > On 2025/12/11 06:49:30 Micah Kornfield wrote:
> > > > > > I think this is an interesting idea. Julien, do you have a
> proposal
> > > > for
> > > > > > scope? Is the intent to be 1:1 with any new encoding that is
> added
> > > to
> > > > > > Parquet? For instance would the intent be to also put cascading
> > > > > encodings
> > > > > > in Arrow?
> > > > > >
> > > > > > We could also experiment with Opaque vectors.
> > > > > >
> > > > > >
> > > > > > Did you mean this as a new type? I think this would be necessary
> for
> > > > ALP.
> > > > > >
> > > > > > It seems FSSTStringVector/Array could potentially be modelled as
> an
> > > > > > extension type (dictionary stored as part of the type metadata?)
> on
> > > top
> > > > > of
> > > > > > a byte array. This would however require a fixed dictionary, so
> might
> > > > not
> > > > > > be desirable.
> > > > > >
> > > > > > ALPFloatingPointVector and bit-packed vectors/arrays are more
> > > > challenging
> > > > > > to represent as extension types.
> > > > > >
> > > > > > 1. There is no natural alignment with any of the existing types
> (and
> > > > the
> > > > > > bit-packing width can effectively vary by batch).
> > > > > > 2. Each batch of values has a different metadata parameter set.
> > > > > >
> > > > > > So it seems there is no easy way out for the ALP encoding and we
> > > either
> > > > > > need to pay the cost of adding a new type (which is not
> necessarily
> > > > > > trivial) or we would have to do some work to literally make a new
> > > > opaque
> > > > > > "Custom" Type, which would have a buffer that is only
> interpretable
> > > > based
> > > > > > on its extension type. An easy way of shoe-horning this in
> would be
> > > to
> > > > > add
> > > > > > a ParquetScalar extension type, which simply contains the
> > > decompressed
> > > > > but
> > > > > > encoded Parquet page with repetition and definition levels
> stripped
> > > > out.
> > > > > > The latter also has its obvious down-sides.
> > > > > >
> > > > > > Cheers,
> > > > > > Micah
> > > > > >
> > > > > > [1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
> > > > > > [2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
> > > > > >
> > > > > > On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:
> > > > > >
> > > > > > > I forgot to mention that those encodings have the
> particularity of
> > > > > allowing
> > > > > > > random access without decoding previous values.
> > > > > > >
> > > > > > > On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > > Parquet is in the process of adopting new encodings [1]
> > > (Currently
> > > > > in POC
> > > > > > > > stage), specifically ALP [2] and FSST [3].
> > > > > > > > One of the discussion items is to allow late
> materialization: to
> > > > > allow
> > > > > > > > keeping data in encoded format beyond the filter stage (for
> > > example
> > > > > in
> > > > > > > > Datafusion).
> > > > > > > > There are several advantages to this:
> > > > > > > > - For example, if I summarize FSST as a variation of
> dictionary
> > > > > encoding
> > > > > > > > on substrings in the values, one can evaluate some
> operations on
> > > > > encoded
> > > > > > > > values without decoding them, saving memory and CPU.
> > > > > > > > - Similarly, simplifying for brevity, ALP converts floating
> point
> > > > > values
> > > > > > > > to small integers that are then bitpacked.
> > > > > > > > The Vortex project argues that keeping encoded values in
> > > in-memory
> > > > > > > vectors
> > > > > > > > opens up opportunities for performance improvements. [4] a
> third
> > > > > party
> > > > > > > blog
> > > > > > > > argues it's a problem as well [5]
> > > > > > > >
> > > > > > > > So I wanted to start a discussion to suggest, we might
> consider
> > > > > adding
> > > > > > > > some additional vectors to support such encoded Values like
> an
> > > > > > > > FSSTStringVector for example. This would not be too different
> > > from
> > > > > the
> > > > > > > > dictionary encoding, or an ALPFloatingPointVector with a bit
> > > packed
> > > > > > > scheme
> > > > > > > > not too different from what we use for nullability.
> > > > > > > > We could also experiment with Opaque vectors.
> > > > > > > >
> > > > > > > > For reference, similarly motivated improvements have been
> done in
> > > > the
> > > > > > > past
> > > > > > > > [6]
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > See:
> > > > > > > > [1]
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> > > > > > > > [2] https://github.com/apache/arrow/pull/48345
> > > > > > > > [3] https://github.com/apache/arrow/pull/48232
> > > > > > > > [4] https://docs.vortex.dev/#in-memory
> > > > > > > > [5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> > > > > > > > [6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>