Sorry this is a bit late, but there is also a more recent blog about late materialization[1].
Vladimir, I would love to hear how your project went / whether you were able to apply Raphael's suggestions. I think it would be really valuable to add examples for this use case, so I have filed a ticket to track it here[2][3].

Andrew

[1]: https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/
[2]: https://github.com/apache/arrow-rs/issues/9095
[3]: https://github.com/apache/arrow-rs/issues/9096

On Tue, Dec 30, 2025 at 2:00 PM Raphael Taylor-Davies <[email protected]> wrote:

> Hi Vladimir,
>
> What you are requesting is already supported in parquet-rs. In particular, if you request a UTF8 or Binary DictionaryArray for the column, it will decode the column preserving the dictionary encoding. You can override the embedded arrow schema, if any, using ArrowReaderOptions::with_schema [1]. Provided you don't read RecordBatches across row groups, and therefore across dictionaries, which the async reader doesn't, this should never materialize the dictionary. FWIW the ViewArray decoders will also preserve the dictionary encoding; however, the dictionary-encoded nature is less explicit in the resulting arrays.
>
> As for using integer comparisons to optimise dictionary filtering, you should be able to construct an ArrowPredicate that computes the filter for the dictionary values, caches this for future use, e.g. using ptr_eq to detect when the dictionary changes, and then filters based on dictionary keys.
>
> It's a bit dated, but this post from 2022 outlines some of the techniques supported by parquet-rs [2], including its support for late materialization.
>
> Kind Regards,
>
> Raphael
>
> [2]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
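To make Raphael's first suggestion concrete, a minimal sketch of such a dictionary-preserving read might look like the following (Rust, arrow + parquet crates). The file name, the assumption that the file's only column is a low-cardinality string named "tag", and the choice of the synchronous reader are all illustrative, not requirements:

    use std::fs::File;
    use std::sync::Arc;

    use arrow::datatypes::{DataType, Field, Schema};
    use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Hypothetical file whose only column is a low-cardinality string "tag".
        let file = File::open("data.parquet")?;

        // Ask for Dictionary<Int32, Utf8> instead of Utf8/Utf8View so the reader
        // keeps the Parquet dictionary encoding rather than materializing strings.
        let schema = Arc::new(Schema::new(vec![Field::new(
            "tag",
            DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
            true,
        )]));

        let options = ArrowReaderOptions::new().with_schema(schema);
        let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;

        for batch in reader {
            let batch = batch?;
            // Each batch's column is a DictionaryArray whose keys reference the
            // per-row-group dictionary.
            println!("{} rows, type {:?}", batch.num_rows(), batch.column(0).data_type());
        }
        Ok(())
    }

Per Raphael's caveat, this only avoids materializing the dictionary as long as a RecordBatch does not span row groups; the async reader guarantees that by construction.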
E.g., "x = some_value" > translates to > > "x_encoded = index". > > > > These are just very rough ideas, but we need a solution for this pretty > > fast because CPU overhead blocks our project, and we cannot change the > > dataset. So most likely we will end up with some sort of fork. But if the > > community gives us some directions what else we can try, we can share our > > experience and results in several weeks, and hopefully this would help > > community build a better solution for this problem. > > > > Regards, > > Vladimir > > > > > > On Fri, Dec 19, 2025 at 4:38 AM Micah Kornfield <[email protected]> > > wrote: > > > >> Hi Pierre, > >> > >> I'm somewhat expanding on Weston's and Felipe's points above, but > wanted > >> to make it explicit. > >> > >> There are really two problems being discussed. > >> > >> 1. Arrow's representation of the data (either in memory or on disk). > >> > >> I think the main complication brought up here is Arrow data stream model > >> couples physical and logical type. Making it hard to switch to > different > >> encodings between different batches of data. I think there are two main > >> parts to resolving this: > >> a. Suggest a modelling in the flatbuffer model (e.g. > >> https://github.com/apache/arrow/blob/main/format/Schema.fbs) and c-ffi > ( > >> https://arrow.apache.org/docs/format/CDataInterface.html) to > disentangle > >> the two of them. > >> b. Update libraries to accommodate the decoupling, so data can be > >> exchanged with the new modelling. > >> > >> 2. How implementations carry this through and don't end up breaking > >> existing functionality around Arrow structures > >> > >> a. As Felipe pointed out I think at least a few implementations of > >> operations don't scale well to adding multiple different encodings. > >> Rethinking the architecture here might be required. > >> > >> > >> My take is that "1a" is probably the easy part. 1b and 2a are the hard > >> parts because there have been a lot of core assumptions baked into > >> libraries about the structure of Arrow array. If we can convince > >> ourselves that there is a tractable path forward for hard parts I would > >> guess it would go a long way to convincing skeptics. The easiest way of > >> proving or disproving something like this would be trying to prototype > >> something in at least one of the libraries to see what the complexity > looks > >> like (e.g. in arrow-rs or arrow-cpp which have sizable compute > >> implementations). > >> > >> In short, the spec changes probably are possibly not super large, but > the > >> downstream implications of them are. As you and others have pointed > out, I > >> think it is worth trying to tackle this issue for the long term > viability > >> of the project, but I'm concerned there might not be enough developer > >> bandwidth to make it happen. > >> > >> Cheers, > >> Micah > >> > >> > >> > >> > >> On Thu, Dec 18, 2025 at 11:54 AM Pierre Lacave <[email protected]> > wrote: > >> > >>> I've been following this thread with interest and wanted to share a > more > >>> specific concern regarding the potential overhead of data expansion > >> where I > >>> think this really matters in our case. > >>> > >>> > >>> It seems that as Parquet adopts more sophisticated, non-data-dependent > >>> encodings like ALP and FSST, there might be a risk of Arrow becoming a > >>> bottleneck if the format requires immediate "densification" into > standard > >>> arrays. 
> > On Fri, Dec 19, 2025 at 4:38 AM Micah Kornfield <[email protected]> wrote:
> >
> >> Hi Pierre,
> >>
> >> I'm somewhat expanding on Weston's and Felipe's points above, but wanted to make it explicit.
> >>
> >> There are really two problems being discussed.
> >>
> >> 1. Arrow's representation of the data (either in memory or on disk).
> >>
> >> I think the main complication brought up here is that Arrow's data stream model couples physical and logical type, making it hard to switch to different encodings between different batches of data. I think there are two main parts to resolving this:
> >> a. Suggest a modelling in the flatbuffer model (e.g. https://github.com/apache/arrow/blob/main/format/Schema.fbs) and c-ffi (https://arrow.apache.org/docs/format/CDataInterface.html) to disentangle the two of them.
> >> b. Update libraries to accommodate the decoupling, so data can be exchanged with the new modelling.
> >>
> >> 2. How implementations carry this through and don't end up breaking existing functionality around Arrow structures.
> >>
> >> a. As Felipe pointed out, I think at least a few implementations of operations don't scale well to adding multiple different encodings. Rethinking the architecture here might be required.
> >>
> >> My take is that "1a" is probably the easy part. 1b and 2a are the hard parts, because there have been a lot of core assumptions baked into libraries about the structure of an Arrow array. If we can convince ourselves that there is a tractable path forward for the hard parts, I would guess it would go a long way to convincing skeptics. The easiest way of proving or disproving something like this would be trying to prototype something in at least one of the libraries to see what the complexity looks like (e.g. in arrow-rs or arrow-cpp, which have sizable compute implementations).
> >>
> >> In short, the spec changes are probably not super large, but the downstream implications of them are. As you and others have pointed out, I think it is worth trying to tackle this issue for the long-term viability of the project, but I'm concerned there might not be enough developer bandwidth to make it happen.
> >>
> >> Cheers,
> >> Micah
> >>
> >> On Thu, Dec 18, 2025 at 11:54 AM Pierre Lacave <[email protected]> wrote:
> >>
> >>> I've been following this thread with interest and wanted to share a more specific concern regarding the potential overhead of data expansion, where I think this really matters in our case.
> >>>
> >>> It seems that as Parquet adopts more sophisticated, non-data-dependent encodings like ALP and FSST, there might be a risk of Arrow becoming a bottleneck if the format requires immediate "densification" into standard arrays.
> >>>
> >>> I am particularly thinking about use cases like compaction or merging of multiple files. In these scenarios, it feels like we might be paying a significant CPU and memory tax to expand encoded Parquet data into a flat Arrow representation, even if the operation itself is essentially a pass-through or a reorganization and might not strictly require that expansion.
> >>>
> >>> I (think) I understand the concerns with adding more types and late materialisation concepts.
> >>>
> >>> However, I'm curious if there is a path for Arrow to evolve that would allow this kind of efficiency while avoiding said complexity, perhaps by:
> >>> * Potentially exploring a negotiation mechanism where the consumer can opt out of full "densification" if it understands the underlying encoding - the onus of compute left to the user.
> >>> * Considering if "semi-opaque" or encoded vectors could be supported for operations that don't require full decoding, such as simple slices or merges.
> >>>
> >>> If the goal of Arrow is to remain the gold standard for interoperability, I wonder if it might eventually need to account for these "on-encoded data" patterns to keep pace with the efficiencies found in modern storage formats.
> >>>
> >>> I'd be very interested to hear if this is a direction the community feels is viable, or if these optimizations are viewed as being better handled strictly within the compute engines themselves.
> >>>
> >>> I am (still) quite new to Arrow and still catching up on the nuances of the specification, so please excuse any naive assumptions in my reasoning.
> >>>
> >>> Best!
> >>>
> >>> On 2025/12/18 17:55:58 Weston Pace wrote:
> >>>>> the real challenge is having compute kernels that are complete so they can be flexible enough to handle all the physical forms of a logical type
> >>>>
> >>>> Agreed. My experience has also been that even when compute systems claim to solve this problem (e.g. DuckDB), the implementations are still rather limited. Part of the problem is also, I think, that only a small fraction of cases are going to be compute bound in a way that this sort of optimization will help. Beyond major vendors (e.g. Databricks, Snowflake, Meta, ...) the scale just isn't there to justify the cost of development and maintenance.
> >>>>
> >>>>> The definition of any logical type can only be coupled to a compute system.
> >>>>
> >>>> I guess I disagree here. We did this once already in Substrait. I think there is value in an interoperable agreement of what constitutes a logical type. Maybe it can live in Substrait then, since there is also a growing collection there of compute function definitions and interpretations.
> >>>>
> >>>>> Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today.
> >>>>
> >>>> Why? My semi-formal internal definition has been: "We say types T1 and T2 are the same logical type if there exists a bidirectional 1:1 mapping function M from T1 to T2 such that given every function F and every value x we have F(x) = M(F(M(x)))". Under this definition dictionary and REE are just encodings, while uint8 and uint16 are different types (since they have different results for functions like 100+200).
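A toy, non-Arrow illustration of that definition, with all names invented here: decoding a dictionary commutes with a function like upper(), so dictionary-encoded and plain strings behave as one logical type, while u8 and u16 do not:

    // Plain strings vs. dictionary-encoded strings (values + keys).
    fn upper_plain(v: &[String]) -> Vec<String> {
        v.iter().map(|s| s.to_uppercase()).collect()
    }

    fn upper_dict(values: &[String], keys: &[usize]) -> (Vec<String>, Vec<usize>) {
        (upper_plain(values), keys.to_vec()) // operate on the dictionary only
    }

    fn decode(values: &[String], keys: &[usize]) -> Vec<String> {
        keys.iter().map(|&k| values[k].clone()).collect()
    }

    fn main() {
        let values = vec!["ok".to_string(), "err".to_string()];
        let keys = vec![0, 1, 0, 0];

        // F(M(x)) == M(F(x)): same result whether we decode before or after upper().
        let (v2, k2) = upper_dict(&values, &keys);
        assert_eq!(upper_plain(&decode(&values, &keys)), decode(&v2, &k2));

        // u8 and u16 are different logical types under this definition:
        assert_eq!(100u8.wrapping_add(200), 44); // wraps
        assert_eq!(100u16 + 200, 300);           // doesn't
    }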
> >>>>> Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet.
> >>>>
> >>>> This is true regardless.
> >>>>
> >>>>> it will be unrealistic to expect more than one implementation of such system.
> >>>>
> >>>> Agreed, implementing all the compute kernels is not an Arrow problem. Though maybe those waters have been muddied in the past, since Arrow implementations like arrow-cpp and arrow-rs have historically come with a collection of compute functions.
> >>>>
> >>>>> A semi-opaque format could be designed while allowing slicing
> >>>>
> >>>> Ah, by leaving the buffers untouched but setting the offset? That makes sense.
> >>>>
> >>>>> Things get trickier when concatenating arrays ... and exporting arrays ... without leaking data
> >>>>
> >>>> I think we can also add operations like filter and take to that list.
> >>>>
> >>>> On Tue, Dec 16, 2025 at 6:59 PM Felipe Oliveira Carvalho <[email protected]> wrote:
> >>>>
> >>>>> Please don't interpret this as a harsh comment, I mean well and don't speak for the whole community.
> >>>>>
> >>>>>> I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation.
> >>>>>
> >>>>> This separation between logical and physical gets mentioned a lot, but even if the type-representation problem is solved, the real challenge is having compute kernels that are complete, so they can be flexible enough to handle all the physical forms of a logical type like "string", or a smart enough function dispatcher that performs casts on the fly as a non-ideal fallback (e.g. expanding FSST-encoded strings to a more basic and widely supported format when the compute kernel is more naive).
> >>>>>
> >>>>> That is what makes introducing these ideas to Arrow so tricky. Arrow, being the data format that shines in interoperability, can't have unbounded complexity. The definition of any logical type can only be coupled to a compute system. Datafusion could define all the encodings that form a logical type, but it's really hard to specify what a logical type is meant to be on every compute system using Arrow. Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today. Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet. If you care less about multiple implementations of a format and interoperability, and can provide all compute functions in your closed system, then you can greatly expand the formats and encodings you support. But it will be unrealistic to expect more than one implementation of such a system.
> >>>>>
> >>>>>> Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type.
> >>>>>
> >>>>> Exactly. There are many simplifying assumptions that can be made when one gets an Arrow RecordBatch and wants to do something with it directly (i.e. without using a library of compute kernels).
> >>>>> It's already challenging enough to get people to stop converting columnar data to row-based arrays.
> >>>>>
> >>>>> My recommendation is that we start thinking about proposing formats and logical types to "compute systems" and not to the "Arrow data format". IMO "late materialization" doesn't make sense as an Arrow specification, unless it's a series of widely useful canonical extension types expressible through other storage types like binary or fixed-size-binary.
> >>>>>
> >>>>> A compute system (e.g. Datafusion) that intends to implement late materialization would have to expand operand representation a bit to take in non-materialized data handles in ways that aren't expressible in the open Arrow format.
> >>>>>
> >>>>>> Do you think we would come up with a semi-opaque array that could be sliced? Or that we would introduce the concept of an unsliceable array?
> >>>>>
> >>>>> The String/BinaryView format, when sliced, doesn't necessarily stop carrying the data buffers. A semi-opaque format could be designed while allowing slicing. Things get trickier when concatenating arrays (which is also an operation that is meant to discard unnecessary data when it reallocates buffers) and exporting arrays through the C Data Interface without leaking data that is not necessary in a slice.
> >>>>>
> >>>>> --
> >>>>> Felipe
> >>>>>
> >>>>> On Thu, Dec 11, 2025 at 10:33 AM Weston Pace <[email protected]> wrote:
> >>>>>
> >>>>>> I think this is a very interesting idea. This could potentially open up the door for things like adding compute kernels for these compressed representations to Arrow or Datafusion. Though it isn't without some challenges.
> >>>>>>
> >>>>>>> It seems FSSTStringVector/Array could potentially be modelled as an extension type
> >>>>>>> ...
> >>>>>>> This would however require a fixed dictionary, so might not be desirable.
> >>>>>>> ...
> >>>>>>> ALPFloatingPointVector and bit-packed vectors/arrays are more challenging to represent as extension types.
> >>>>>>> ...
> >>>>>>> Each batch of values has a different metadata parameter set.
> >>>>>>
> >>>>>> I think these are basically the same problem. From what I've seen in implementations, a format will typically introduce some kind of small-batch concept (I think every 1024 values in the FastLanes paper, IIRC). So either we need individual record batches for each small batch (in which case the Arrow representation is more straightforward but batches are quite small) or we need some concept of a batched array in Arrow. If we want individual record batches for each small batch, that requires the batch sizes to be consistent (in number of rows) between columns, and I don't know if that's always true.
> >>>>>>
> >>>>>>> One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
> >>>>>>> Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding.
> >>>>>> Pierre brings up another challenge in achieving this goal, which may be more significant. The compression and encoding techniques typically vary from page to page within Parquet (this is even more true in formats like FastLanes and Vortex). A column might use ALP for one page and then use PLAIN encoding for the next page. This makes it difficult to represent a stream of data with the typical Arrow schema we have today. I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation. Still, that can be an orthogonal discussion to FSST and ALP representation.
> >>>>>>
> >>>>>>> We could also experiment with Opaque vectors.
> >>>>>>
> >>>>>> This could be an interesting approach too. I don't know if they could be entirely opaque though. Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type. Do you think we would come up with a semi-opaque array that could be sliced? Or that we would introduce the concept of an unsliceable array?
> >>>>>>
> >>>>>> On Thu, Dec 11, 2025 at 5:27 AM Pierre Lacave <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> I am relatively new to this space, so I apologize if I am missing some context or history here. I wanted to share some observations from what I see happening with projects like Vortex.
> >>>>>>>
> >>>>>>> Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding. If the consumer engine supports the advanced encoding, it stays compressed and fast. If not, the data is "canonicalized" to standard Arrow arrays at the edge.
> >>>>>>>
> >>>>>>> As Parquet adopts these novel encodings, the current Arrow approach forces us to "densify" or decompress data immediately, even if the engine could have operated on the encoded data.
> >>>>>>>
> >>>>>>> Is there a world where Arrow could offer some sort of negotiation mechanism? The goal would be to guarantee the data can always be read as standard "safe" physical types (paying a cost only at the boundary), while allowing systems that understand the advanced encoding to let the data flow through efficiently.
> >>>>>>>
> >>>>>>> This sounds like it keeps the safety of the interoperability - Arrow making sure new encodings have a canonical representation - and it leaves the onus of implementing the efficient flow to the consumer - decoupling efficiency from interoperability.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Pierre
> >>>>>>>
> >>>>>>> On 2025/12/11 06:49:30 Micah Kornfield wrote:
> >>>>>>>> I think this is an interesting idea. Julien, do you have a proposal for scope? Is the intent to be 1:1 with any new encoding that is added to Parquet? For instance, would the intent be to also put cascading encodings in Arrow?
> >>>>>>>> We could also experiment with Opaque vectors.
> >>>>>>>>
> >>>>>>>> Did you mean this as a new type? I think this would be necessary for ALP. It seems FSSTStringVector/Array could potentially be modelled as an extension type (dictionary stored as part of the type metadata?) on top of a byte array. This would however require a fixed dictionary, so might not be desirable.
> >>>>>>>>
> >>>>>>>> ALPFloatingPointVector and bit-packed vectors/arrays are more challenging to represent as extension types:
> >>>>>>>>
> >>>>>>>> 1. There is no natural alignment with any of the existing types (and the bit-packing width can effectively vary by batch).
> >>>>>>>> 2. Each batch of values has a different metadata parameter set.
> >>>>>>>>
> >>>>>>>> So it seems there is no easy way out for the ALP encoding, and we either need to pay the cost of adding a new type (which is not necessarily trivial) or we would have to do some work to literally make a new opaque "Custom" type, which would have a buffer that is only interpretable based on its extension type. An easy way of shoe-horning this in would be to add a ParquetScalar extension type, which simply contains the decompressed but encoded Parquet page with repetition and definition levels stripped out. The latter also has its obvious downsides.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Micah
> >>>>>>>>
> >>>>>>>> [1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
> >>>>>>>> [2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
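For illustration only, the extension-type framing Micah sketches could be expressed with the standard ARROW:extension:* field-metadata keys over Binary storage; the "example.fsst" name and the idea of carrying the (fixed) symbol table as an already-serialized string in the type metadata are hypothetical, not an existing canonical type:

    use std::collections::HashMap;

    use arrow::datatypes::{DataType, Field};

    // Describe a hypothetical FSST extension type: Binary storage plus the
    // symbol table stashed in the extension metadata. `symbol_table` is
    // assumed to be serialized already (e.g. base64); the layout is invented.
    fn fsst_field(name: &str, symbol_table: &str) -> Field {
        let metadata = HashMap::from([
            ("ARROW:extension:name".to_string(), "example.fsst".to_string()),
            ("ARROW:extension:metadata".to_string(), symbol_table.to_string()),
        ]);
        Field::new(name, DataType::Binary, true).with_metadata(metadata)
    }

As Micah notes, this only works if the symbol table is fixed for the whole column, which is exactly what makes ALP and bit-packed data harder to express this way.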
> >>>>>>>> On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> I forgot to mention that those encodings have the particularity of allowing random access without decoding previous values.
> >>>>>>>>>
> >>>>>>>>> On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello,
> >>>>>>>>>> Parquet is in the process of adopting new encodings [1] (currently in POC stage), specifically ALP [2] and FSST [3].
> >>>>>>>>>> One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
> >>>>>>>>>> There are several advantages to this:
> >>>>>>>>>> - For example, if I summarize FSST as a variation of dictionary encoding on substrings in the values, one can evaluate some operations on encoded values without decoding them, saving memory and CPU.
> >>>>>>>>>> - Similarly, simplifying for brevity, ALP converts floating point values to small integers that are then bitpacked.
> >>>>>>>>>> The Vortex project argues that keeping encoded values in in-memory vectors opens up opportunities for performance improvements [4]; a third-party blog argues it's a problem as well [5].
> >>>>>>>>>>
> >>>>>>>>>> So I wanted to start a discussion to suggest that we might consider adding some additional vectors to support such encoded values, like an FSSTStringVector, for example. This would not be too different from the dictionary encoding, or an ALPFloatingPointVector with a bit-packed scheme not too different from what we use for nullability.
> >>>>>>>>>> We could also experiment with Opaque vectors.
> >>>>>>>>>>
> >>>>>>>>>> For reference, similarly motivated improvements have been done in the past [6].
> >>>>>>>>>>
> >>>>>>>>>> Thoughts?
> >>>>>>>>>>
> >>>>>>>>>> See:
> >>>>>>>>>> [1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> >>>>>>>>>> [2] https://github.com/apache/arrow/pull/48345
> >>>>>>>>>> [3] https://github.com/apache/arrow/pull/48232
> >>>>>>>>>> [4] https://docs.vortex.dev/#in-memory
> >>>>>>>>>> [5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> >>>>>>>>>> [6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
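For readers unfamiliar with ALP, a grossly simplified sketch of the transformation Julien summarizes: pick a decimal exponent, store round(v * 10^e) as small integers (which are then bit-packed), and decode by dividing back. Real ALP chooses the exponent per vector, tracks values that don't round-trip as exceptions, and applies frame-of-reference plus bit-packing, none of which is shown here:

    fn alp_encode(values: &[f64], e: i32) -> Vec<i64> {
        values.iter().map(|v| (v * 10f64.powi(e)).round() as i64).collect()
    }

    fn alp_decode(encoded: &[i64], e: i32) -> Vec<f64> {
        encoded.iter().map(|&i| i as f64 / 10f64.powi(e)).collect()
    }

    fn main() {
        let prices = [1.23, 4.56, 7.89];
        let encoded = alp_encode(&prices, 2); // [123, 456, 789]: small, packable integers
        assert_eq!(alp_decode(&encoded, 2), prices);
    }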
