Sorry this is a bit late, but there is also a more recent blog about late materialization[1].
Vladimir, I would love to hear how your project went / whether you were able to apply Raphael's suggestions. I think it would be really valuable to add examples for this use case, so I have filed a ticket to track it here[2][3].

Andrew

[1]: https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/
[2]: https://github.com/apache/arrow-rs/issues/9095
[3]: https://github.com/apache/arrow-rs/issues/9096

On Tue, Dec 30, 2025 at 2:00 PM Raphael Taylor-Davies <[email protected]> wrote:

> Hi Vladimir,
>
> What you are requesting is already supported in parquet-rs. In particular, if you request a UTF8 or Binary DictionaryArray for the column, it will decode the column preserving the dictionary encoding. You can override the embedded arrow schema, if any, using ArrowReaderOptions::with_schema [1]. Provided you don't read RecordBatches across row groups, and therefore across dictionaries, which the async reader doesn't, this should never materialize the dictionary. FWIW the ViewArray decoders will also preserve the dictionary encoding; however, the dictionary-encoded nature is less explicit in the resulting arrays.
>
> As for using integer comparisons to optimise dictionary filtering, you should be able to construct an ArrowPredicate that computes the filter for the dictionary values, caches this for future use, e.g. using ptr_eq to detect when the dictionary changes, and then filters based on dictionary keys.
>
> It's a bit dated, but this post from 2022 outlines some of the techniques supported by parquet-rs [2], including its support for late materialization.
>
> Kind Regards,
>
> Raphael
>
> [2]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
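To make Raphael's first suggestion concrete, a minimal sketch of such a dictionary-preserving read might look like the following (Rust, arrow + parquet crates). The file name, the assumption that the file's only column is a low-cardinality string named "tag", and the choice of the synchronous reader are all illustrative, not requirements:

    use std::fs::File;
    use std::sync::Arc;

    use arrow::datatypes::{DataType, Field, Schema};
    use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Hypothetical file whose only column is a low-cardinality string "tag".
        let file = File::open("data.parquet")?;

        // Ask for Dictionary<Int32, Utf8> instead of Utf8/Utf8View so the reader
        // keeps the Parquet dictionary encoding rather than materializing strings.
        let schema = Arc::new(Schema::new(vec![Field::new(
            "tag",
            DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
            true,
        )]));

        let options = ArrowReaderOptions::new().with_schema(schema);
        let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;

        for batch in reader {
            let batch = batch?;
            // Each batch's column is a DictionaryArray whose keys reference the
            // per-row-group dictionary.
            println!("{} rows, type {:?}", batch.num_rows(), batch.column(0).data_type());
        }
        Ok(())
    }

Per Raphael's caveat, this only avoids materializing the dictionary as long as a RecordBatch does not span row groups; the async reader guarantees that by construction.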
E.g., "x = some_value" > translates to > > "x_encoded = index". > > > > These are just very rough ideas, but we need a solution for this pretty > > fast because CPU overhead blocks our project, and we cannot change the > > dataset. So most likely we will end up with some sort of fork. But if the > > community gives us some directions what else we can try, we can share our > > experience and results in several weeks, and hopefully this would help > > community build a better solution for this problem. > > > > Regards, > > Vladimir > > > > > > On Fri, Dec 19, 2025 at 4:38 AM Micah Kornfield <[email protected]> > > wrote: > > > >> Hi Pierre, > >> > >> I'm somewhat expanding on Weston's and Felipe's points above, but > wanted > >> to make it explicit. > >> > >> There are really two problems being discussed. > >> > >> 1. Arrow's representation of the data (either in memory or on disk). > >> > >> I think the main complication brought up here is Arrow data stream model > >> couples physical and logical type. Making it hard to switch to > different > >> encodings between different batches of data. I think there are two main > >> parts to resolving this: > >> a. Suggest a modelling in the flatbuffer model (e.g. > >> https://github.com/apache/arrow/blob/main/format/Schema.fbs) and c-ffi > ( > >> https://arrow.apache.org/docs/format/CDataInterface.html) to > disentangle > >> the two of them. > >> b. Update libraries to accommodate the decoupling, so data can be > >> exchanged with the new modelling. > >> > >> 2. How implementations carry this through and don't end up breaking > >> existing functionality around Arrow structures > >> > >> a. As Felipe pointed out I think at least a few implementations of > >> operations don't scale well to adding multiple different encodings. > >> Rethinking the architecture here might be required. > >> > >> > >> My take is that "1a" is probably the easy part. 1b and 2a are the hard > >> parts because there have been a lot of core assumptions baked into > >> libraries about the structure of Arrow array. If we can convince > >> ourselves that there is a tractable path forward for hard parts I would > >> guess it would go a long way to convincing skeptics. The easiest way of > >> proving or disproving something like this would be trying to prototype > >> something in at least one of the libraries to see what the complexity > looks > >> like (e.g. in arrow-rs or arrow-cpp which have sizable compute > >> implementations). > >> > >> In short, the spec changes probably are possibly not super large, but > the > >> downstream implications of them are. As you and others have pointed > out, I > >> think it is worth trying to tackle this issue for the long term > viability > >> of the project, but I'm concerned there might not be enough developer > >> bandwidth to make it happen. > >> > >> Cheers, > >> Micah > >> > >> > >> > >> > >> On Thu, Dec 18, 2025 at 11:54 AM Pierre Lacave <[email protected]> > wrote: > >> > >>> I've been following this thread with interest and wanted to share a > more > >>> specific concern regarding the potential overhead of data expansion > >> where I > >>> think this really matters in our case. > >>> > >>> > >>> It seems that as Parquet adopts more sophisticated, non-data-dependent > >>> encodings like ALP and FSST, there might be a risk of Arrow becoming a > >>> bottleneck if the format requires immediate "densification" into > standard > >>> arrays. 
> > On Fri, Dec 19, 2025 at 4:38 AM Micah Kornfield <[email protected]> wrote:
> >
> >> Hi Pierre,
> >>
> >> I'm somewhat expanding on Weston's and Felipe's points above, but wanted to make it explicit.
> >>
> >> There are really two problems being discussed.
> >>
> >> 1. Arrow's representation of the data (either in memory or on disk).
> >>
> >> I think the main complication brought up here is that Arrow's data stream model couples physical and logical type, making it hard to switch to different encodings between different batches of data. I think there are two main parts to resolving this:
> >> a. Suggest a modelling in the flatbuffer model (e.g. https://github.com/apache/arrow/blob/main/format/Schema.fbs) and c-ffi (https://arrow.apache.org/docs/format/CDataInterface.html) to disentangle the two of them.
> >> b. Update libraries to accommodate the decoupling, so data can be exchanged with the new modelling.
> >>
> >> 2. How implementations carry this through and don't end up breaking existing functionality around Arrow structures.
> >>
> >> a. As Felipe pointed out, I think at least a few implementations of operations don't scale well to adding multiple different encodings. Rethinking the architecture here might be required.
> >>
> >> My take is that "1a" is probably the easy part. 1b and 2a are the hard parts, because there have been a lot of core assumptions baked into libraries about the structure of an Arrow array. If we can convince ourselves that there is a tractable path forward for the hard parts, I would guess it would go a long way to convincing skeptics. The easiest way of proving or disproving something like this would be trying to prototype something in at least one of the libraries to see what the complexity looks like (e.g. in arrow-rs or arrow-cpp, which have sizable compute implementations).
> >>
> >> In short, the spec changes are probably not super large, but the downstream implications of them are. As you and others have pointed out, I think it is worth trying to tackle this issue for the long-term viability of the project, but I'm concerned there might not be enough developer bandwidth to make it happen.
> >>
> >> Cheers,
> >> Micah
> >>
> >> On Thu, Dec 18, 2025 at 11:54 AM Pierre Lacave <[email protected]> wrote:
> >>
> >>> I've been following this thread with interest and wanted to share a more specific concern regarding the potential overhead of data expansion, where I think this really matters in our case.
> >>>
> >>> It seems that as Parquet adopts more sophisticated, non-data-dependent encodings like ALP and FSST, there might be a risk of Arrow becoming a bottleneck if the format requires immediate "densification" into standard arrays.
> >>>
> >>> I am particularly thinking about use cases like compaction or merging of multiple files. In these scenarios, it feels like we might be paying a significant CPU and memory tax to expand encoded Parquet data into a flat Arrow representation, even if the operation itself is essentially a pass-through or a reorganization and might not strictly require that expansion.
> >>>
> >>> I (think) I understand the concerns with adding more types and late materialisation concepts.
> >>>
> >>> However, I'm curious if there is a path for Arrow to evolve that would allow this kind of efficiency while avoiding said complexity, perhaps by:
> >>> * Potentially exploring a negotiation mechanism where the consumer can opt out of full "densification" if it understands the underlying encoding - the onus of compute left to the user.
> >>> * Considering if "semi-opaque" or encoded vectors could be supported for operations that don't require full decoding, such as simple slices or merges.
> >>>
> >>> If the goal of Arrow is to remain the gold standard for interoperability, I wonder if it might eventually need to account for these "on-encoded data" patterns to keep pace with the efficiencies found in modern storage formats.
> >>>
> >>> I'd be very interested to hear if this is a direction the community feels is viable, or if these optimizations are viewed as being better handled strictly within the compute engines themselves.
> >>>
> >>> I am (still) quite new to Arrow and still catching up on the nuances of the specification, so please excuse any naive assumptions in my reasoning.
> >>>
> >>> Best!
> >>>
> >>> On 2025/12/18 17:55:58 Weston Pace wrote:
> >>>>> the real challenge is having compute kernels that are complete so they can be flexible enough to handle all the physical forms of a logical type
> >>>>
> >>>> Agreed. My experience has also been that even when compute systems claim to solve this problem (e.g. DuckDB), the implementations are still rather limited. Part of the problem is also, I think, that only a small fraction of cases are going to be compute bound in a way that this sort of optimization will help. Beyond major vendors (e.g. Databricks, Snowflake, Meta, ...) the scale just isn't there to justify the cost of development and maintenance.
> >>>>
> >>>>> The definition of any logical type can only be coupled to a compute system.
> >>>>
> >>>> I guess I disagree here. We did this once already in Substrait. I think there is value in an interoperable agreement of what constitutes a logical type. Maybe it can live in Substrait then, since there is also a growing collection there of compute function definitions and interpretations.
> >>>>
> >>>>> Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today.
> >>>>
> >>>> Why? My semi-formal internal definition has been: "We say types T1 and T2 are the same logical type if there exists a bidirectional 1:1 mapping function M from T1 to T2 such that given every function F and every value x we have F(x) = M(F(M(x)))". Under this definition dictionary and REE are just encodings, while uint8 and uint16 are different types (since they have different results for functions like 100+200).
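A toy, non-Arrow illustration of that definition, with all names invented here: decoding a dictionary commutes with a function like upper(), so dictionary-encoded and plain strings behave as one logical type, while u8 and u16 do not:

    // Plain strings vs. dictionary-encoded strings (values + keys).
    fn upper_plain(v: &[String]) -> Vec<String> {
        v.iter().map(|s| s.to_uppercase()).collect()
    }

    fn upper_dict(values: &[String], keys: &[usize]) -> (Vec<String>, Vec<usize>) {
        (upper_plain(values), keys.to_vec()) // operate on the dictionary only
    }

    fn decode(values: &[String], keys: &[usize]) -> Vec<String> {
        keys.iter().map(|&k| values[k].clone()).collect()
    }

    fn main() {
        let values = vec!["ok".to_string(), "err".to_string()];
        let keys = vec![0, 1, 0, 0];

        // F(M(x)) == M(F(x)): same result whether we decode before or after upper().
        let (v2, k2) = upper_dict(&values, &keys);
        assert_eq!(upper_plain(&decode(&values, &keys)), decode(&v2, &k2));

        // u8 and u16 are different logical types under this definition:
        assert_eq!(100u8.wrapping_add(200), 44); // wraps
        assert_eq!(100u16 + 200, 300);           // doesn't
    }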
> >>>>> Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet.
> >>>>
> >>>> This is true regardless.
> >>>>
> >>>>> it will be unrealistic to expect more than one implementation of such system.
> >>>>
> >>>> Agreed, implementing all the compute kernels is not an Arrow problem. Though maybe those waters have been muddied in the past, since Arrow implementations like arrow-cpp and arrow-rs have historically come with a collection of compute functions.
> >>>>
> >>>>> A semi-opaque format could be designed while allowing slicing
> >>>>
> >>>> Ah, by leaving the buffers untouched but setting the offset? That makes sense.
> >>>>
> >>>>> Things get trickier when concatenating arrays ... and exporting arrays ... without leaking data
> >>>>
> >>>> I think we can also add operations like filter and take to that list.
> >>>>
> >>>> On Tue, Dec 16, 2025 at 6:59 PM Felipe Oliveira Carvalho <[email protected]> wrote:
> >>>>
> >>>>> Please don't interpret this as a harsh comment, I mean well and don't speak for the whole community.
> >>>>>
> >>>>>> I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation.
> >>>>>
> >>>>> This separation between logical and physical gets mentioned a lot, but even if the type-representation problem is solved, the real challenge is having compute kernels that are complete, so they can be flexible enough to handle all the physical forms of a logical type like "string", or a smart enough function dispatcher that performs casts on the fly as a non-ideal fallback (e.g. expanding FSST-encoded strings to a more basic and widely supported format when the compute kernel is more naive).
> >>>>>
> >>>>> That is what makes introducing these ideas to Arrow so tricky. Arrow, being the data format that shines in interoperability, can't have unbounded complexity. The definition of any logical type can only be coupled to a compute system. Datafusion could define all the encodings that form a logical type, but it's really hard to specify what a logical type is meant to be on every compute system using Arrow. Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today. Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet. If you care less about multiple implementations of a format and interoperability, and can provide all compute functions in your closed system, then you can greatly expand the formats and encodings you support. But it will be unrealistic to expect more than one implementation of such a system.
> >>>>>
> >>>>>> Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type.
> >>>>>
> >>>>> Exactly. There are many simplifying assumptions that can be made when one gets an Arrow RecordBatch and wants to do something with it directly (i.e. without using a library of compute kernels).
> >>>>> It's already challenging enough to get people to stop converting columnar data to row-based arrays.
> >>>>>
> >>>>> My recommendation is that we start thinking about proposing formats and logical types to "compute systems" and not to the "Arrow data format". IMO "late materialization" doesn't make sense as an Arrow specification, unless it's a series of widely useful canonical extension types expressible through other storage types like binary or fixed-size-binary.
> >>>>>
> >>>>> A compute system (e.g. Datafusion) that intends to implement late materialization would have to expand operand representation a bit to take in non-materialized data handles in ways that aren't expressible in the open Arrow format.
> >>>>>
> >>>>>> Do you think we would come up with a semi-opaque array that could be sliced? Or that we would introduce the concept of an unsliceable array?
> >>>>>
> >>>>> The String/BinaryView format, when sliced, doesn't necessarily stop carrying the data buffers. A semi-opaque format could be designed while allowing slicing. Things get trickier when concatenating arrays (which is also an operation that is meant to discard unnecessary data when it reallocates buffers) and exporting arrays through the C Data Interface without leaking data that is not necessary in a slice.
> >>>>>
> >>>>> --
> >>>>> Felipe
> >>>>>
> >>>>> On Thu, Dec 11, 2025 at 10:33 AM Weston Pace <[email protected]> wrote:
> >>>>>
> >>>>>> I think this is a very interesting idea. This could potentially open up the door for things like adding compute kernels for these compressed representations to Arrow or Datafusion. Though it isn't without some challenges.
> >>>>>>
> >>>>>>> It seems FSSTStringVector/Array could potentially be modelled as an extension type
> >>>>>>> ...
> >>>>>>> This would however require a fixed dictionary, so might not be desirable.
> >>>>>>> ...
> >>>>>>> ALPFloatingPointVector and bit-packed vectors/arrays are more challenging to represent as extension types.
> >>>>>>> ...
> >>>>>>> Each batch of values has a different metadata parameter set.
> >>>>>>
> >>>>>> I think these are basically the same problem. From what I've seen in implementations, a format will typically introduce some kind of small-batch concept (I think every 1024 values in the FastLanes paper, IIRC). So either we need individual record batches for each small batch (in which case the Arrow representation is more straightforward but batches are quite small) or we need some concept of a batched array in Arrow. If we want individual record batches for each small batch, that requires the batch sizes to be consistent (in number of rows) between columns, and I don't know if that's always true.
> >>>>>>
> >>>>>>> One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
> >>>>>>> Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding.
> >>>>>> Pierre brings up another challenge in achieving this goal, which may be more significant. The compression and encoding techniques typically vary from page to page within Parquet (this is even more true in formats like FastLanes and Vortex). A column might use ALP for one page and then use PLAIN encoding for the next page. This makes it difficult to represent a stream of data with the typical Arrow schema we have today. I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation. Still, that can be an orthogonal discussion to FSST and ALP representation.
> >>>>>>
> >>>>>>> We could also experiment with Opaque vectors.
> >>>>>>
> >>>>>> This could be an interesting approach too. I don't know if they could be entirely opaque though. Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type. Do you think we would come up with a semi-opaque array that could be sliced? Or that we would introduce the concept of an unsliceable array?
> >>>>>>
> >>>>>> On Thu, Dec 11, 2025 at 5:27 AM Pierre Lacave <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> I am relatively new to this space, so I apologize if I am missing some context or history here. I wanted to share some observations from what I see happening with projects like Vortex.
> >>>>>>>
> >>>>>>> Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding. If the consumer engine supports the advanced encoding, it stays compressed and fast. If not, the data is "canonicalized" to standard Arrow arrays at the edge.
> >>>>>>>
> >>>>>>> As Parquet adopts these novel encodings, the current Arrow approach forces us to "densify" or decompress data immediately, even if the engine could have operated on the encoded data.
> >>>>>>>
> >>>>>>> Is there a world where Arrow could offer some sort of negotiation mechanism? The goal would be to guarantee the data can always be read as standard "safe" physical types (paying a cost only at the boundary), while allowing systems that understand the advanced encoding to let the data flow through efficiently.
> >>>>>>>
> >>>>>>> This sounds like it keeps the safety of the interoperability - Arrow making sure new encodings have a canonical representation - and it leaves the onus of implementing the efficient flow to the consumer - decoupling efficiency from interoperability.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Pierre
> >>>>>>>
> >>>>>>> On 2025/12/11 06:49:30 Micah Kornfield wrote:
> >>>>>>>> I think this is an interesting idea. Julien, do you have a proposal for scope? Is the intent to be 1:1 with any new encoding that is added to Parquet? For instance, would the intent be to also put cascading encodings in Arrow?
> >>>>>>>> We could also experiment with Opaque vectors.
> >>>>>>>>
> >>>>>>>> Did you mean this as a new type? I think this would be necessary for ALP. It seems FSSTStringVector/Array could potentially be modelled as an extension type (dictionary stored as part of the type metadata?) on top of a byte array. This would however require a fixed dictionary, so might not be desirable.
> >>>>>>>>
> >>>>>>>> ALPFloatingPointVector and bit-packed vectors/arrays are more challenging to represent as extension types:
> >>>>>>>>
> >>>>>>>> 1. There is no natural alignment with any of the existing types (and the bit-packing width can effectively vary by batch).
> >>>>>>>> 2. Each batch of values has a different metadata parameter set.
> >>>>>>>>
> >>>>>>>> So it seems there is no easy way out for the ALP encoding, and we either need to pay the cost of adding a new type (which is not necessarily trivial) or we would have to do some work to literally make a new opaque "Custom" type, which would have a buffer that is only interpretable based on its extension type. An easy way of shoe-horning this in would be to add a ParquetScalar extension type, which simply contains the decompressed but encoded Parquet page with repetition and definition levels stripped out. The latter also has its obvious downsides.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Micah
> >>>>>>>>
> >>>>>>>> [1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
> >>>>>>>> [2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
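For illustration only, the extension-type framing Micah sketches could be expressed with the standard ARROW:extension:* field-metadata keys over Binary storage; the "example.fsst" name and the idea of carrying the (fixed) symbol table as an already-serialized string in the type metadata are hypothetical, not an existing canonical type:

    use std::collections::HashMap;

    use arrow::datatypes::{DataType, Field};

    // Describe a hypothetical FSST extension type: Binary storage plus the
    // symbol table stashed in the extension metadata. `symbol_table` is
    // assumed to be serialized already (e.g. base64); the layout is invented.
    fn fsst_field(name: &str, symbol_table: &str) -> Field {
        let metadata = HashMap::from([
            ("ARROW:extension:name".to_string(), "example.fsst".to_string()),
            ("ARROW:extension:metadata".to_string(), symbol_table.to_string()),
        ]);
        Field::new(name, DataType::Binary, true).with_metadata(metadata)
    }

As Micah notes, this only works if the symbol table is fixed for the whole column, which is exactly what makes ALP and bit-packed data harder to express this way.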
> >>>>>>>> On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> I forgot to mention that those encodings have the particularity of allowing random access without decoding previous values.
> >>>>>>>>>
> >>>>>>>>> On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello,
> >>>>>>>>>> Parquet is in the process of adopting new encodings [1] (currently in POC stage), specifically ALP [2] and FSST [3].
> >>>>>>>>>> One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
> >>>>>>>>>> There are several advantages to this:
> >>>>>>>>>> - For example, if I summarize FSST as a variation of dictionary encoding on substrings in the values, one can evaluate some operations on encoded values without decoding them, saving memory and CPU.
> >>>>>>>>>> - Similarly, simplifying for brevity, ALP converts floating point values to small integers that are then bitpacked.
> >>>>>>>>>> The Vortex project argues that keeping encoded values in in-memory vectors opens up opportunities for performance improvements [4]; a third-party blog argues it's a problem as well [5].
> >>>>>>>>>>
> >>>>>>>>>> So I wanted to start a discussion to suggest that we might consider adding some additional vectors to support such encoded values, like an FSSTStringVector, for example. This would not be too different from the dictionary encoding, or an ALPFloatingPointVector with a bit-packed scheme not too different from what we use for nullability.
> >>>>>>>>>> We could also experiment with Opaque vectors.
> >>>>>>>>>>
> >>>>>>>>>> For reference, similarly motivated improvements have been done in the past [6].
> >>>>>>>>>>
> >>>>>>>>>> Thoughts?
> >>>>>>>>>>
> >>>>>>>>>> See:
> >>>>>>>>>> [1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> >>>>>>>>>> [2] https://github.com/apache/arrow/pull/48345
> >>>>>>>>>> [3] https://github.com/apache/arrow/pull/48232
> >>>>>>>>>> [4] https://docs.vortex.dev/#in-memory
> >>>>>>>>>> [5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> >>>>>>>>>> [6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
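For readers unfamiliar with ALP, a grossly simplified sketch of the transformation Julien summarizes: pick a decimal exponent, store round(v * 10^e) as small integers (which are then bit-packed), and decode by dividing back. Real ALP chooses the exponent per vector, tracks values that don't round-trip as exceptions, and applies frame-of-reference plus bit-packing, none of which is shown here:

    fn alp_encode(values: &[f64], e: i32) -> Vec<i64> {
        values.iter().map(|v| (v * 10f64.powi(e)).round() as i64).collect()
    }

    fn alp_decode(encoded: &[i64], e: i32) -> Vec<f64> {
        encoded.iter().map(|&i| i as f64 / 10f64.powi(e)).collect()
    }

    fn main() {
        let prices = [1.23, 4.56, 7.89];
        let encoded = alp_encode(&prices, 2); // [123, 456, 789]: small, packable integers
        assert_eq!(alp_decode(&encoded, 2), prices);
    }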
