Hi all,

A short update on the progress of this work. The state of the discussion can be seen in [1].
I've created a set of naive C++ implementations of the discussed designs; see here:
https://gist.github.com/rok/fe4785d4a74d2e080cbad73e88cc1bef
The results should be taken with a grain of salt: they are directional rather than quantitative.
Personally I'm leaning towards option B because it is quite expressive while still providing significant storage and write performance improvements.

[1] https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?usp=sharing
[2] https://gist.github.com/rok/fe4785d4a74d2e080cbad73e88cc1bef - benchmarks
[3] https://github.com/rok/arrow/pull/53 - option A
[4] https://github.com/rok/arrow/pull/51 - option B
[5] https://github.com/rok/arrow/pull/52 - option C

Rok

On Tue, May 5, 2026 at 3:21 PM Rok Mihevc <[email protected]> wrote:

> Hi all,
>
> Picking this thread back up. I've put together a design doc outlining
> three options we've discussed:
>
> https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?usp=sharing
>
> * Option A: logical type annotating FIXED_LEN_BYTE_ARRAY.
> * Option B: new VECTOR repetition type.
> * Option C: logical type annotating a normal LIST, where a recognizing
>   reader skips rep-level decode and an unknown reader still sees a working
>   LIST. A future revision would let writers omit rep-levels entirely.
>
> The document evaluates these against the same requirements and compares
> them along six axes (backwards compatibility, composability, encoding
> flexibility, implementation complexity, on-disk overhead, read
> performance). The doc aims to centralize the discussion and help us pick a
> direction.
> Comments are open. Most useful pushback would be on the requirements
> (especially the "no-fallback breaks adoption" one).
>
> Best,
> Rok
>
> On Tue, Mar 3, 2026 at 8:58 PM Antoine Pitrou <[email protected]> wrote:
>
>>
>> Hello,
>>
>> The downside with this approach is that the top-level "unit" type is not
>> the element type.
>>
>> For example, if you have a FIXED_SIZE_LIST(FLOAT32, 3), then the
>> top-level unit type is FIXED_LEN_BYTE_ARRAY(12).
>> This means that specialized encodings such as BYTE_STREAM_SPLIT,
>> DELTA_BINARY_PACKED or ALP may either be less efficient (for
>> BYTE_STREAM_SPLIT) or not be applicable at all (for the latter two).
>>
>> I wonder if we can find an approach that doesn't emit repetition levels
>> but still allows using efficient encodings for the element type.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 03/03/2026 at 01:13, Rok Mihevc wrote:
>> > Hi all,
>> >
>> > I'd like to resurrect this thread in light of recent vectors in Parquet
>> > discussion [1].
>> > There is a (now updated) proposal PR from when the thread was started that
>> > has a nice discussion [2].
>> >
>> > TLDR of the current proposal:
>> > - FIXED_SIZE_LIST annotates a FIXED_LEN_BYTE_ARRAY primitive leaf with
>> >   FixedSizeListType { type, num_values }.
>> > - type must be fixed-width and non-array (INT32, INT64, FLOAT, DOUBLE,
>> >   FIXED_LEN_BYTE_ARRAY); num_values > 0.
>> > - type_length must match num_values encoded with PLAIN representation of
>> >   type.
>> > - If the field is optional, the whole list value may be null; elements are
>> >   always non-null.
>> > - Intentionally not a `LIST` encoding (no def/rep levels).
>> > - Outer page/column encoding behavior is unchanged (any encoding valid for
>> >   `FIXED_LEN_BYTE_ARRAY` remains valid).
>> >
>> > I also added explicit validity requirements: writers must not emit
>> > violating metadata, and readers must treat violating metadata as invalid.
>> >
>> >
>> > Rok
>> >
>> > [1] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
>> > [2] https://github.com/apache/parquet-format/pull/241
>> >
>> > On Thu, May 16, 2024 at 4:34 AM Jan Finis <[email protected]> wrote:
>> >
>> >> I would love to see this!
>> >>
>> >> It is an important optimization for vectors, which become more and more
>> >> important and ubiquitous for grounding of LLMs.
>> >>
>> >> Note however that the logical type route has one drawback: A logical type
>> >> may not change the physical representation of values! Thus, if we make
>> >> FIXED_SIZE_LIST just a logical type, we would still need to write R-Levels,
>> >> as even clients not knowing this logical type need to be able to decode the
>> >> column. We could avoid reading the R-Levels and just assume that each list
>> >> has the fixed size, so the read path would be optimized but the write path
>> >> wouldn't.
>> >>
>> >> If we want to avoid writing R-Levels altogether, a logical type doesn't cut
>> >> it. It needs to be something different. E.g., in the schema, we could store
>> >> an optional `count` for repeated fields. Whenever this count is present, we
>> >> would not write R-Levels for this field (or more precisely, this field
>> >> would not take part in the R-Level computation, as if it wasn't a repeated
>> >> field). This of course is a more intrusive change, as legacy clients
>> >> couldn't read such columns anymore.
>> >>
>> >> I don't know which of the two alternatives is better. I agree with Gang
>> >> that we should probably discuss this in a PR.
>> >>
>> >> Cheers,
>> >> Jan
>> >>
>> >>
>> >> On Wed, May 15, 2024 at 14:03, Gang Wu <[email protected]> wrote:
>> >>
>> >>> Hi Rok,
>> >>>
>> >>> Happy to see you here :)
>> >>>
>> >>> According to my past experience, it would be more helpful to open
>> >>> a PR against the parquet-format repository and post it here.
>> >>>
>> >>> Best,
>> >>> Gang
>> >>>
>> >>> On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <[email protected]> wrote:
>> >>>
>> >>>> Hi all,
>> >>>>
>> >>>> Arrow recently introduced FixedShapeTensor and VariableShapeTensor
>> >>>> canonical extension types [1] that use FixedSizeList and
>> >>>> StructArray(List, FixedSizeList) as storage respectively.
>> >>>> These are targeted at machine
>> >>>> learning and scientific applications that deal with large datasets and
>> >>>> would benefit from using Parquet as on disk storage.
>> >>>>
>> >>>> However currently FixedSizeList is stored as List in Parquet which adds
>> >>>> significant conversion overhead when reading and writing [2]. It would
>> >>>> therefore be beneficial to introduce a FIXED_SIZE_LIST logical type.
>> >>>>
>> >>>> I would like to open a discussion on potentially adding FIXED_SIZE_LIST
>> >>>> type and prepare a proposal if discussion supports it.
>> >>>>
>> >>>>
>> >>>> Best,
>> >>>> Rok
>> >>>>
>> >>>> [1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
>> >>>> [2] https://github.com/apache/arrow/issues/34510
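
As a rough illustration of the validity rule in the quoted FIXED_SIZE_LIST proposal (type_length must match num_values values of type in PLAIN representation), here is a minimal Python sketch. The function names, the string type names used as dictionary keys, and the element_length parameter for FIXED_LEN_BYTE_ARRAY elements are all hypothetical; none of them come from parquet-format or from any Parquet library. Only the PLAIN byte widths of INT32, INT64, FLOAT and DOUBLE are taken as given.

```python
# Sketch only: every name here is hypothetical and not part of parquet-format
# or of any Parquet implementation.

# Byte width of one value in PLAIN encoding for the fixed-width physical types.
PLAIN_WIDTH = {"INT32": 4, "INT64": 8, "FLOAT": 4, "DOUBLE": 8}


def expected_type_length(element_type, num_values, element_length=None):
    """Return the FIXED_LEN_BYTE_ARRAY type_length implied by the annotation."""
    if num_values <= 0:
        raise ValueError("num_values must be > 0")
    if element_type == "FIXED_LEN_BYTE_ARRAY":
        # Assumption: the element's own byte length is supplied separately.
        if element_length is None or element_length <= 0:
            raise ValueError("element_length is required for FIXED_LEN_BYTE_ARRAY elements")
        width = element_length
    elif element_type in PLAIN_WIDTH:
        width = PLAIN_WIDTH[element_type]
    else:
        raise ValueError(f"unsupported element type: {element_type}")
    return width * num_values


def is_valid_annotation(type_length, element_type, num_values, element_length=None):
    """Reader-side check: metadata that violates the rule is treated as invalid."""
    try:
        expected = expected_type_length(element_type, num_values, element_length)
    except ValueError:
        return False
    return type_length == expected
```

With the example from the thread, a list of three FLOAT values gives expected_type_length("FLOAT", 3) == 12, which matches FIXED_LEN_BYTE_ARRAY(12). A writer could run the same check before emitting metadata, and a reader could reject a column for which is_valid_annotation returns False.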
