Hi all,

A short update on the progress of this work. The state of the discussion
can be seen in the design doc [1].
I've created a set of naive C++ implementations of the discussed designs
(options A, B and C: [3], [4], [5]); the benchmarks are in [2].
The results should be taken with a grain of salt: treat them as directional
rather than quantitative.

Personally, I'm leaning towards option B because it is quite expressive
while still providing significant storage and write performance
improvements.
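
For readers less familiar with repetition levels, here is a minimal Python
sketch of why they carry no information for fixed-size lists. It is an
illustration only, not code from the gist or the PRs; the function names are
hypothetical, and it assumes a single-level LIST with non-empty lists and
non-null elements (definition levels are not modelled).

```python
from typing import List


def rep_levels_for_lists(rows: List[List[float]]) -> List[int]:
    """Repetition levels for a single-level LIST column.

    0 marks the first element of a new row, 1 marks a continuation
    of the current list. Empty lists are rejected because they would
    need definition levels, which this sketch does not model.
    """
    levels: List[int] = []
    for row in rows:
        if not row:
            raise ValueError("empty lists are not modelled in this sketch")
        levels.extend([0] + [1] * (len(row) - 1))
    return levels


def rep_levels_for_fixed_size(num_rows: int, list_size: int) -> List[int]:
    """The same levels, derived from the row count and fixed size alone."""
    if list_size <= 0:
        raise ValueError("list_size must be positive")
    return ([0] + [1] * (list_size - 1)) * num_rows


# Two rows of three floats each, as a LIST writer would see them.
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
levels = rep_levels_for_lists(rows)

# A reader that knows the fixed size can rebuild the levels without reading them.
derived = rep_levels_for_fixed_size(num_rows=len(rows), list_size=3)
```

Since `derived` equals `levels` whenever every list has the same length, a
reader that knows the size can skip decoding them, and a format that does not
require them can skip writing them as well.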

[1] https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?usp=sharing
[2] https://gist.github.com/rok/fe4785d4a74d2e080cbad73e88cc1bef - benchmarks
[3] https://github.com/rok/arrow/pull/53 - option A
[4] https://github.com/rok/arrow/pull/51 - option B
[5] https://github.com/rok/arrow/pull/52 - option C

Rok

On Tue, May 5, 2026 at 3:21 PM Rok Mihevc <[email protected]> wrote:

> Hi all,
>
> Picking this thread back up. I've put together a design doc outlining
> three options we've discussed:
>
> https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?usp=sharing
>
> * Option A: logical type annotating FIXED_LEN_BYTE_ARRAY.
> * Option B: new VECTOR repetition type.
> * Option C: logical type annotating a normal LIST, where a recognizing
> reader skips rep-level decode and an unknown reader still sees a working
> LIST. A future revision would let writers omit rep-levels entirely.
>
> The document evaluates these against the same requirements and compares
> them along six axes (backwards compatibility, composability, encoding
> flexibility, implementation complexity, on-disk overhead, read
> performance). The doc aims to centralize the discussion and help us pick a
> direction.
> Comments are open. The most useful pushback would be on the requirements
> (especially the "no-fallback breaks adoption" one).
>
> Best,
> Rok
>
> On Tue, Mar 3, 2026 at 8:58 PM Antoine Pitrou <[email protected]> wrote:
>
>>
>> Hello,
>>
>> The downside with this approach is that the top-level "unit" type is not
>> the element type.
>>
>> For example, if you have a FIXED_SIZE_LIST(FLOAT32, 3), then the
>> top-level unit type is FIXED_LEN_BYTE_ARRAY(12). This means that
>> specialized encodings such as BYTE_STREAM_SPLIT, DELTA_BINARY_PACKED or
>> ALP may either be less efficient (for BYTE_STREAM_SPLIT) or not be
>> applicable at all (for the latter two).
>>
>> I wonder if we can find an approach that doesn't emit repetition levels
>> but still allows using efficient encodings for the element type.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 03/03/2026 at 01:13, Rok Mihevc wrote:
>> > Hi all,
>> >
>> > I'd like to resurrect this thread in light of the recent discussion of
>> > vectors in Parquet [1].
>> > There is a (now updated) proposal PR from when the thread was started that
>> > has a nice discussion [2].
>> >
>> > TLDR of the current proposal:
>> > - FIXED_SIZE_LIST annotates a FIXED_LEN_BYTE_ARRAY primitive leaf with
>> > FixedSizeListType { type, num_values }.
>> > - type must be fixed-width and non-array (INT32, INT64, FLOAT, DOUBLE,
>> > FIXED_LEN_BYTE_ARRAY); num_values > 0.
>> > - type_length must equal the byte size of num_values values of type in
>> > PLAIN encoding.
>> > - If the field is optional, the whole list value may be null; elements are
>> > always non-null.
>> > - Intentionally not a `LIST` encoding (no def/rep levels).
>> > - Outer page/column encoding behavior is unchanged (any encoding valid for
>> > `FIXED_LEN_BYTE_ARRAY` remains valid).
>> >
>> > I also added explicit validity requirements: writers must not emit
>> > violating metadata, and readers must treat violating metadata as invalid.
>> >
>> >
>> > Rok
>> >
>> > [1] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
>> > [2] https://github.com/apache/parquet-format/pull/241
>> >
>> > On Thu, May 16, 2024 at 4:34 AM Jan Finis <[email protected]> wrote:
>> >
>> >> I would love to see this!
>> >>
>> >> It is an important optimization for vectors, which become more and more
>> >> important and ubiquitous for grounding of LLMs.
>> >>
>> >> Note however that the logical type route has one drawback: A logical type
>> >> may not change the physical representation of values! Thus, if we make
>> >> FIXED_SIZE_LIST just a logical type, we would still need to write R-Levels,
>> >> as even clients not knowing this logical type need to be able to decode the
>> >> column. We could avoid reading the R-Levels and just assume that each list
>> >> has the fixed size, so the read path would be optimized but the write path
>> >> wouldn't.
>> >>
>> >> If we want to avoid writing R-Levels altogether, a logical type doesn't cut
>> >> it. It needs to be something different. E.g., in the schema, we could store
>> >> an optional `count` for repeated fields. Whenever this count is present, we
>> >> would not write R-Levels for this field (or more precisely, this field
>> >> would not take part in the R-Level computation, as if it wasn't a repeated
>> >> field). This of course is a more intrusive change, as legacy clients
>> >> couldn't read such columns anymore.
>> >>
>> >> I don't know which of the two alternatives is better. I agree with Gang
>> >> that we should probably discuss this in a PR.
>> >>
>> >> Cheers,
>> >> Jan
>> >>
>> >>
>> >> On Wed, May 15, 2024 at 14:03, Gang Wu <[email protected]> wrote:
>> >>
>> >>> Hi Rok,
>> >>>
>> >>> Happy to see you here :)
>> >>>
>> >>> According to my past experience, it would be more helpful to open
>> >>> a PR against the parquet-format repository and post it here.
>> >>>
>> >>> Best,
>> >>> Gang
>> >>>
>> >>> On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <[email protected]> wrote:
>> >>>
>> >>>> Hi all,
>> >>>>
>> >>>> Arrow recently introduced FixedShapeTensor and VariableShapeTensor
>> >>>> canonical extension types [1] that use FixedSizeList and StructArray(List,
>> >>>> FixedSizeList) as storage, respectively. These are targeted at machine
>> >>>> learning and scientific applications that deal with large datasets and
>> >>>> would benefit from using Parquet as on-disk storage.
>> >>>>
>> >>>> However, FixedSizeList is currently stored as List in Parquet, which adds
>> >>>> significant conversion overhead when reading and writing [2]. It would
>> >>>> therefore be beneficial to introduce a FIXED_SIZE_LIST logical type.
>> >>>>
>> >>>> I would like to open a discussion on potentially adding a FIXED_SIZE_LIST
>> >>>> type and prepare a proposal if the discussion supports it.
>> >>>>
>> >>>>
>> >>>> Best,
>> >>>> Rok
>> >>>>
>> >>>> [1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
>> >>>> [2] https://github.com/apache/arrow/issues/34510
>> >>>>
>> >>>
>> >>
>> >
>>
>>
>>
