Re: [DISCUSSION] Alternative: introduce VECTOR repetition type

Rahil C Wed, 04 Mar 2026 07:07:34 -0800

Thanks Antoine and Rok for raising the respective spec change PRs.

I am interested in helping with this initiative, specifically for
parquet-java, but I'd like more clarity from the Parquet community on the
implementation steps required once we align on the spec changes. If there
is a board or some other way to break down the work, perhaps on GitHub or
the mailing list, I and others could help with some of the tasks.


Regards,
Rahil Chertara

On Wed, Mar 4, 2026 at 1:38 AM Antoine Pitrou <[email protected]> wrote:

>
> Hi,
>
> As an alternative, we could perhaps add a new repetition type so that
> the physical type remains the actual child value type.
>
> Here is a draft change against the Thrift definitions:
>
> https://github.com/apache/parquet-format/compare/master...pitrou:vector-repetition
>
> I will not be able to work on this personally, so if it is deemed
> promising, someone else should take it up :-)
>
> Regards
>
> Antoine.
>
>
> On 03/03/2026 at 20:57, Antoine Pitrou wrote:
> >
> > Hello,
> >
> > The downside with this approach is that the top-level "unit" type is not
> > the element type.
> >
> > For example, if you have a FIXED_SIZE_LIST(FLOAT32, 3), then the
> > top-level unit type is FIXED_LEN_BYTE_ARRAY(12). This means that
> > specialized encodings such as BYTE_STREAM_SPLIT, DELTA_BINARY_PACKED or
> > ALP may either be less efficient (for BYTE_STREAM_SPLIT) or not be
> > applicable at all (for the latter two).
> >
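To make the BYTE_STREAM_SPLIT point concrete, here is a minimal sketch of how the byte streams differ when the unit is the 4-byte element versus the 12-byte FIXED_LEN_BYTE_ARRAY. The helper `byte_stream_split` is a hypothetical illustration written for this note, not code from any Parquet implementation; it only mimics the "stream k holds byte k of every value" layout.

```python
import struct

def byte_stream_split(data: bytes, width: int) -> list[bytes]:
    """Split a run of `width`-byte values into `width` byte streams.

    Stream k holds byte k of every value, which is the layout that
    BYTE_STREAM_SPLIT produces. Hypothetical helper for illustration only.
    """
    if width <= 0 or len(data) % width != 0:
        raise ValueError("data length must be a positive multiple of width")
    return [data[k::width] for k in range(width)]

# Two 3-element FLOAT vectors, little-endian as in the PLAIN encoding.
vectors = [(1.0, 2.0, 3.0), (1.5, 2.5, 3.5)]
raw = b"".join(struct.pack("<3f", *v) for v in vectors)

# Element type FLOAT as the unit: 4 streams, each covering all 6 floats.
per_element = byte_stream_split(raw, 4)
# Unit type FIXED_LEN_BYTE_ARRAY(12): 12 streams, each covering only 2 vectors.
per_unit = byte_stream_split(raw, 12)

print(len(per_element), len(per_element[0]))  # 4 6
print(len(per_unit), len(per_unit[0]))        # 12 2
```

With the element as the unit, the sign/exponent bytes of all six floats end up in one stream; with the 12-byte unit they are spread over three separate, shorter streams.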
> > I wonder if we can find an approach that doesn't emit repetition levels
> > but still allows using efficient encodings for the element type.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 03/03/2026 at 01:13, Rok Mihevc wrote:
> >> Hi all,
> >>
> >> I'd like to resurrect this thread in light of the recent discussion on
> >> vectors in Parquet [1].
> >> There is a (now updated) proposal PR from when the thread was started
> >> that has a nice discussion [2].
> >>
> >> TLDR of the current proposal:
> >> - FIXED_SIZE_LIST annotates a FIXED_LEN_BYTE_ARRAY primitive leaf with
> >> FixedSizeListType { type, num_values }.
> >> - type must be fixed-width and non-array (INT32, INT64, FLOAT, DOUBLE,
> >> FIXED_LEN_BYTE_ARRAY); num_values > 0.
> >> - type_length must equal the byte size of num_values values of type in
> >> their PLAIN representation.
> >> - If the field is optional, the whole list value may be null; elements
> >> are always non-null.
> >> - Intentionally not a `LIST` encoding (no def/rep levels).
> >> - Outer page/column encoding behavior is unchanged (any encoding valid
> >> for `FIXED_LEN_BYTE_ARRAY` remains valid).
> >>
> >> I also added explicit validity requirements: writers must not emit
> >> violating metadata, and readers must treat violating metadata as
> >> invalid.
> >>
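The type_length rule in the summary above can be sketched as a small check. This is only an illustration under assumptions: the names `PLAIN_WIDTH`, `expected_type_length` and `is_valid_fixed_size_list` are hypothetical, and a FIXED_LEN_BYTE_ARRAY element type is left out because the summary does not say where its element width would be recorded.

```python
# PLAIN-encoded width in bytes of the fixed-width element types named above.
# FIXED_LEN_BYTE_ARRAY elements are deliberately omitted from this sketch.
PLAIN_WIDTH = {"INT32": 4, "INT64": 8, "FLOAT": 4, "DOUBLE": 8}

def expected_type_length(element_type: str, num_values: int) -> int:
    """Return the type_length implied by (element_type, num_values)."""
    if num_values <= 0:
        raise ValueError("num_values must be > 0")
    if element_type not in PLAIN_WIDTH:
        raise ValueError(f"element type not covered by this sketch: {element_type}")
    return num_values * PLAIN_WIDTH[element_type]

def is_valid_fixed_size_list(element_type: str, num_values: int, type_length: int) -> bool:
    """A reader-side check: metadata violating the rules is treated as invalid."""
    try:
        return type_length == expected_type_length(element_type, num_values)
    except ValueError:
        return False

print(expected_type_length("FLOAT", 3))            # 12
print(is_valid_fixed_size_list("FLOAT", 3, 16))    # False
```

A writer could run the same check before emitting metadata, which matches the "writers must not emit, readers must reject" split described above.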
> >>
> >> Rok
> >>
> >> [1] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
> >> [2] https://github.com/apache/parquet-format/pull/241
> >>
> >> On Thu, May 16, 2024 at 4:34 AM Jan Finis <[email protected]> wrote:
> >>
> >>> I would love to see this!
> >>>
> >>> It is an important optimization for vectors, which are becoming more
> >>> and more important and ubiquitous for grounding LLMs.
> >>>
> >>> Note however that the logical type route has one drawback: A logical
> >>> type may not change the physical representation of values! Thus, if we
> >>> make FIXED_SIZE_LIST just a logical type, we would still need to write
> >>> R-Levels, as even clients not knowing this logical type need to be able
> >>> to decode the column. We could avoid reading the R-Levels and just
> >>> assume that each list has the fixed size, so the read path would be
> >>> optimized but the write path wouldn't.
> >>>
> >>> If we want to avoid writing R-Levels altogether, a logical type doesn't
> >>> cut it. It needs to be something different. E.g., in the schema, we
> >>> could store an optional `count` for repeated fields. Whenever this count
> >>> is present, we would not write R-Levels for this field (or more
> >>> precisely, this field would not take part in the R-Level computation, as
> >>> if it wasn't a repeated field). This of course is a more intrusive
> >>> change, as legacy clients couldn't read such columns anymore.
> >>>
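As a rough illustration of why the R-Levels carry no information once a `count` is known, here is a simplified sketch. It assumes a single level of repetition with non-null, non-empty lists; both function names are hypothetical and nothing here reflects an actual reader or writer.

```python
def repetition_levels(lists):
    """R-Levels for one repeated field with max repetition level 1.

    The first value of each list starts a new record (level 0); every
    following value continues the same list (level 1).
    """
    levels = []
    for values in lists:
        for i, _ in enumerate(values):
            levels.append(0 if i == 0 else 1)
    return levels

def split_by_count(flat, count):
    """Rebuild the lists from the flat values using only a fixed count."""
    if count <= 0 or len(flat) % count != 0:
        raise ValueError("flat length must be a positive multiple of count")
    return [flat[i:i + count] for i in range(0, len(flat), count)]

lists = [[1, 2, 3], [4, 5, 6]]
flat = [v for values in lists for v in values]

print(repetition_levels(lists))   # [0, 1, 1, 0, 1, 1]
print(split_by_count(flat, 3))    # [[1, 2, 3], [4, 5, 6]]
```

When every list has the same length, the level sequence is a fixed repeating pattern, so a reader that knows the count can recover the list boundaries without it.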
> >>> I don't know which of the two alternatives is better. I agree with Gang
> >>> that we should probably discuss this in a PR.
> >>>
> >>> Cheers,
> >>> Jan
> >>>
> >>>
> >>> On Wed, May 15, 2024 at 14:03, Gang Wu <[email protected]> wrote:
> >>>
> >>>> Hi Rok,
> >>>>
> >>>> Happy to see you here :)
> >>>>
> >>>> According to my past experience, it would be more helpful to open
> >>>> a PR against the parquet-format repository and post it here.
> >>>>
> >>>> Best,
> >>>> Gang
> >>>>
> >>>> On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <[email protected]> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Arrow recently introduced FixedShapeTensor and VariableShapeTensor
> >>>>> canonical extension types [1] that use FixedSizeList and
> >>>>> StructArray(List, FixedSizeList) as storage, respectively. These are
> >>>>> targeted at machine learning and scientific applications that deal
> >>>>> with large datasets and would benefit from using Parquet as on-disk
> >>>>> storage.
> >>>>>
> >>>>> However, FixedSizeList is currently stored as List in Parquet, which
> >>>>> adds significant conversion overhead when reading and writing [2]. It
> >>>>> would therefore be beneficial to introduce a FIXED_SIZE_LIST logical
> >>>>> type.
> >>>>>
> >>>>> I would like to open a discussion on potentially adding a
> >>>>> FIXED_SIZE_LIST type and prepare a proposal if the discussion supports
> >>>>> it.
> >>>>>
> >>>>>
> >>>>> Best,
> >>>>> Rok
> >>>>>
> >>>>> [1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
> >>>>> [2] https://github.com/apache/arrow/issues/34510
> >>>>>
> >>>>
> >>>
> >>
> >
> >
> >
>
>
>
