Thanks for the proposal, Antoine!
I prefer this suggestion [1] over the current PR [2]. I would be happy to
drive it to completion if we get consensus that it is preferable.
(Looking at past comments, I get the feeling it might be.)

Looking forward to hearing from others on this!

Rok

[1] https://github.com/apache/parquet-format/compare/master...pitrou:vector-repetition
[2] https://github.com/apache/parquet-format/pull/241

On Wed, Mar 4, 2026 at 4:07 PM Rahil C <[email protected]> wrote:

> Thanks Antoine and Rok for raising the respective spec change PRs.
>
> I am interested in helping with this initiative, specifically for
> parquet-java, but I'd like more clarity from the Parquet community on the
> implementation steps required once we align on the spec changes. If there
> is a board or another way to break down the work (perhaps on GitHub or the
> mailing list), I and others could help with some of the tasks.
>
> Regards,
> Rahil Chertara
>
> On Wed, Mar 4, 2026 at 1:38 AM Antoine Pitrou <[email protected]> wrote:
>
> >
> > Hi,
> >
> > As an alternative, we could perhaps add a new repetition type so that
> > the physical type remains the actual child value type.
> >
> > Here is a draft change against the Thrift definitions:
> >
> >
> > > https://github.com/apache/parquet-format/compare/master...pitrou:vector-repetition
> >
> > I will not be able to work on this personally, so if it is deemed
> > promising, someone else should take it up :-)
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > On 03/03/2026 at 20:57, Antoine Pitrou wrote:
> > >
> > > Hello,
> > >
> > > > The downside with this approach is that the top-level "unit" type is
> > > > not the element type.
> > >
> > > For example, if you have a FIXED_SIZE_LIST(FLOAT32, 3), then the
> > > top-level unit type is FIXED_LEN_BYTE_ARRAY(12). This means that
> > > specialized encodings such as BYTE_STREAM_SPLIT, DELTA_BINARY_PACKED or
> > > ALP may either be less efficient (for BYTE_STREAM_SPLIT) or not be
> > > applicable at all (for the latter two).
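
To make the point above concrete, here is a minimal sketch of how BYTE_STREAM_SPLIT groups bytes when three FLOAT elements are packed into one FIXED_LEN_BYTE_ARRAY(12) value, compared with the same data kept as individual FLOAT values. The helper `byte_stream_split` and the sample vectors are illustrative assumptions, not code from any Parquet implementation.

```python
import struct


def byte_stream_split(values):
    """Scatter byte k of every value into stream k, then concatenate the streams."""
    width = len(values[0])
    return b"".join(bytes(v[k] for v in values) for k in range(width))


# Two hypothetical 3-element float32 vectors (illustrative data only).
vectors = [(1.0, 2.0, 3.0), (1.5, 2.5, 3.5)]

# Unit type is FIXED_LEN_BYTE_ARRAY(12): one 12-byte value per vector.
flba_values = [struct.pack("<3f", *v) for v in vectors]

# Unit type is FLOAT: six 4-byte values.
float_values = [struct.pack("<f", x) for v in vectors for x in v]

# 12 streams of 2 bytes each: bytes of equal significance from different
# elements of the same vector end up in different streams.
split_as_flba = byte_stream_split(flba_values)

# 4 streams of 6 bytes each: every stream holds bytes of equal significance.
split_as_float = byte_stream_split(float_values)
```

With the FLOAT unit type, the two low-order byte streams of this sample are all zeros, which is the kind of regularity a downstream compressor can exploit. With the 12-byte unit type, the same bytes are spread over more, shorter streams.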
> > >
> > > I wonder if we can find an approach that doesn't emit repetition levels
> > > but still allows using efficient encodings for the element type.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > > On 03/03/2026 at 01:13, Rok Mihevc wrote:
> > >> Hi all,
> > >>
> > > >> I'd like to resurrect this thread in light of the recent discussion of
> > > >> vectors in Parquet [1].
> > > >> There is a (now updated) proposal PR from when the thread was started
> > > >> that has a nice discussion [2].
> > >>
> > >> TLDR of the current proposal:
> > > >> - FIXED_SIZE_LIST annotates a FIXED_LEN_BYTE_ARRAY primitive leaf with
> > > >>   FixedSizeListType { type, num_values }.
> > > >> - type must be fixed-width and non-array (INT32, INT64, FLOAT, DOUBLE,
> > > >>   FIXED_LEN_BYTE_ARRAY); num_values > 0.
> > > >> - type_length must equal the byte size of num_values values of type
> > > >>   in PLAIN representation.
> > > >> - If the field is optional, the whole list value may be null; elements
> > > >>   are always non-null.
> > > >> - Intentionally not a `LIST` encoding (no def/rep levels).
> > > >> - Outer page/column encoding behavior is unchanged (any encoding valid
> > > >>   for `FIXED_LEN_BYTE_ARRAY` remains valid).
> > > >>
> > > >> I also added explicit validity requirements: writers must not emit
> > > >> violating metadata, and readers must treat violating metadata as
> > > >> invalid.
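
The constraints listed above could be checked roughly as follows. This is only a sketch: the function names and the `element_length` parameter (the byte width of a FIXED_LEN_BYTE_ARRAY element) are assumptions made for illustration and do not come from the proposal PR.

```python
# PLAIN-encoded byte widths of the fixed-width primitive types named above.
PLAIN_WIDTH = {"INT32": 4, "INT64": 8, "FLOAT": 4, "DOUBLE": 8}


def expected_type_length(element_type, num_values, element_length=None):
    """Return the type_length implied by the element type and num_values."""
    if num_values <= 0:
        raise ValueError("num_values must be > 0")
    if element_type == "FIXED_LEN_BYTE_ARRAY":
        # Assumption: the element's own byte width is supplied separately.
        if element_length is None or element_length <= 0:
            raise ValueError("FIXED_LEN_BYTE_ARRAY elements need a positive length")
        width = element_length
    elif element_type in PLAIN_WIDTH:
        width = PLAIN_WIDTH[element_type]
    else:
        raise ValueError(f"{element_type} is not a fixed-width element type")
    return num_values * width


def is_valid_fixed_size_list(element_type, num_values, type_length, element_length=None):
    """Reader-side check: metadata that violates the constraints is invalid."""
    try:
        return type_length == expected_type_length(element_type, num_values, element_length)
    except ValueError:
        return False
```

For example, `is_valid_fixed_size_list("FLOAT", 3, 12)` accepts a three-element FLOAT list stored in a 12-byte leaf, while a `type_length` of 16 for the same annotation would be rejected.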
> > >>
> > >>
> > >> Rok
> > >>
> > >> [1] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
> > >> [2] https://github.com/apache/parquet-format/pull/241
> > >>
> > >> On Thu, May 16, 2024 at 4:34 AM Jan Finis <[email protected]> wrote:
> > >>
> > >>> I would love to see this!
> > >>>
> > > >>> It is an important optimization for vectors, which become more and
> > > >>> more important and ubiquitous for grounding of LLMs.
> > >>>
> > > >>> Note however that the logical type route has one drawback: A logical
> > > >>> type may not change the physical representation of values! Thus, if we
> > > >>> make FIXED_SIZE_LIST just a logical type, we would still need to write
> > > >>> R-Levels, as even clients not knowing this logical type need to be
> > > >>> able to decode the column. We could avoid reading the R-Levels and
> > > >>> just assume that each list has the fixed size, so the read path would
> > > >>> be optimized but the write path wouldn't.
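
A small sketch of the asymmetry described above, assuming a single-level list column with no nulls and no empty lists. Both function names are hypothetical. The writer still has to produce one R-Level per element, while a reader that knows the fixed size can rebuild the lists from the flat values alone.

```python
def repetition_levels(num_lists, list_size):
    """R-Levels a writer emits: 0 opens a new list, 1 continues the current one."""
    return [0 if i % list_size == 0 else 1 for i in range(num_lists * list_size)]


def split_fixed(values, list_size):
    """Rebuild the lists from flat values without consulting R-Levels."""
    if len(values) % list_size != 0:
        raise ValueError("value count is not a multiple of the fixed list size")
    return [values[i:i + list_size] for i in range(0, len(values), list_size)]


levels = repetition_levels(num_lists=2, list_size=3)
lists = split_fixed([1, 2, 3, 4, 5, 6], list_size=3)
```

Here `levels` carries no information beyond `list_size`, which is why skipping it on the read path is safe under the stated assumptions.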
> > >>>
> > > >>> If we want to avoid writing R-Levels altogether, a logical type
> > > >>> doesn't cut it. It needs to be something different. E.g., in the
> > > >>> schema, we could store an optional `count` for repeated fields.
> > > >>> Whenever this count is present, we would not write R-Levels for this
> > > >>> field (or more precisely, this field would not take part in the
> > > >>> R-Level computation, as if it wasn't a repeated field). This of
> > > >>> course is a more intrusive change, as legacy clients couldn't read
> > > >>> such columns anymore.
> > >>>
> > > >>> I don't know which of the two alternatives is better. I agree with
> > > >>> Gang that we should probably discuss this in a PR.
> > >>>
> > >>> Cheers,
> > >>> Jan
> > >>>
> > >>>
> > > >>> On Wed, May 15, 2024 at 14:03, Gang Wu <[email protected]> wrote:
> > >>>
> > >>>> Hi Rok,
> > >>>>
> > >>>> Happy to see you here :)
> > >>>>
> > >>>> According to my past experience, it would be more helpful to open
> > >>>> a PR against the parquet-format repository and post it here.
> > >>>>
> > >>>> Best,
> > >>>> Gang
> > >>>>
> > > >>>> On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <[email protected]> wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > > >>>>> Arrow recently introduced FixedShapeTensor and VariableShapeTensor
> > > >>>>> canonical extension types [1] that use FixedSizeList and
> > > >>>>> StructArray(List, FixedSizeList) as storage, respectively. These are
> > > >>>>> targeted at machine learning and scientific applications that deal
> > > >>>>> with large datasets and would benefit from using Parquet as on-disk
> > > >>>>> storage.
> > > >>>>>
> > > >>>>> However, FixedSizeList is currently stored as List in Parquet, which
> > > >>>>> adds significant conversion overhead when reading and writing [2]. It
> > > >>>>> would therefore be beneficial to introduce a FIXED_SIZE_LIST logical
> > > >>>>> type.
> > > >>>>>
> > > >>>>> I would like to open a discussion on potentially adding a
> > > >>>>> FIXED_SIZE_LIST type and prepare a proposal if the discussion
> > > >>>>> supports it.
> > >>>>>
> > >>>>>
> > >>>>> Best,
> > >>>>> Rok
> > >>>>>
> > >>>>> [1]
> > >>>>>
> > >>>>
> > >>>
> >
> https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
> > >>>>> [2] https://github.com/apache/arrow/issues/34510
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> > >
> > >
> >
> >
> >
>
