Re: [DISCUSSION] Alternative: introduce VECTOR repetition type

Micah Kornfield Thu, 05 Mar 2026 22:13:11 -0800

> As an alternative, we could perhaps add a new repetition type so that
> the physical type remains the actual child value type.



Just to summarize the two alternatives currently being discussed:

1.  Logical type that annotates FLBA
    - Pros:
       - If readers can properly skip unknown logical types, they won't fail.
    - Cons:
       - Can't take advantage of all encodings for the underlying element
         type (BYTE_STREAM_SPLIT still applies).
       - Can only be used at leaves.

2.  Add a new repetition type (there would then be optional, required,
    repeated, and a new one, "vector", aka fixed size)
    - Pros:
       - Can make use of the element type's native encodings.
    - Cons:
       - Not backwards compatible (most readers would probably fail when
         decoding the footer with an unknown repetition type and, if not
         there, then when trying to reconstruct the schema).
       - Adds complexity (especially if VECTOR types can be intermingled
         with repeated).
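
To make the encoding trade-off between the two options concrete, here is a
small sketch in plain Python (no Parquet library involved). It lays out
FIXED_SIZE_LIST(FLOAT, 3) values the way option 1 would store them, as
FIXED_LEN_BYTE_ARRAY(12), and then applies a toy BYTE_STREAM_SPLIT at both
widths. The helper names (pack_vector, byte_stream_split) are assumptions made
for this illustration only; none of this is code from either draft.

```python
import struct

def pack_vector(values):
    """PLAIN-style bytes for one vector: little-endian IEEE 754 floats, back to back."""
    return struct.pack("<%df" % len(values), *values)

def byte_stream_split(data, width):
    """Toy BYTE_STREAM_SPLIT: stream k collects byte k of every width-byte value."""
    assert len(data) % width == 0
    return [data[k::width] for k in range(width)]

vectors = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
num_values = 3
type_length = num_values * struct.calcsize("<f")  # 3 floats * 4 bytes = 12

page = b"".join(pack_vector(v) for v in vectors)

# Option 1: the encoder sees 2 opaque 12-byte values, so it emits 12 streams.
flba_streams = byte_stream_split(page, type_length)

# Option 2: the encoder sees 6 FLOAT values, so it emits 4 streams,
# one per byte position of a float.
float_streams = byte_stream_split(page, 4)
```

With the 12-byte view, the number of streams grows with the vector length and
each stream only holds one byte per vector. With the 4-byte view, the same byte
position of every float lands in the same stream regardless of vector length.
How much that matters will depend on the data.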

My takes:
1.  I think option 1 is a pragmatic solution for a big gap in Parquet today
for newer AI workloads.  I also think it isn't a good end-state.
2.  Option 2 works around one symptom of a broader problem:
repetition/definition levels do have some benefits but they have a lot of
downsides.  I would suggest that if we are willing to break compatibility,
we should consider moving away from repetition and definition levels, by
keeping nullability bitmaps/list length metadata in separate column
chunks.  I think this approach could be done incrementally by first writing
both forms of metadata (the new one alongside repetition and definition
levels), and then ultimately consolidating on the new approach.  If people
are interested I can write up a more detailed proposal for this and how it
could work.  This would certainly be a very large undertaking though.
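
As a rough illustration of the difference between the two kinds of metadata,
the sketch below shreds the same optional-list column two ways: into
repetition/definition levels, and into a validity bitmap plus per-row list
lengths. The second layout is purely hypothetical (no detailed proposal exists
yet), as are the function names. The levels assume the standard 3-level LIST
structure with a required element, so the maximum definition level is 2.

```python
def to_levels(rows):
    """Shred rows of an optional list<required int> into rep/def levels."""
    rep, dfn, values = [], [], []
    for row in rows:
        if row is None:        # null list: nothing defined
            rep.append(0)
            dfn.append(0)
        elif not row:          # empty list: list defined, no element
            rep.append(0)
            dfn.append(1)
        else:
            for i, v in enumerate(row):
                rep.append(0 if i == 0 else 1)  # 0 starts a new row
                dfn.append(2)                   # element is present
                values.append(v)
    return rep, dfn, values

def to_validity_and_lengths(rows):
    """Hypothetical alternative: one validity bit and one length per row."""
    validity = [row is not None for row in rows]
    lengths = [len(row) if row is not None else 0 for row in rows]
    values = [v for row in rows if row is not None for v in row]
    return validity, lengths, values

rows = [[1, 2, 3], None, [], [4]]
rep_levels, def_levels, level_values = to_levels(rows)
validity, lengths, flat_values = to_validity_and_lengths(rows)
```

Both forms carry the same values; the levels have one entry per element (or per
null/empty list), while the bitmap and lengths have one entry per row. For a
fixed-size vector the lengths would be constant and would not need to be
stored at all.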

Pragmatically, I would be in favor of trying to push through option 1
(FLBA annotation) while trying to close on something that makes sense in
the longer term.


Rahil,

> If
> there is a board or a way to break down the work—perhaps on GitHub or the
> mailing list—I and others could help on some of the off tasks.


We generally use GitHub issues to track work if necessary.  But typically
for new types like this I think people communicate on an ad-hoc basis once
we have some level of consensus on the approach.

Cheers,
Micah




On Wed, Mar 4, 2026 at 7:32 AM Rok Mihevc <[email protected]> wrote:

> Thanks for the proposal Antoine!
> I prefer this suggestion [1] over the current PR [2]. I would be happy to
> drive it to completion if we get consensus that it is preferable.
> (Looking at past comments I get the feeling it might be)
>
> Looking forward to hearing from others on this!
>
> Rok
>
> [1] https://github.com/apache/parquet-format/compare/master...pitrou:vector-repetition
> [2] https://github.com/apache/parquet-format/pull/241
>
> On Wed, Mar 4, 2026 at 4:07 PM Rahil C <[email protected]> wrote:
>
> > Thanks Antoine and Rok for raising the respective spec change prs.
> >
> > I am interested in helping with this initiative specifically for
> > parquet-java, but I'd like to get more clarity from the Parquet community
> > on the implementation steps required once we align on the spec changes? If
> > there is a board or a way to break down the work—perhaps on GitHub or the
> > mailing list—I and others could help on some of the off tasks.
> >
> > Regards,
> > Rahil Chertara
> >
> > On Wed, Mar 4, 2026 at 1:38 AM Antoine Pitrou <[email protected]> wrote:
> >
> > >
> > > Hi,
> > >
> > > As an alternative, we could perhaps add a new repetition type so that
> > > the physical type remains the actual child value type.
> > >
> > > Here is a draft change against the Thrift definitions:
> > >
> > > https://github.com/apache/parquet-format/compare/master...pitrou:vector-repetition
> > >
> > > I will not be able to work on this personally, so if it is deemed
> > > promising, someone else should take it up :-)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 03/03/2026 at 20:57, Antoine Pitrou wrote:
> > > >
> > > > Hello,
> > > >
> > > > The downside with this approach is that the top-level "unit" type is not
> > > > the element type.
> > > >
> > > > For example, if you have a FIXED_SIZE_LIST(FLOAT32, 3), then the
> > > > top-level unit type is FIXED_LEN_BYTE_ARRAY(12). This means that
> > > > specialized encodings such as BYTE_STREAM_SPLIT, DELTA_BINARY_PACKED or
> > > > ALP may either be less efficient (for BYTE_STREAM_SPLIT) or not be
> > > > applicable at all (for the latter two).
> > > >
> > > > I wonder if we can find an approach that doesn't emit repetition levels
> > > > but still allows using efficient encodings for the element type.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On 03/03/2026 at 01:13, Rok Mihevc wrote:
> > > >> Hi all,
> > > >>
> > > >> I'd like to resurrect this thread in light of recent vectors in Parquet
> > > >> discussion [1].
> > > >> There is a (now updated) proposal PR from when the thread was started that
> > > >> has a nice discussion [2].
> > > >>
> > > >> TLDR of the current proposal:
> > > >> - FIXED_SIZE_LIST annotates a FIXED_LEN_BYTE_ARRAY primitive leaf with
> > > >> FixedSizeListType { type, num_values }.
> > > >> - type must be fixed-width and non-array (INT32, INT64, FLOAT, DOUBLE,
> > > >> FIXED_LEN_BYTE_ARRAY); num_values > 0.
> > > >> - type_length must match num_values encoded with PLAIN representation of
> > > >> type.
> > > >> - If the field is optional, the whole list value may be null; elements are
> > > >> always non-null.
> > > >> - Intentionally not a `LIST` encoding (no def/rep levels).
> > > >> - Outer page/column encoding behavior is unchanged (any encoding valid for
> > > >> `FIXED_LEN_BYTE_ARRAY` remains valid).
> > > >>
> > > >> I also added explicit validity requirements: writers must not emit
> > > >> violating metadata, and readers must treat violating metadata as invalid.
> > > >>
> > > >>
> > > >> Rok
> > > >>
> > > >> [1] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
> > > >> [2] https://github.com/apache/parquet-format/pull/241
> > > >>
> > > >> On Thu, May 16, 2024 at 4:34 AM Jan Finis <[email protected]> wrote:
> > > >>
> > > >>> I would love to see this!
> > > >>>
> > > >>> It is an important optimization for vectors, which become more and more
> > > >>> important and ubiquitous for grounding of LLMs.
> > > >>>
> > > >>> Note however that the logical type route has one drawback: A logical type
> > > >>> may not change the physical representation of values! Thus, if we make
> > > >>> FIXED_SIZE_LIST just a logical type, we would still need to write R-Levels,
> > > >>> as even clients not knowing this logical type need to be able to decode the
> > > >>> column. We could avoid reading the R-Levels and just assume that each list
> > > >>> has the fixed size, so the read path would be optimized but the write path
> > > >>> wouldn't.
> > > >>>
> > > >>> If we want to avoid writing R-Levels altogether, a logical type doesn't cut
> > > >>> it. It needs to be something different. E.g., in the schema, we could store
> > > >>> an optional `count` for repeated fields. Whenever this count is present, we
> > > >>> would not write R-Levels for this field (or more precisely, this field
> > > >>> would not take part in the R-Level computation, as if it wasn't a repeated
> > > >>> field). This of course is a more intrusive change, as legacy clients
> > > >>> couldn't read such columns anymore.
> > > >>>
> > > >>> I don't know which of the two alternatives is better. I agree with Gang
> > > >>> that we should probably discuss this in a PR.
> > > >>>
> > > >>> Cheers,
> > > >>> Jan
> > > >>>
> > > >>>
> > > >>> On Wed, May 15, 2024 at 14:03, Gang Wu <[email protected]> wrote:
> > > >>>
> > > >>>> Hi Rok,
> > > >>>>
> > > >>>> Happy to see you here :)
> > > >>>>
> > > >>>> According to my past experience, it would be more helpful to open
> > > >>>> a PR against the parquet-format repository and post it here.
> > > >>>>
> > > >>>> Best,
> > > >>>> Gang
> > > >>>>
> > > >>>> On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <[email protected]> wrote:
> > > >>>>
> > > >>>>> Hi all,
> > > >>>>>
> > > >>>>> Arrow recently introduced FixedShapeTensor and VariableShapeTensor
> > > >>>>> canonical extension types [1] that use FixedSizeList and StructArray(List,
> > > >>>>> FixedSizeList) as storage respectively. These are targeted at machine
> > > >>>>> learning and scientific applications that deal with large datasets and
> > > >>>>> would benefit from using Parquet as on disk storage.
> > > >>>>>
> > > >>>>> However currently FixedSizeList is stored as List in Parquet which adds
> > > >>>>> significant conversion overhead when reading and writing [2]. It would
> > > >>>>> therefore be beneficial to introduce a FIXED_SIZE_LIST logical type.
> > > >>>>>
> > > >>>>> I would like to open a discussion on potentially adding FIXED_SIZE_LIST
> > > >>>>> type and prepare a proposal if discussion supports it.
> > > >>>>>
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Rok
> > > >>>>>
> > > >>>>> [1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
> > > >>>>> [2] https://github.com/apache/arrow/issues/34510
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
>
