Hello Micah,

Let's call your proposal "option 3". The problem is that while option 2 should be reasonably easy to implement, option 3 would be a massive undertaking (perhaps worse than the Flatbuffers adventure), with open questions about semantic equivalence between the old and new representations of optionality/repetition.

Let's also think about the compatibility issue. What is the main use case that people have in mind in this discussion? Would the FIXED_SIZE_LIST column(s) be an ancillary part of the data that can usefully be ignored, or would they be the *entire* data? For ML models at least, it seems that ignoring the FIXED_SIZE_LIST columns would not make much sense, because the file would not contain anything else of value. Am I missing something?

Regards

Antoine.


On 06/03/2026 at 07:12, Micah Kornfield wrote:

  As an alternative, we could perhaps add a new repetition type so that
the physical type remains the actual child value type.

Just to summarize the two alternatives currently being discussed:

1.  Logical type that annotates FIXED_LEN_BYTE_ARRAY
    - Pros:
       - If readers can properly skip unknown logical types, they won't fail.
    - Cons:
       - Can't take advantage of all encodings for the underlying value
         type (though BYTE_STREAM_SPLIT still applies to
         FIXED_LEN_BYTE_ARRAY).
       - Can only be used at leaf columns.

2.  Add a new repetition type (so there would be optional, required,
    repeated and a new one, "vector", aka fixed size).
    - Pros:
       - Can make use of the native encodings of the element type.
    - Cons:
       - Not backwards compatible (most readers would probably fail when
         decoding the footer with an unknown repetition type, and if not
         there, then when trying to reconstruct the schema).
       - Adds complexity (especially if VECTOR fields can be intermingled
         with repeated ones).
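To make the physical layout of option 1 concrete, here is a minimal sketch, assuming a 4-element float32 vector stored as a 16-byte FIXED_LEN_BYTE_ARRAY. The function names and the VECTOR_SIZE constant are purely illustrative, not from any Parquet implementation or proposal:

```python
import struct

# Option 1 sketch: a fixed-size list of 4 float32 values is stored
# physically as a single 16-byte FIXED_LEN_BYTE_ARRAY value, with a
# (hypothetical) logical type annotation carrying the element type and
# list size. Readers that skip unknown logical types would still see
# plain 16-byte binary values.

VECTOR_SIZE = 4  # elements per vector (illustrative choice)

def pack_vector(values):
    """Pack a fixed-size float32 vector into its FLBA byte form."""
    assert len(values) == VECTOR_SIZE
    return struct.pack("<%df" % VECTOR_SIZE, *values)

def unpack_vector(data):
    """Recover the float32 vector from its FLBA byte form."""
    return list(struct.unpack("<%df" % VECTOR_SIZE, data))

flba = pack_vector([1.0, 2.0, 3.0, 4.0])
assert len(flba) == 16  # i.e. fixed_len_byte_array(16)
assert unpack_vector(flba) == [1.0, 2.0, 3.0, 4.0]
```

This also illustrates the encoding con above: the column's physical type is opaque 16-byte binary, so float-specific encodings don't apply to the individual elements, although BYTE_STREAM_SPLIT can still operate on the fixed-width bytes.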

My takes:
1.  I think option 1 is a pragmatic solution for a big gap in Parquet today
for newer AI workloads.  I also think it isn't a good end-state.
2.  Option 2 is a work-around for a symptom of a broader issue:
repetition/definition levels do have some benefits, but they also have a
lot of down-sides.  I would suggest that if we are willing to break
compatibility, we should consider moving away from repetition and
definition levels by keeping nullability bitmaps/list-length metadata in
separate column chunks.  I think this approach could be done incrementally
by first writing both representations (repetition and definition levels
alongside the new metadata), and then ultimately consolidating on the new
approach.  If people are interested I can write up a more detailed
proposal for how this could work.  This would certainly be a very large
undertaking, though.
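To sketch what "nullability bitmaps/list-length metadata in separate column chunks" could mean for a simple nullable list-of-int column, here is a small illustration (similar in spirit to Arrow's layout). Everything here is an assumption for illustration only, not an actual format proposal:

```python
# Hypothetical alternative to repetition/definition levels: store a
# validity flag and a per-list length as separate flat streams,
# alongside the flattened element values.

def encode(lists):
    """Split a column of nullable lists into three flat streams."""
    validity = []   # one flag per top-level entry (1 = non-null)
    lengths = []    # one length per top-level entry (0 for nulls)
    values = []     # flattened element values of non-null lists
    for entry in lists:
        if entry is None:
            validity.append(0)
            lengths.append(0)
        else:
            validity.append(1)
            lengths.append(len(entry))
            values.extend(entry)
    return validity, lengths, values

def decode(validity, lengths, values):
    """Reassemble the nullable lists from the three streams."""
    out, pos = [], 0
    for valid, n in zip(validity, lengths):
        if not valid:
            out.append(None)
        else:
            out.append(values[pos:pos + n])
            pos += n
    return out

data = [[1, 2], None, [], [3]]
streams = encode(data)
assert streams == ([1, 0, 1, 1], [2, 0, 0, 1], [1, 2, 3])
assert decode(*streams) == data
```

The incremental path suggested above would amount to writing streams like these alongside the existing repetition/definition levels at first, so old readers keep working while new readers use the new metadata.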
