Hello Micah,
Let's call your proposal "option 3". The problem is that while option 2
should be reasonably easy to implement, option 3 would be a massive
undertaking (perhaps worse than the Flatbuffers adventure), with open
questions about semantic equivalence between the old and new
representations of optionality/repetition.
Let's also think about the compatibility issue. What is the main use
case that people have in mind in this discussion? Would the
FIXED_SIZE_LIST column(s) be an ancillary part of the data that can
usefully be ignored, or would they be the *entire* data? For ML models
at least, ignoring the FIXED_SIZE_LIST columns would not make much
sense, because the file would not contain anything else of value.
Am I missing something?
Regards
Antoine.
On 06/03/2026 at 07:12, Micah Kornfield wrote:
As an alternative, we could perhaps add a new repetition type so that
the physical type remains the actual child value type.
Just to summarize the two alternatives currently being discussed:
1. Logical type that annotates FLBA
- Pros:
- If readers can properly skip unknown logical types, they won't fail.
- Cons:
- Can't take advantage of all encodings for the underlying type
(although BYTE_STREAM_SPLIT does apply to FLBA)
- Can only be used at leaves
2. Add a new repetition type (so that alongside optional, required and
repeated there would be a new one, "vector", a.k.a. fixed-size).
- Pros:
- Can make use of native encoding types
- Cons:
- Not backwards compatible (most readers would probably fail at
decoding the footer with an unknown repetition type, and if not there,
then when trying to reconstruct the schema).
- Adds complexity (especially if VECTOR types can be intermingled with
repeated).
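For intuition on why option 2 could reuse the child type's native
encodings: a fixed-size list needs no per-row repetition metadata at
all, because with a known width W, element j of row i sits at offset
i*W + j in the flattened child values. A minimal pure-Python sketch of
that index arithmetic (illustrative only, not Parquet code):

```python
# Illustrative sketch: a fixed-size-list column flattened to a plain
# child-value array. With a fixed width W, no repetition levels are
# needed -- element (row, j) is found by pure index arithmetic, and the
# flat child array can be encoded with the child type's native encodings.
W = 3  # fixed list width, e.g. a 3-d embedding (hypothetical example)

rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Writer side: flatten to one contiguous child array.
flat = [v for row in rows for v in row]

# Reader side: reconstruct any element without per-row metadata.
def get(flat, i, j, width=W):
    return flat[i * width + j]

assert get(flat, 1, 2) == 6.0
```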
My takes:
1. I think option 1 is a pragmatic solution for a big gap in Parquet today
for newer AI workloads. I also think it isn't a good end-state.
2. Option 2 works around a symptom of a broader problem:
repetition/definition levels do have some benefits, but they have a lot
of downsides. I would suggest that if we are willing to break
compatibility, we should consider moving away from repetition and
definition levels entirely, by keeping nullability bitmaps and list
length metadata in separate column chunks. I think this approach could
be done incrementally by first writing both forms of metadata
(repetition/definition levels alongside the new bitmaps/lengths), and
then ultimately consolidating on the new approach. If people are
interested I can write up a more detailed proposal for how this could
work.
This would certainly be a very large undertaking though.
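To make the "separate validity/length metadata" idea concrete, here is
a rough pure-Python sketch of the buffer layout being suggested for an
optional list column, with a nullability bitmap, per-row lengths, and
child values kept as three separate (conceptual) chunks instead of
interleaved rep/def levels. All names here are hypothetical
illustrations, not a proposed API:

```python
# Hypothetical sketch of the alternative layout: validity bitmap +
# list lengths + child values as three separate buffers, rather than
# interleaved repetition/definition levels.
rows = [[1, 2], None, [], [3]]  # an optional list<int> column

validity = [row is not None for row in rows]         # nullability bitmap
lengths = [len(row) if row else 0 for row in rows]   # list lengths
values = [v for row in rows if row for v in row]     # flattened children

# Reader side: reconstruct the rows from the three buffers.
out, pos = [], 0
for valid, n in zip(validity, lengths):
    if not valid:
        out.append(None)
    else:
        out.append(values[pos:pos + n])
        pos += n

assert out == rows
```

This is essentially the Arrow-style representation; the incremental
migration described above would amount to writing these buffers
alongside the existing rep/def levels until readers catch up.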