Option 1 - current FIXED_SIZE_LIST proposal
Option 2 - introduce VECTOR repetition type
Option 3 - move nullability into a new column, move towards gradually removing definition/repetition levels

---
Hi all,

great to see new ideas and appetite to support vector-like data.

How much extra work do we expect Option 2 to be compared to Option 1? I
suppose that at a minimum readers would have to be aware of the new
repetition type so they can safely ignore it, which IIUC would be most of
the extra work compared to Option 1. (All implementations would obviously
need more changes to read/write the new types, but those changes would be
of similar magnitude.) Am I missing something?

Options 1 and 2 are not blockers for Option 3. If we come to the conclusion
that Option 2 is not feasible now, we are then picking between two
long-term efforts: Option 2 vs Option 3. So if we don't agree that Option 2
is a good idea now, I would propose we start on Option 1 and begin a
separate discussion on Option 2 vs Option 3.

Rok

On Fri, Mar 6, 2026 at 9:27 AM Antoine Pitrou wrote:
>
> Hello Micah,
>
> Let's call your proposal "option 3"? The problem is that while option 2
> should be reasonably easy to implement, option 3 would be a massive
> undertaking (perhaps worse than the Flatbuffers adventure) with open
> questions about semantic equivalence between the old and new
> representations of optionality/repetition.
>
> Let's also think about the compatibility issue. What is the main use
> case that people have in mind in this discussion? Would the
> FIXED_SIZE_LIST column(s) be an ancillary part of the data that can
> usefully be ignored, or would they be the *entire* data? For ML models at
> least it seems that ignoring the FIXED_SIZE_LIST columns would not make
> much sense, because the file would not contain anything else of value.
> Am I missing something?
>
> Regards
>
> Antoine.
>
>
> On 06/03/2026 at 07:12, Micah Kornfield wrote:
> >>
> >> As an alternative, we could perhaps add a new repetition type so that
> >> the physical type remains the actual child value type.
> >
> > Just to summarize the two alternatives currently being discussed:
> >
> > 1. Logical type that annotates FLBA
> >    - Pros:
> >      - If readers can properly skip unknown logical types, they won't
> >        fail.
> >    - Cons:
> >      - can't take advantage of all encodings for the underlying type
> >        (BYTE_STREAM_SPLIT applies)
> >      - Can only be used at leaves
> >
> > 2. Add a new repetition type (now there will be optional, required,
> >    repeated and a new one "vector" (aka fixed size)).
> >    - Pros:
> >      - Can make use of native encoding types
> >    - Cons:
> >      - Not backwards compatible (most readers would probably fail at
> >        decoding the footer with an unknown repetition type, and if not
> >        there, then when trying to reconstruct the schema).
> >      - Adds complexity (especially if VECTOR types can be intermingled
> >        with repeated).
> >
> > My takes:
> > 1. I think option 1 is a pragmatic solution for a big gap in Parquet
> >    today for newer AI workloads. I also think it isn't a good end-state.
> > 2. Option 2 is suggesting a workaround for a broader symptom:
> >    repetition/definition levels do have some benefits but they have a
> >    lot of downsides. I would suggest that if we are willing to break
> >    compatibility, we should consider moving away from repetition and
> >    definition levels, by keeping nullability bitmaps/list length
> >    metadata in separate column chunks. I think this approach could be
> >    done incrementally by first writing both pieces of metadata
> >    (repetition and definition levels), and then ultimately consolidating
> >    on the new approach. If people are interested I can write up a more
> >    detailed proposal for this and how this could work. This would
> >    certainly be a very large undertaking though.
> >
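
For illustration only, here is a rough Python sketch of the difference between options 1 and 2 as summarized above. It is not Parquet code and not taken from either proposal: DIM, pack_vector and unpack_vector are made-up names, and the little-endian float32 layout inside the byte string is an assumption. Under option 1 each row becomes one opaque fixed-length byte value; under option 2 the float values themselves would remain the physical values, which is why native encodings could still apply to them.

```python
import struct

DIM = 4  # hypothetical vector length, chosen only for this example


def pack_vector(values):
    """Option 1 style: pack DIM float32 values into one fixed-length byte string.

    The little-endian float32 layout is an assumption made for this sketch.
    """
    if len(values) != DIM:
        raise ValueError(f"expected {DIM} values, got {len(values)}")
    return struct.pack(f"<{DIM}f", *values)


def unpack_vector(blob):
    """Inverse of pack_vector: recover the DIM float32 values from the bytes."""
    return list(struct.unpack(f"<{DIM}f", blob))


rows = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.125, 0.0]]

# Option 1: one opaque 16-byte value per row; the element type is hidden
# inside the byte string.
flba_values = [pack_vector(row) for row in rows]

# Option 2 (conceptually): the child float values stay the physical values,
# and the fixed length DIM would be carried by the schema instead.
flat_values = [v for row in rows for v in row]
```

A reader that understands the layout can call unpack_vector on each value to get the floats back; a reader that does not would only see 16-byte blobs in the option 1 case.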
