Thanks for the proposal, Antoine! I prefer this suggestion [1] over the current PR [2]. I would be happy to drive it to completion if we get consensus that it is preferable. (Looking at past comments, I get the feeling it might be.)
Looking forward to hearing from others on this!

Rok

[1] https://github.com/apache/parquet-format/compare/master...pitrou:vector-repetition
[2] https://github.com/apache/parquet-format/pull/241

On Wed, Mar 4, 2026 at 4:07 PM Rahil C <[email protected]> wrote:

> Thanks Antoine and Rok for raising the respective spec change PRs.
>
> I am interested in helping with this initiative specifically for
> parquet-java, but I'd like to get more clarity from the Parquet community
> on the implementation steps required once we align on the spec changes. If
> there is a board or a way to break down the work, perhaps on GitHub or the
> mailing list, I and others could help on some of the tasks.
>
> Regards,
> Rahil Chertara
>
> On Wed, Mar 4, 2026 at 1:38 AM Antoine Pitrou <[email protected]> wrote:
>
> > Hi,
> >
> > As an alternative, we could perhaps add a new repetition type so that
> > the physical type remains the actual child value type.
> >
> > Here is a draft change against the Thrift definitions:
> >
> > https://github.com/apache/parquet-format/compare/master...pitrou:vector-repetition
> >
> > I will not be able to work on this personally, so if it is deemed
> > promising, someone else should take it up :-)
> >
> > Regards
> >
> > Antoine.
> >
> > On 03/03/2026 at 20:57, Antoine Pitrou wrote:
> > >
> > > Hello,
> > >
> > > The downside with this approach is that the top-level "unit" type is not
> > > the element type.
> > >
> > > For example, if you have a FIXED_SIZE_LIST(FLOAT32, 3), then the
> > > top-level unit type is FIXED_LEN_BYTE_ARRAY(12). This means that
> > > specialized encodings such as BYTE_STREAM_SPLIT, DELTA_BINARY_PACKED or
> > > ALP may either be less efficient (for BYTE_STREAM_SPLIT) or not be
> > > applicable at all (for the latter two).
> > >
> > > I wonder if we can find an approach that doesn't emit repetition levels
> > > but still allows using efficient encodings for the element type.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On 03/03/2026 at 01:13, Rok Mihevc wrote:
> > >> Hi all,
> > >>
> > >> I'd like to resurrect this thread in light of the recent vectors in
> > >> Parquet discussion [1].
> > >> There is a (now updated) proposal PR from when the thread was started
> > >> that has a nice discussion [2].
> > >>
> > >> TLDR of the current proposal:
> > >> - FIXED_SIZE_LIST annotates a FIXED_LEN_BYTE_ARRAY primitive leaf with
> > >>   FixedSizeListType { type, num_values }.
> > >> - type must be fixed-width and non-array (INT32, INT64, FLOAT, DOUBLE,
> > >>   FIXED_LEN_BYTE_ARRAY); num_values > 0.
> > >> - type_length must match num_values encoded with PLAIN representation
> > >>   of type.
> > >> - If the field is optional, the whole list value may be null; elements
> > >>   are always non-null.
> > >> - Intentionally not a `LIST` encoding (no def/rep levels).
> > >> - Outer page/column encoding behavior is unchanged (any encoding valid
> > >>   for `FIXED_LEN_BYTE_ARRAY` remains valid).
> > >>
> > >> I also added explicit validity requirements: writers must not emit
> > >> violating metadata, and readers must treat violating metadata as
> > >> invalid.
> > >>
> > >> Rok
> > >>
> > >> [1] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
> > >> [2] https://github.com/apache/parquet-format/pull/241
> > >>
> > >> On Thu, May 16, 2024 at 4:34 AM Jan Finis <[email protected]> wrote:
> > >>
> > >>> I would love to see this!
> > >>>
> > >>> It is an important optimization for vectors, which become more and
> > >>> more important and ubiquitous for grounding of LLMs.
> > >>>
> > >>> Note however that the logical type route has one drawback: A logical
> > >>> type may not change the physical representation of values! Thus, if we
> > >>> make FIXED_SIZE_LIST just a logical type, we would still need to write
> > >>> R-Levels, as even clients not knowing this logical type need to be
> > >>> able to decode the column.
> > >>> We could avoid reading the R-Levels and just assume that each list
> > >>> has the fixed size, so the read path would be optimized but the write
> > >>> path wouldn't.
> > >>>
> > >>> If we want to avoid writing R-Levels altogether, a logical type
> > >>> doesn't cut it. It needs to be something different. E.g., in the
> > >>> schema, we could store an optional `count` for repeated fields.
> > >>> Whenever this count is present, we would not write R-Levels for this
> > >>> field (or more precisely, this field would not take part in the
> > >>> R-Level computation, as if it wasn't a repeated field). This of course
> > >>> is a more intrusive change, as legacy clients couldn't read such
> > >>> columns anymore.
> > >>>
> > >>> I don't know which of the two alternatives is better. I agree with
> > >>> Gang that we should probably discuss this in a PR.
> > >>>
> > >>> Cheers,
> > >>> Jan
> > >>>
> > >>> On Wed, May 15, 2024 at 14:03, Gang Wu <[email protected]> wrote:
> > >>>
> > >>>> Hi Rok,
> > >>>>
> > >>>> Happy to see you here :)
> > >>>>
> > >>>> According to my past experience, it would be more helpful to open
> > >>>> a PR against the parquet-format repository and post it here.
> > >>>>
> > >>>> Best,
> > >>>> Gang
> > >>>>
> > >>>> On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <[email protected]> wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> Arrow recently introduced FixedShapeTensor and VariableShapeTensor
> > >>>>> canonical extension types [1] that use FixedSizeList and
> > >>>>> StructArray(List, FixedSizeList) as storage respectively. These are
> > >>>>> targeted at machine learning and scientific applications that deal
> > >>>>> with large datasets and would benefit from using Parquet as on-disk
> > >>>>> storage.
> > >>>>>
> > >>>>> However, currently FixedSizeList is stored as List in Parquet, which
> > >>>>> adds significant conversion overhead when reading and writing [2].
> > >>>>> It would therefore be beneficial to introduce a FIXED_SIZE_LIST
> > >>>>> logical type.
> > >>>>>
> > >>>>> I would like to open a discussion on potentially adding
> > >>>>> FIXED_SIZE_LIST type and prepare a proposal if discussion supports it.
> > >>>>>
> > >>>>> Best,
> > >>>>> Rok
> > >>>>>
> > >>>>> [1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
> > >>>>> [2] https://github.com/apache/arrow/issues/34510
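
For anyone following along, the length rule from the TLDR of the current proposal (type_length must match num_values values in the PLAIN representation of type) and the FIXED_SIZE_LIST(FLOAT32, 3) example from the thread can be sketched roughly as follows. This is only an illustration under assumptions: the helper names (`expected_type_length`, `validate`, `pack_list`) are hypothetical and not part of parquet-format or any Parquet library, FLOAT32 is mapped to Parquet's FLOAT, and FIXED_LEN_BYTE_ARRAY elements are left out because their width depends on their own type_length.

```python
import struct

# PLAIN-encoded width in bytes of the fixed-width element types listed in the
# proposal. FIXED_LEN_BYTE_ARRAY is omitted (assumption: handled separately).
PLAIN_WIDTH = {"INT32": 4, "INT64": 8, "FLOAT": 4, "DOUBLE": 8}
STRUCT_CODE = {"INT32": "i", "INT64": "q", "FLOAT": "f", "DOUBLE": "d"}


def expected_type_length(element_type: str, num_values: int) -> int:
    """Byte length of the enclosing FIXED_LEN_BYTE_ARRAY (hypothetical helper)."""
    if element_type not in PLAIN_WIDTH:
        raise ValueError(f"unsupported element type: {element_type}")
    if num_values <= 0:
        raise ValueError("num_values must be > 0")
    return num_values * PLAIN_WIDTH[element_type]


def validate(element_type: str, num_values: int, type_length: int) -> None:
    """Reader-side check: treat metadata that breaks the length rule as invalid."""
    if type_length != expected_type_length(element_type, num_values):
        raise ValueError("invalid FIXED_SIZE_LIST metadata")


def pack_list(element_type: str, values) -> bytes:
    """Pack one list value as little-endian PLAIN bytes, back to back."""
    return struct.pack(f"<{len(values)}{STRUCT_CODE[element_type]}", *values)


if __name__ == "__main__":
    vec = [1.0, 2.0, 3.0]
    length = expected_type_length("FLOAT", len(vec))
    validate("FLOAT", len(vec), length)
    print(length, len(pack_list("FLOAT", vec)))
```

With three FLOAT elements both numbers come out as 12, matching the FIXED_LEN_BYTE_ARRAY(12) unit type mentioned above; the column encoder then only sees opaque 12-byte values, which is the limitation raised for DELTA_BINARY_PACKED and ALP.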
