I would love to see this! It is an important optimization for vectors, which are becoming increasingly important and ubiquitous for grounding LLMs.
Note, however, that the logical type route has one drawback: a logical type may not change the physical representation of values. Thus, if we make FIXED_SIZE_LIST just a logical type, we would still need to write R-Levels, as even clients that don't know this logical type need to be able to decode the column. Readers could skip the R-Levels and just assume that each list has the fixed size, so the read path would be optimized but the write path wouldn't.

If we want to avoid writing R-Levels altogether, a logical type doesn't cut it; it needs to be something different. E.g., in the schema, we could store an optional `count` for repeated fields. Whenever this count is present, we would not write R-Levels for this field (or, more precisely, this field would not take part in the R-Level computation, as if it weren't a repeated field). This, of course, is a more intrusive change, as legacy clients could no longer read such columns.

I don't know which of the two alternatives is better. I agree with Gang that we should probably discuss this in a PR.

Cheers,
Jan

On Wed, May 15, 2024 at 14:03, Gang Wu <ust...@gmail.com> wrote:

> Hi Rok,
>
> Happy to see you here :)
>
> According to my past experience, it would be more helpful to open
> a PR against the parquet-format repository and post it here.
>
> Best,
> Gang
>
> On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> > Hi all,
> >
> > Arrow recently introduced FixedShapeTensor and VariableShapeTensor
> > canonical extension types [1] that use FixedSizeList and StructArray(List,
> > FixedSizeList) as storage, respectively. These are targeted at machine
> > learning and scientific applications that deal with large datasets and
> > would benefit from using Parquet as on-disk storage.
> >
> > However, FixedSizeList is currently stored as List in Parquet, which adds
> > significant conversion overhead when reading and writing [2]. It would
> > therefore be beneficial to introduce a FIXED_SIZE_LIST logical type.
> >
> > I would like to open a discussion on potentially adding a FIXED_SIZE_LIST
> > type and prepare a proposal if the discussion supports it.
> >
> > Best,
> > Rok
> >
> > [1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
> > [2] https://github.com/apache/arrow/issues/34510
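
To make the R-Level point in the reply above concrete, here is a minimal Python sketch (not parquet-format or Arrow code). It assumes a simplified model: a required list column with non-null elements, so the maximum R-Level is 1 and definition levels are ignored. The function names `repetition_levels` and `fixed_size_repetition_levels` are hypothetical and chosen only for illustration. It shows that, for lists of a known fixed size, the R-Levels a writer would emit carry no information a reader could not derive from the row count and the size alone.

```python
def repetition_levels(lists):
    """R-Levels a writer would emit in the simplified model:
    0 starts a new list, 1 continues the current one."""
    levels = []
    for values in lists:
        for i in range(len(values)):
            levels.append(0 if i == 0 else 1)
    return levels


def fixed_size_repetition_levels(num_lists, count):
    """R-Levels derived from the fixed size alone (hypothetical `count`),
    without reading anything from the column."""
    return [0 if i % count == 0 else 1 for i in range(num_lists * count)]


# Two 3-dimensional vectors stored as fixed-size lists.
vectors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

written = repetition_levels(vectors)
derived = fixed_size_repetition_levels(len(vectors), count=3)

print(written)             # [0, 1, 1, 0, 1, 1]
print(written == derived)  # True
```

Under the logical-type route, the `written` levels would still have to be stored so that legacy readers can decode the column, and only newer readers could substitute the `derived` ones. Under the schema-level `count` idea, they would not be stored at all.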