I would love to see this!

It is an important optimization for vectors, which are becoming increasingly
important and ubiquitous for grounding LLMs.

Note, however, that the logical type route has one drawback: a logical type
may not change the physical representation of values! Thus, if we make
FIXED_SIZE_LIST just a logical type, we would still need to write R-Levels
(repetition levels), as even clients that don't know this logical type need
to be able to decode the column. A reader that does know it could skip the
R-Levels and just assume that each list has the fixed size, so the read path
would be optimized but the write path wouldn't.
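
To make the asymmetry concrete, here is a minimal sketch in Python. It is
not Parquet library code; all function names are made up for illustration,
and it assumes a single-level list of required, non-empty lists with
required elements (so definition levels are ignored). The writer still has
to emit R-Levels, but a reader that knows the fixed size can rebuild the
lists from the flat values alone.

```python
def repetition_levels(lists):
    """R-Levels for one level of nesting: 0 starts a new list, 1 continues it."""
    levels = []
    for row in lists:
        for i in range(len(row)):
            levels.append(0 if i == 0 else 1)
    return levels


def assemble_with_levels(values, levels):
    """What a legacy reader does: rebuild the lists by walking the R-Levels."""
    rows = []
    for value, level in zip(values, levels):
        if level == 0:
            rows.append([value])
        else:
            rows[-1].append(value)
    return rows


def assemble_fixed_size(values, size):
    """What an aware reader could do: ignore R-Levels and slice by the known size."""
    if size <= 0 or len(values) % size != 0:
        raise ValueError("value count is not a multiple of the fixed list size")
    return [values[i:i + size] for i in range(0, len(values), size)]


vectors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flat = [x for row in vectors for x in row]

levels = repetition_levels(vectors)            # still written to the file
legacy = assemble_with_levels(flat, levels)    # legacy read path
fast = assemble_fixed_size(flat, 3)            # optimized read path
```

Both read paths yield the same lists; only the second one avoids touching
the R-Levels.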

If we want to avoid writing R-Levels altogether, a logical type doesn't cut
it. It needs to be something different. E.g., in the schema, we could store
an optional `count` for repeated fields. Whenever this count is present, we
would not write R-Levels for this field (or more precisely, this field
would not take part in the R-Level computation, as if it weren't a repeated
field). This, of course, is a more intrusive change, as legacy clients
could no longer read such columns.
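
As a rough sketch of what "not taking part in the R-Level computation"
could mean (Python, purely illustrative; the `Field` class and its `count`
attribute are hypothetical and not part of the current Parquet format): the
maximum repetition level of a column is the number of repeated fields on
its schema path, and a field with a `count` would simply be left out of
that sum.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    name: str
    repetition: str               # "required", "optional" or "repeated"
    count: Optional[int] = None   # hypothetical: fixed element count, if any


def max_repetition_level(path: List[Field]) -> int:
    """Count the repeated fields on the path, skipping those with a fixed count."""
    return sum(
        1 for f in path
        if f.repetition == "repeated" and f.count is None
    )


plain_list = [Field("embedding", "repeated")]
fixed_list = [Field("embedding", "repeated", count=768)]
nested = [Field("docs", "repeated"), Field("embedding", "repeated", count=4)]

plain_level = max_repetition_level(plain_list)
fixed_level = max_repetition_level(fixed_list)
nested_level = max_repetition_level(nested)
```

With a maximum repetition level of 0, no R-Levels need to be written for
the fixed-size column at all, which is exactly why a legacy client that
ignores `count` could not decode it.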

I don't know which of the two alternatives is better. I agree with Gang
that we should probably discuss this in a PR.

Cheers,
Jan


On Wed, May 15, 2024 at 2:03 PM Gang Wu <ust...@gmail.com> wrote:

> Hi Rok,
>
> Happy to see you here :)
>
> In my experience, it would be more helpful to open
> a PR against the parquet-format repository and post it here.
>
> Best,
> Gang
>
> On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> > Hi all,
> >
> > Arrow recently introduced FixedShapeTensor and VariableShapeTensor
> > canonical extension types [1] that use FixedSizeList and
> > StructArray(List, FixedSizeList) as storage, respectively. These are
> > targeted at machine learning and scientific applications that deal with
> > large datasets and would benefit from using Parquet as on-disk storage.
> >
> > However, FixedSizeList is currently stored as List in Parquet, which adds
> > significant conversion overhead when reading and writing [2]. It would
> > therefore be beneficial to introduce a FIXED_SIZE_LIST logical type.
> >
> > I would like to open a discussion on potentially adding a FIXED_SIZE_LIST
> > type, and to prepare a proposal if the discussion supports it.
> >
> >
> > Best,
> > Rok
> >
> > [1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
> > [2] https://github.com/apache/arrow/issues/34510
> >
>
