Hi all,

I'd like to resurrect this thread in light of the recent discussion of
vectors in Parquet [1].
The proposal PR opened when this thread started has since been updated and
contains a useful discussion [2].

TLDR of the current proposal:
- FIXED_SIZE_LIST annotates a FIXED_LEN_BYTE_ARRAY primitive leaf with
FixedSizeListType { type, num_values }.
- type must be fixed-width and non-array (INT32, INT64, FLOAT, DOUBLE,
FIXED_LEN_BYTE_ARRAY); num_values > 0.
- type_length must equal the byte size of num_values elements of type in
PLAIN encoding.
- If the field is optional, the whole list value may be null; elements are
always non-null.
- Intentionally not a `LIST` encoding (no def/rep levels).
- Outer page/column encoding behavior is unchanged (any encoding valid for
`FIXED_LEN_BYTE_ARRAY` remains valid).
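
To make the layout concrete, here is a minimal Python sketch of how one list
value might be packed into a single FIXED_LEN_BYTE_ARRAY value, assuming the
elements are laid out back to back in PLAIN (little-endian) encoding. The
names PLAIN_FORMATS, pack_fixed_size_list and unpack_fixed_size_list are
illustrative only and are not part of the proposal or of any library.
FIXED_LEN_BYTE_ARRAY elements are left out because their width depends on a
length that the summary above does not spell out.

```python
import struct

# struct format character and byte width for the PLAIN encoding of each
# fixed-width element type. Illustrative table, not taken from the proposal.
PLAIN_FORMATS = {
    "INT32": ("i", 4),
    "INT64": ("q", 8),
    "FLOAT": ("f", 4),
    "DOUBLE": ("d", 8),
}


def pack_fixed_size_list(element_type, values):
    """Pack one list value into the bytes of a single FLBA value."""
    fmt, _ = PLAIN_FORMATS[element_type]
    # "<" forces little-endian, which is what PLAIN encoding uses.
    return struct.pack(f"<{len(values)}{fmt}", *values)


def unpack_fixed_size_list(element_type, num_values, data):
    """Decode one FLBA value back into a list of num_values elements."""
    fmt, width = PLAIN_FORMATS[element_type]
    if len(data) != num_values * width:
        raise ValueError("byte length does not match num_values * element width")
    return list(struct.unpack(f"<{num_values}{fmt}", data))


vector = [1.0, 2.5, -3.0]
blob = pack_fixed_size_list("FLOAT", vector)
restored = unpack_fixed_size_list("FLOAT", 3, blob)
```

In this sketch a three-element FLOAT vector occupies 12 bytes, so the leaf
would be declared with type_length = 12 and no repetition levels are needed
to recover the element boundaries.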

I also added explicit validity requirements: writers must not emit
violating metadata, and readers must treat violating metadata as invalid.
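
As a rough illustration of those validity rules, a reader-side check could
look like the sketch below. It assumes the constraints exactly as summarized
in the TLDR; the Python class FixedSizeListType, the PLAIN_WIDTHS table and
the function validate_fixed_size_list are hypothetical names for this
example, and FIXED_LEN_BYTE_ARRAY elements are again omitted.

```python
from dataclasses import dataclass

# PLAIN-encoded byte width of each fixed-width element type handled here.
PLAIN_WIDTHS = {"INT32": 4, "INT64": 8, "FLOAT": 4, "DOUBLE": 8}


@dataclass(frozen=True)
class FixedSizeListType:
    """Hypothetical in-memory mirror of the proposed annotation."""
    type: str
    num_values: int


def validate_fixed_size_list(annotation, type_length):
    """Raise ValueError if the metadata violates the summarized rules."""
    if annotation.type not in PLAIN_WIDTHS:
        raise ValueError(f"unsupported element type: {annotation.type}")
    if annotation.num_values <= 0:
        raise ValueError("num_values must be greater than 0")
    expected = annotation.num_values * PLAIN_WIDTHS[annotation.type]
    if type_length != expected:
        raise ValueError(
            f"type_length is {type_length}, expected {expected}"
        )


# A 128-dimensional FLOAT vector needs a 512-byte FIXED_LEN_BYTE_ARRAY.
validate_fixed_size_list(FixedSizeListType("FLOAT", 128), 512)
```

A writer could run the same check before emitting metadata, so both sides
reject the same inputs.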


Rok

[1] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
[2] https://github.com/apache/parquet-format/pull/241

On Thu, May 16, 2024 at 4:34 AM Jan Finis <[email protected]> wrote:

> I would love to see this!
>
> It is an important optimization for vectors, which are becoming
> increasingly important and ubiquitous for grounding LLMs.
>
> Note however that the logical type route has one drawback: A logical type
> may not change the physical representation of values! Thus, if we make
> FIXED_SIZE_LIST just a logical type, we would still need to write R-Levels,
> as even clients not knowing this logical type need to be able to decode the
> column. We could avoid reading the R-Levels and just assume that each list
> has the fixed size, so the read path would be optimized but the write path
> wouldn't.
>
> If we want to avoid writing R-Levels altogether, a logical type doesn't cut
> it. It needs to be something different. E.g., in the schema, we could store
> an optional `count` for repeated fields. Whenever this count is present, we
> would not write R-Levels for this field (or more precisely, this field
> would not take part in the R-Level computation, as if it wasn't a repeated
> field). This of course is a more intrusive change, as legacy clients
> couldn't read such columns anymore.
>
> I don't know which of the two alternatives is better. I agree with Gang
> that we should probably discuss this in a PR.
>
> Cheers,
> Jan
>
>
> On Wed, May 15, 2024 at 2:03 PM Gang Wu <[email protected]> wrote:
>
> > Hi Rok,
> >
> > Happy to see you here :)
> >
> > According to my past experience, it would be more helpful to open
> > a PR against the parquet-format repository and post it here.
> >
> > Best,
> > Gang
> >
> > On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > Arrow recently introduced FixedShapeTensor and VariableShapeTensor
> > > canonical extension types [1] that use FixedSizeList and
> > > StructArray(List, FixedSizeList) as storage, respectively. These are
> > > targeted at machine learning and scientific applications that deal with
> > > large datasets and would benefit from using Parquet as on-disk storage.
> > >
> > > However, FixedSizeList is currently stored as List in Parquet, which
> > > adds significant conversion overhead when reading and writing [2]. It
> > > would therefore be beneficial to introduce a FIXED_SIZE_LIST logical type.
> > >
> > > I would like to open a discussion on potentially adding a
> > > FIXED_SIZE_LIST type and prepare a proposal if the discussion supports it.
> > >
> > >
> > > Best,
> > > Rok
> > >
> > > [1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
> > > [2] https://github.com/apache/arrow/issues/34510
> > >
> >
>
