Hello,

The downside of this approach is that the top-level "unit" type is not the element type.

For example, if you have a FIXED_SIZE_LIST(FLOAT32, 3), then the top-level unit type is FIXED_LEN_BYTE_ARRAY(12). This means that specialized encodings such as BYTE_STREAM_SPLIT, DELTA_BINARY_PACKED or ALP may either be less efficient (for BYTE_STREAM_SPLIT) or not be applicable at all (for the latter two).
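
To make the unit-type point concrete, here is a minimal sketch in plain Python. It assumes the PLAIN layout of FLOAT is 4-byte little-endian IEEE 754 and that BYTE_STREAM_SPLIT produces one stream per byte position of the unit type. The helper name byte_stream_split and the sample vectors are hypothetical, and the sketch only shows the layout difference, not measured compression.

```python
import struct

def byte_stream_split(data: bytes, width: int) -> list[bytes]:
    """Scatter byte k of every `width`-byte value into stream k."""
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of width")
    return [data[k::width] for k in range(width)]

# Two FIXED_SIZE_LIST(FLOAT32, 3) values (illustrative data only).
vectors = [(1.0, 2.0, 3.0), (1.5, 2.5, 3.5)]

# PLAIN layout: little-endian float32 elements stored back to back.
raw = b"".join(struct.pack("<3f", *v) for v in vectors)

# Unit type FIXED_LEN_BYTE_ARRAY(12): one stream per byte of the whole vector.
flba_width = 3 * 4
per_vector_streams = byte_stream_split(raw, flba_width)

# Unit type FLOAT: one stream per byte of a single element.
per_element_streams = byte_stream_split(raw, 4)

print(len(per_vector_streams), len(per_vector_streams[0]))
print(len(per_element_streams), len(per_element_streams[0]))
```

With the 12-byte unit the same bytes are spread over twelve short streams, whereas with the 4-byte element unit all sign/exponent bytes end up in a single stream.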

I wonder if we can find an approach that doesn't emit repetition levels but still allows using efficient encodings for the element type.

Regards

Antoine.


On 03/03/2026 at 01:13, Rok Mihevc wrote:
Hi all,

I'd like to resurrect this thread in light of the recent discussion on
vectors in Parquet [1].
There is a (now updated) proposal PR, opened when this thread was started,
that has a nice discussion [2].

TLDR of the current proposal:
- FIXED_SIZE_LIST annotates a FIXED_LEN_BYTE_ARRAY primitive leaf with
FixedSizeListType { type, num_values }.
- type must be fixed-width and non-array (INT32, INT64, FLOAT, DOUBLE,
FIXED_LEN_BYTE_ARRAY); num_values > 0.
- type_length must equal the byte size of num_values elements of type in
PLAIN encoding.
- If the field is optional, the whole list value may be null; elements are
always non-null.
- Intentionally not a `LIST` encoding (no def/rep levels).
- Outer page/column encoding behavior is unchanged (any encoding valid for
`FIXED_LEN_BYTE_ARRAY` remains valid).

I also added explicit validity requirements: writers must not emit
violating metadata, and readers must treat violating metadata as invalid.
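
The validity rules above could be checked roughly as follows. This is only a sketch under stated assumptions: the function names, the string type names, and the element_length parameter (the width of a FIXED_LEN_BYTE_ARRAY element) are hypothetical and not part of the proposal or of any Parquet library.

```python
# PLAIN-encoded byte widths of the fixed-width primitive element types.
PLAIN_WIDTH = {"INT32": 4, "INT64": 8, "FLOAT": 4, "DOUBLE": 8}

def expected_type_length(element_type, num_values, element_length=None):
    """Return the type_length the outer FIXED_LEN_BYTE_ARRAY must have."""
    if num_values <= 0:
        raise ValueError("num_values must be > 0")
    if element_type == "FIXED_LEN_BYTE_ARRAY":
        # Hypothetical parameter: the element's own fixed byte length.
        if element_length is None or element_length <= 0:
            raise ValueError("FIXED_LEN_BYTE_ARRAY elements need a positive length")
        width = element_length
    elif element_type in PLAIN_WIDTH:
        width = PLAIN_WIDTH[element_type]
    else:
        raise ValueError(f"unsupported element type: {element_type}")
    return num_values * width

def is_valid_fixed_size_list(element_type, num_values, type_length,
                             element_length=None):
    """True if the annotation is consistent with the outer type_length."""
    try:
        expected = expected_type_length(element_type, num_values, element_length)
    except ValueError:
        return False
    return expected == type_length

print(is_valid_fixed_size_list("FLOAT", 3, 12))   # consistent metadata
print(is_valid_fixed_size_list("FLOAT", 3, 16))   # mismatched type_length
```

A writer could run such a check before emitting metadata, and a reader could reject a column for which it returns False.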


Rok

[1] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
[2] https://github.com/apache/parquet-format/pull/241

On Thu, May 16, 2024 at 4:34 AM Jan Finis <[email protected]> wrote:

I would love to see this!

It is an important optimization for vectors, which are becoming increasingly
important and ubiquitous for grounding LLMs.

Note however that the logical type route has one drawback: A logical type
may not change the physical representation of values! Thus, if we make
FIXED_SIZE_LIST just a logical type, we would still need to write R-Levels,
as even clients not knowing this logical type need to be able to decode the
column. We could avoid reading the R-Levels and just assume that each list
has the fixed size, so the read path would be optimized but the write path
wouldn't.
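
To illustrate why a reader could skip the R-Levels, here is a minimal sketch. It assumes a top-level list column (maximum repetition level 1) whose lists are all non-null and have exactly list_size elements. The function names are hypothetical.

```python
def repetition_levels(lists):
    """R-Levels a writer would emit: 0 starts a new list, 1 continues it."""
    levels = []
    for values in lists:
        for i, _ in enumerate(values):
            levels.append(0 if i == 0 else 1)
    return levels

def implied_repetition_levels(num_lists, list_size):
    """R-Levels a reader can reconstruct from the fixed size alone."""
    total = num_lists * list_size
    return [0 if i % list_size == 0 else 1 for i in range(total)]

# Illustrative data: three lists of fixed size 3.
lists = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
written = repetition_levels(lists)
implied = implied_repetition_levels(len(lists), 3)
print(written == implied)
```

Under these assumptions the written levels carry no information beyond the fixed size, which is why the read path can ignore them while the write path still has to produce them.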

If we want to avoid writing R-Levels altogether, a logical type doesn't cut
it. It needs to be something different. E.g., in the schema, we could store
an optional `count` for repeated fields. Whenever this count is present, we
would not write R-Levels for this field (or more precisely, this field
would not take part in the R-Level computation, as if it wasn't a repeated
field). This of course is a more intrusive change, as legacy clients
couldn't read such columns anymore.

I don't know which of the two alternatives is better. I agree with Gang
that we should probably discuss this in a PR.

Cheers,
Jan


On Wed, May 15, 2024 at 14:03, Gang Wu <[email protected]> wrote:

Hi Rok,

Happy to see you here :)

According to my past experience, it would be more helpful to open
a PR against the parquet-format repository and post it here.

Best,
Gang

On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <[email protected]> wrote:

Hi all,

Arrow recently introduced FixedShapeTensor and VariableShapeTensor
canonical extension types [1] that use FixedSizeList and
StructArray(List, FixedSizeList) as storage, respectively. These are
targeted at machine learning and scientific applications that deal with
large datasets and would benefit from using Parquet as on-disk storage.

However, FixedSizeList is currently stored as List in Parquet, which adds
significant conversion overhead when reading and writing [2]. It would
therefore be beneficial to introduce a FIXED_SIZE_LIST logical type.

I would like to open a discussion on potentially adding a FIXED_SIZE_LIST
type and prepare a proposal if the discussion supports it.


Best,
Rok

[1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
[2] https://github.com/apache/arrow/issues/34510





