rok opened a new issue, #855:
URL: https://github.com/apache/arrow-go/issues/855

   ## Problem
   
   Arrow `FixedSizeList<T, N>` is the natural type for fixed-shape data 
(embeddings, image/tensor patches, fixed-precision decimal vectors), where every 
value has exactly `N` elements and `N` is known from the schema. Today pqarrow 
round-trips it through Parquet as a standard 3-level `LIST`, writing 
per-element repetition and definition levels for a length that never varies. 
For wide dense vectors that is pure overhead; apache/arrow#34510 measured a ~3x 
read gap that motivates a denser encoding.
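
To make the overhead concrete, here is a rough sketch that counts logical level entries only. It assumes a non-nullable list with non-nullable elements, where the 3-level `LIST` layout carries one repetition level and one definition level per element. The function names and the sizes in `main` are illustrative, not taken from pqarrow. Levels are RLE/bit-packed on disk, so this is a count of entries to decode, not of bytes.

```go
package main

import "fmt"

// listLevelEntries counts the logical repetition and definition level
// entries for a non-nullable 3-level LIST column: one of each per element.
func listLevelEntries(rows, n int64) int64 {
	return 2 * rows * n
}

// vectorLevelEntries models the proposed VECTOR layout for a dense,
// non-nullable, top-level column. Assumption: no per-element levels at all.
func vectorLevelEntries(rows, n int64) int64 {
	return 0
}

func main() {
	// Illustrative sizes: one million rows of 768-element vectors.
	rows, n := int64(1_000_000), int64(768)
	fmt.Println("LIST level entries:  ", listLevelEntries(rows, n))
	fmt.Println("VECTOR level entries:", vectorLevelEntries(rows, n))
}
```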
   
   ## Proposal
   
   Add an experimental Parquet `VECTOR` `FieldRepetitionType` that stores a 
fixed number of element values per row directly, without per-element rep/def 
levels, and map Arrow `FixedSizeList` onto it. This is the "Option B" design 
from the *Fixed-size list type for Parquet* proposal (and the arrow-cpp 
prototype, rok/arrow#51).
   
   A reduced, **leaf-only** first phase:
   
   - A `VECTOR` column is a single primitive leaf carrying `vector_length`: 
`vector <element-type> <name> [N];` — not a nested group.
   - Only dense, non-nullable, top-level `FixedSizeList` columns with a 
fixed-width primitive element are encoded as `VECTOR`. Everything else 
(nullable list or nullable element, zero-length, 
variable-width/dictionary/extension/struct/nested-list element, or a nested 
`FixedSizeList`) transparently falls back to the standard `LIST` encoding. 
Nullable, struct, and nested vectors are follow-ups.
   - Opt-in on the writer via `pqarrow.WithVectorEncoding()`; reading is 
automatic.
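
The eligibility rules above can be sketched as a single predicate. This is a standalone model, not the pqarrow implementation: `ElemKind`, `FixedSizeListField`, and `useVector` are hypothetical names introduced for illustration, and the real check would inspect `arrow.DataType` values instead.

```go
package main

import "fmt"

// ElemKind is a hypothetical classification of the list element type.
type ElemKind int

const (
	FixedWidthPrimitive ElemKind = iota
	VariableWidth
	Dictionary
	Extension
	Struct
	NestedList
	NestedFixedSizeList
)

// FixedSizeListField is a hypothetical summary of a FixedSizeList column.
type FixedSizeListField struct {
	ListSize     int32 // N in FixedSizeList<T, N>
	Nullable     bool  // the list value itself may be null
	ElemNullable bool  // individual elements may be null
	TopLevel     bool  // not nested inside another type
	Elem         ElemKind
}

// useVector reports whether a column would be written as VECTOR under the
// leaf-only phase; any false result means fallback to the LIST encoding.
func useVector(f FixedSizeListField, optIn bool) bool {
	return optIn &&
		f.TopLevel &&
		!f.Nullable &&
		!f.ElemNullable &&
		f.ListSize > 0 &&
		f.Elem == FixedWidthPrimitive
}

func main() {
	embedding := FixedSizeListField{ListSize: 768, TopLevel: true, Elem: FixedWidthPrimitive}
	nullable := embedding
	nullable.Nullable = true

	fmt.Println(useVector(embedding, true))  // eligible and opted in
	fmt.Println(useVector(nullable, true))   // nullable list: falls back to LIST
	fmt.Println(useVector(embedding, false)) // writer did not opt in
}
```

Keeping the decision in one predicate means every unsupported case falls back to `LIST` the same way, which matches the "transparently falls back" behaviour described above.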
   
   Format additions (not yet in apache/parquet-format): 
`FieldRepetitionType.VECTOR = 3` and `SchemaElement.vector_length` (field id 
12).
   
   ## Caveat
   
   `VECTOR` is not part of apache/parquet-format yet, so this is strictly 
opt-in and non-portable: files written with `VECTOR` are rejected by readers 
that don't understand the repetition type.
   
   ## References
   
   - *Fixed-size list type for Parquet* design proposal
   - apache/arrow#34510 — measured ~3x read gap
   - arrow-cpp Option B prototype: rok/arrow#51
   

