JFinis commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1602675688


##########
src/main/thrift/parquet.thrift:
##########
@@ -282,13 +282,14 @@ struct Statistics {
 }
 
 /** Empty structs to use as logical type annotations */
-struct StringType {}  // allowed for BINARY, must be encoded with UTF-8
-struct UUIDType {}    // allowed for FIXED[16], must encoded raw UUID bytes
-struct MapType {}     // see LogicalTypes.md
-struct ListType {}    // see LogicalTypes.md
-struct EnumType {}    // allowed for BINARY, must be encoded with UTF-8
-struct DateType {}    // allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct StringType {}        // allowed for BINARY, must be encoded with UTF-8
+struct UUIDType {}          // allowed for FIXED[16], must encoded raw UUID 
bytes
+struct MapType {}           // see LogicalTypes.md
+struct ListType {}          // see LogicalTypes.md
+struct EnumType {}          // allowed for BINARY, must be encoded with UTF-8
+struct DateType {}          // allowed for INT32
+struct Float16Type {}       // allowed for FIXED[2], must encoded raw FLOAT16 
bytes
+struct FixedSizeListType {} // see LogicalTypes.md

Review Comment:
   Something is missing here. Shouldn't this type contain the element type? And 
the length of the list? The length of the list could be deduced from the size 
of the underlying `fixed_len_byte_array`, but at least the element type would 
be necessary then.



##########
LogicalTypes.md:
##########
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST

Review Comment:
   Interesting choice to annotate a binary primitive field instead of a 
repeated group field. I see pros and cons with this design:
   
   PROs:
   * Guarantees zero-copy, as the layout is defined to be just bytes. In 
contrast, would this annotate a group, a writer could decide to use a fancy 
per-value encoding (e.g., dictionary) and thus create a list that first has to 
be "decoded" before it can be used.
   * Guarantees that a list is always contained on one page instead of being 
split over multiple pages. Again, this helps in keeping decoders easy and 
guaranteeing zero copy.
   * This solves the problem of redundant R-Levels. Since it's just a primitive 
column, no r-level considerations have to be taken into account.
   
   CONs:
   * Cannot create fixed size lists of nested types (e.g., list of structs). I 
see that this isn't necessary for tensors or embedding vectors, but shouldn't 
the feature be extensible for other scenarios as well? This limits the 
composability of the feature. I can now create a struct of fixed size lists, 
but not a fixed size list of structs.
   * Parquet has a concept for (non-fixed size) lists. It is conceptually weird 
that fixed size lists are totally different from (non-fixed size) lists.
   
   I think the PROs outweigh the CONs here, so I think this is fine with me. I 
just want everyone to be aware about the ramifications.



##########
LogicalTypes.md:
##########
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.

Review Comment:
   `binary` isn't a defined primitive type in Parquet. Do you mean 
FIXED_LEN_BYTE_ARRAY?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to