emkornfield commented on code in PR #250:
URL: https://github.com/apache/parquet-format/pull/250#discussion_r1621186564
##########
src/main/thrift/parquet.thrift:
##########
@@ -1127,18 +1229,48 @@ struct FileMetaData {
* are flattened to a list by doing a depth-first traversal.
 * The column metadata contains the path in the schema for that column which can be
* used to map columns to nodes in the schema.
- * The first element is the root **/
- 2: required list<SchemaElement> schema;
+ * The first element is the root
+ *
+ * PAR1: Required
+ * PAR3: Use schema_page
+ **/
+ 2: optional list<SchemaElement> schema;
+
+ /** Page has BYTE_ARRAY data where each element is REQUIRED.
+ *
+ * Each element is a serialized SchemaElement. The order and content should
+ * have a one to one correspondence with schema.
+ */
+ 10: optional MetadataPage schema_page;
Review Comment:
A little bit. I have a few ideas that I haven't had a chance to write down yet.
Starting from the easiest:
1. For formats like Iceberg that use field_id, three columns (exact physical
layout TBD): <sorted field IDs><schema index><leaf column index>
2. For name-based indexes, I was thinking of a flattened list of schema
elements in breadth-first order. At each level we sort the column names (I
think there are likely two collations that might be useful: normal
lexicographic, and case-insensitive, where all column names are normalized to
lower case). Each column name is associated with an offset and width that
locate its children. There are potentially more complex data structures that
could make this more efficient, but this seems like a reasonable start.
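As a rough sketch of idea 1 (not part of the PR): assuming the three columns are stored as parallel arrays ordered by field ID, a reader could binary-search the field IDs and read the other two columns at the same position. The sample values and the name `lookup_by_field_id` are hypothetical, and the physical layout is still TBD.

```python
from bisect import bisect_left

# Hypothetical parallel columns, ordered by field ID (illustrative values only).
field_ids = [1, 4, 7, 9]
schema_index = [1, 2, 4, 5]        # position of the node in the flattened schema
leaf_column_index = [0, 1, 2, 3]   # position of the leaf column in a row group


def lookup_by_field_id(field_id):
    """Return (schema index, leaf column index) for field_id, or None if absent."""
    pos = bisect_left(field_ids, field_id)
    if pos < len(field_ids) and field_ids[pos] == field_id:
        return schema_index[pos], leaf_column_index[pos]
    return None
```

Each lookup costs O(log n) comparisons and does not require deserializing any SchemaElement.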
In both cases I think we potentially need to decide whether the mapping is
1 to 1 or 1 to many (e.g. I've seen parquet files with duplicate column names
that differ only by whether they are upper case or lower case).
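To make idea 2 and the 1-to-many question concrete, here is a minimal sketch under assumptions that are mine, not the PR's: the schema is given as a nested dict, the index is three parallel lists, and `build_name_index` and `lookup` are hypothetical names. Siblings are laid out contiguously in breadth-first order and sorted by the chosen collation, and a lookup returns every matching position so that names differing only by case can all be reported.

```python
from bisect import bisect_left, bisect_right
from collections import deque


def build_name_index(tree, normalize=str):
    """Flatten a nested dict {name: {child: ...}} breadth-first.

    Siblings are stored contiguously, sorted by normalize(name).
    Returns (names, offsets, widths, root_range); offsets[i] and widths[i]
    give the range holding the children of entry i.
    """
    names, offsets, widths = [], [], []
    pending = deque()  # (entry position, children dict), FIFO gives BFS order

    def append_siblings(children):
        start = len(names)
        for name in sorted(children, key=normalize):
            names.append(normalize(name))
            offsets.append(0)
            widths.append(0)
            pending.append((len(names) - 1, children[name]))
        return start, len(children)

    root_range = append_siblings(tree)
    while pending:
        pos, children = pending.popleft()
        offsets[pos], widths[pos] = append_siblings(children)
    return names, offsets, widths, root_range


def lookup(index, path, normalize=str):
    """Return all entry positions matching path (1 to many is possible)."""
    names, offsets, widths, root_range = index
    ranges, hits = [root_range], []
    for part in path:
        key, hits = normalize(part), []
        for start, width in ranges:
            lo = bisect_left(names, key, start, start + width)
            hi = bisect_right(names, key, start, start + width)
            hits.extend(range(lo, hi))
        ranges = [(offsets[i], widths[i]) for i in hits]
    return hits


schema = {"b": {}, "a": {"Y": {}, "x": {}, "y": {}}}
exact = build_name_index(schema)
folded = build_name_index(schema, normalize=str.lower)
```

With the exact collation, `lookup(exact, ["a", "y"])` finds a single entry. With the case-insensitive collation, `lookup(folded, ["A", "Y"], normalize=str.lower)` finds two entries, which is the 1-to-many case described above.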
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]