hcrosse opened a new issue, #722: URL: https://github.com/apache/arrow-go/issues/722
### Describe the bug, including details regarding any error messages, version, and platform.

arrow-go writes `REPEATED` as the `repetition_type` for the root `SchemaElement` in the Parquet Thrift footer. I think this is non-standard and it's caused some interoperability failures for me.

The default `rootRepetition` in [`WriterProperties`](https://github.com/apache/arrow-go/blob/main/parquet/writer_properties.go#L519) is `Repetitions.Repeated`. While `WithRootRepetition` exists as an opt-in override, the default itself is non-standard, and consumers of arrow-go (like [apache/iceberg-go](https://github.com/apache/iceberg-go)) inherit this default and may not expose ways to modify it.

Per the [Parquet format spec](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L516-L518):

> "The root of the schema does not have a repetition_type. All other nodes must have one."

The `repetition_type` field on `SchemaElement` is `optional` in the Thrift definition specifically because the root should not carry one.
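To make the spec rule concrete, here is a minimal sketch of a strict check over a flattened schema list (root first, as in the footer). The `SchemaElement` struct, the `rep` helper, and `Violations` are hypothetical stand-ins written for illustration, not arrow-go types; a nil pointer models an unset optional Thrift field. Real readers are often more lenient than this strict reading.

```go
package main

import "fmt"

// FieldRepetitionType mirrors the enum values defined in parquet.thrift.
type FieldRepetitionType int32

const (
	Required FieldRepetitionType = 0
	Optional FieldRepetitionType = 1
	Repeated FieldRepetitionType = 2
)

// SchemaElement is a hypothetical, trimmed-down stand-in for the Thrift
// struct. A nil RepetitionType models "optional field not set".
type SchemaElement struct {
	Name           string
	RepetitionType *FieldRepetitionType
	NumChildren    int32
}

// rep is a small helper that returns a pointer to a repetition value.
func rep(r FieldRepetitionType) *FieldRepetitionType { return &r }

// Violations applies a strict reading of the spec sentence quoted above:
// the root (index 0) must not set repetition_type, every other node must.
func Violations(schema []SchemaElement) []string {
	var out []string
	for i, el := range schema {
		isRoot := i == 0
		switch {
		case isRoot && el.RepetitionType != nil:
			out = append(out, fmt.Sprintf("root %q has repetition_type=%d set", el.Name, *el.RepetitionType))
		case !isRoot && el.RepetitionType == nil:
			out = append(out, fmt.Sprintf("node %q is missing repetition_type", el.Name))
		}
	}
	return out
}

func main() {
	// Shape of the footer described in this issue: root carries REPEATED.
	schema := []SchemaElement{
		{Name: "schema", RepetitionType: rep(Repeated), NumChildren: 1},
		{Name: "id", RepetitionType: rep(Required)},
	}
	for _, v := range Violations(schema) {
		fmt.Println(v)
	}
}
```

Under this strict reading, a footer whose root carries `REPEATED` is flagged, while the same schema with the root field left unset passes.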
Among the Parquet implementations I checked, arrow-go is the only one that writes `REPEATED` into the Thrift footer for the root element:

| Implementation | In-memory | On disk (Thrift footer) | Source |
|---|---|---|---|
| **Parquet spec** | N/A | Not set | [parquet.thrift#L516-L518](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L516-L518) |
| **parquet-java** | `REPEATED` | **Not set** (stripped during serialization) | [MessageType.java#L36](https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageType.java#L36), [ParquetMetadataConverter.java#L323-L329](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L323-L329) |
| **Arrow C++ / pyarrow** | `REQUIRED` | `REQUIRED` | [schema.cc#L1228](https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1228) |
| **arrow-rs (Rust)** | `None` | **Not set** | [types.rs#L45-L46](https://github.com/apache/arrow-rs/blob/main/parquet/src/schema/types.rs#L45-L46), [types.rs#L590-L591](https://github.com/apache/arrow-rs/blob/main/parquet/src/schema/types.rs#L590-L591) |
| **arrow-go** | `REPEATED` | **`REPEATED`** | [writer_properties.go#L519](https://github.com/apache/arrow-go/blob/main/parquet/writer_properties.go#L519) |

For added context, arrow-rs explicitly tolerates and strips root repetition when reading files from other implementations ([types.rs#L1383-L1396](https://github.com/apache/arrow-rs/blob/main/parquet/src/schema/types.rs#L1383-L1396)).

In my specific example, Snowflake rejects Parquet files written with `REPEATED` root repetition when they contain list columns:

> "List encoding is not supported. List encoding: '0'"

I was able to reproduce this consistently: any iceberg-go table with list columns fails to load in Snowflake when the root schema element has `REPEATED` repetition.

A couple possible fixes:

1. Don't serialize `repetition_type` for the root `SchemaElement` at all, matching parquet-java and arrow-rs behavior and the Parquet spec exactly. The `WithRootRepetition` option and in-memory representation would be unaffected.
2. Change the default to `Repetitions.Required`, matching Arrow C++/pyarrow. Less spec-pure but a smaller change.

Either way, the existing `WithRootRepetition` API will remain available for anyone who needs to override the behavior. I'm happy to submit a PR for whichever approach is preferred if either of these sound good.

### Component(s)

Parquet

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
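As a rough illustration of fix option 1 from the issue, the sketch below keeps the repetition on the in-memory root but leaves it unset when flattening the schema for the footer. `Node`, `FooterElement`, and `Flatten` are hypothetical names invented for this example and do not reflect arrow-go's actual serialization code.

```go
package main

import "fmt"

// Repetition uses the same numeric values as the Parquet Thrift enum.
type Repetition int32

const (
	Required Repetition = iota
	Optional
	Repeated
)

// Node is a hypothetical in-memory schema node. It always carries a
// repetition, including on the root.
type Node struct {
	Name       string
	Repetition Repetition
	Children   []*Node
}

// FooterElement is a hypothetical stand-in for the serialized element.
// A nil Repetition means the optional field is not written.
type FooterElement struct {
	Name        string
	Repetition  *Repetition
	NumChildren int32
}

// Flatten walks the tree depth-first and emits footer elements, leaving
// repetition unset for the root only. The in-memory tree is not modified.
func Flatten(root *Node) []FooterElement {
	var out []FooterElement
	var walk func(n *Node, isRoot bool)
	walk = func(n *Node, isRoot bool) {
		el := FooterElement{Name: n.Name, NumChildren: int32(len(n.Children))}
		if !isRoot {
			r := n.Repetition // copy so the pointer does not alias the node
			el.Repetition = &r
		}
		out = append(out, el)
		for _, c := range n.Children {
			walk(c, false)
		}
	}
	walk(root, true)
	return out
}

func main() {
	root := &Node{Name: "schema", Repetition: Repeated, Children: []*Node{
		{Name: "id", Repetition: Required},
		{Name: "tags", Repetition: Optional},
	}}
	for _, el := range Flatten(root) {
		if el.Repetition == nil {
			fmt.Printf("%s: repetition_type not set\n", el.Name)
		} else {
			fmt.Printf("%s: repetition_type=%d\n", el.Name, *el.Repetition)
		}
	}
}
```

The point of this shape is that the decision lives only in the serialization step, so an in-memory default or an explicit override of the root repetition stays visible to callers while the written footer follows the spec.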
