[
https://issues.apache.org/jira/browse/FLINK-35620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855098#comment-17855098
]
Vicky Papavasileiou commented on FLINK-35620:
---------------------------------------------
[~lzljs3620320] [~jingge] FYI: The previous PR
[https://github.com/apache/flink/pull/24795] did not completely address the
issue of supporting nested arrays
> Parquet writer creates wrong file for nested fields
> ---------------------------------------------------
>
> Key: FLINK-35620
> URL: https://issues.apache.org/jira/browse/FLINK-35620
> Project: Flink
> Issue Type: Bug
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Affects Versions: 1.19.0
> Reporter: Vicky Papavasileiou
> Priority: Major
>
> After PR [https://github.com/apache/flink/pull/24795] was merged, adding
> support for nested arrays, the Parquet writer produces wrong Parquet files
> that cannot be read correctly. Note that the readers (both Flink and
> Iceberg) do not throw an exception but return `null` for the nested field.
> The error is in how the field `max_definition_level` is populated for
> nested fields.
> Consider Avro schema:
> {code:java}
> {
>   "namespace": "com.test",
>   "type": "record",
>   "name": "RecordData",
>   "fields": [
>     {
>       "name": "Field1",
>       "type": {
>         "type": "array",
>         "items": {
>           "type": "record",
>           "name": "NestedField2",
>           "fields": [
>             { "name": "NestedField3", "type": "double" }
>           ]
>         }
>       }
>     }
>   ]
> }
> {code}
>
> Consider the excerpt below of a parquet file produced by Flink for the above
> schema:
> {code:java}
> Column(SegmentStartTime) ############
> name: NestedField3
> path: Field1.list.element.NestedField3
> max_definition_level: 1
> max_repetition_level: 1
> physical_type: DOUBLE
> logical_type: None
> converted_type (legacy): NONE
> compression: SNAPPY (space_saved: 7%)
> {code}
>
> The `max_definition_level` should be 4 but is 1.
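As background for the expected value, a column's max definition level is the count of non-required (optional or repeated) nodes on the path from the root to the leaf; required nodes add nothing. The sketch below is a minimal, self-contained illustration of that counting rule, not Flink's actual writer code; the `Repetition` enum, the method name, and the assumed mapping of each node on `Field1.list.element.NestedField3` to optional/repeated (which yields the reported expected level of 4 under the standard 3-level LIST encoding with a nullable leaf) are all hypothetical.

```java
import java.util.List;

public class DefLevel {
    public enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    // Max definition level = number of OPTIONAL or REPEATED nodes on the
    // path from the schema root to the leaf; REQUIRED nodes add no level.
    public static int maxDefinitionLevel(List<Repetition> pathRootToLeaf) {
        int level = 0;
        for (Repetition r : pathRootToLeaf) {
            if (r != Repetition.REQUIRED) {
                level++;
            }
        }
        return level;
    }

    public static void main(String[] args) {
        // Hypothetical shape of Field1.list.element.NestedField3 under the
        // standard 3-level LIST encoding, assuming a nullable leaf:
        //   optional group Field1 (LIST)
        //     repeated group list
        //       optional group element
        //         optional double NestedField3
        int level = maxDefinitionLevel(List.of(
                Repetition.OPTIONAL,    // Field1
                Repetition.REPEATED,    // list
                Repetition.OPTIONAL,    // element
                Repetition.OPTIONAL )); // NestedField3
        System.out.println(level); // 4
    }
}
```

Under this assumed mapping the level is 4, matching the expected value above; a writer that reports 1 has effectively counted only one non-required node on the path.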
--
This message was sent by Atlassian Jira
(v8.20.10#820010)