[ https://issues.apache.org/jira/browse/FLINK-35620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vicky Papavasileiou updated FLINK-35620:
----------------------------------------
    Description: 
After PR [https://github.com/apache/flink/pull/24795], which added support for nested arrays, was merged, the Parquet writer produces incorrect Parquet files whose nested fields cannot be read back. Note that the readers (both Flink and Iceberg) don't throw an exception; they return `null` for the nested field instead.

The error is in how the field `max_definition_level` is populated for nested fields: in Parquet, a column's maximum definition level must equal the number of optional and repeated fields along its path, since each of those contributes one level.

Consider the following Avro schema:
{code:java}
{
  "namespace": "com.test",
  "type": "record",
  "name": "RecordData",
  "fields": [
    {
      "name": "Field1",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "NestedField2",
          "fields": [
            { "name": "NestedField3", "type": "double" }
          ]
        }
      }
    }
  ]
}
{code}
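
For reference, with Parquet's standard 3-level list encoding this schema is expected to map to a structure along these lines (a sketch; the `list`/`element` names match the column path shown below, and I am assuming Flink writes non-repeated fields as optional, which is what the expected level of 4 implies):
{code:java}
message RecordData {
  optional group Field1 (LIST) {      // definition level 1
    repeated group list {             // definition level 2, repetition level 1
      optional group element {        // definition level 3
        optional double NestedField3; // definition level 4
      }
    }
  }
}
{code}
Each optional or repeated field on the path adds one definition level, and each repeated field adds one repetition level, giving `max_definition_level = 4` and `max_repetition_level = 1` for `NestedField3`.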

Now consider the excerpt below from a Parquet file produced by Flink for the above schema:
{code:java}
Column(SegmentStartTime) ############
name: NestedField3
path: Field1.list.element.NestedField3
max_definition_level: 1
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 7%) {code}

The `max_definition_level` should be 4 but is 1.
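
As a sanity check, here is a minimal sketch using the parquet-mr schema API (my own illustration, not the Flink writer code) that builds the expected 3-level structure and confirms the levels:
{code:java}
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class DefinitionLevelCheck {
  public static void main(String[] args) {
    // Expected 3-level list structure for the Avro schema above,
    // assuming non-repeated fields are written as optional.
    MessageType schema = Types.buildMessage()
        .optionalGroup().as(LogicalTypeAnnotation.listType())
            .repeatedGroup()
                .optionalGroup()
                    .optional(PrimitiveTypeName.DOUBLE).named("NestedField3")
                .named("element")
            .named("list")
        .named("Field1")
        .named("RecordData");

    // Each optional/repeated ancestor contributes one definition level:
    // Field1 (1) -> list (2) -> element (3) -> NestedField3 (4)
    System.out.println(schema.getMaxDefinitionLevel(
        "Field1", "list", "element", "NestedField3")); // prints 4
    System.out.println(schema.getMaxRepetitionLevel(
        "Field1", "list", "element", "NestedField3")); // prints 1
  }
}
{code}
A reader that sees `max_definition_level = 1` in the column metadata treats the deeper nesting as absent, which would explain the `null` results described above.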

> Parquet writer creates wrong file for nested fields
> ---------------------------------------------------
>
>                 Key: FLINK-35620
>                 URL: https://issues.apache.org/jira/browse/FLINK-35620
>             Project: Flink
>          Issue Type: Bug
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.19.0
>            Reporter: Vicky Papavasileiou
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
