[
https://issues.apache.org/jira/browse/FLINK-35620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855098#comment-17855098
]
Vicky Papavasileiou commented on FLINK-35620:
---------------------------------------------
[~lzljs3620320] [~jingge] FYI: The previous PR
[https://github.com/apache/flink/pull/24795] did not completely address the
issue of supporting nested arrays
> Parquet writer creates wrong file for nested fields
> ---------------------------------------------------
>
> Key: FLINK-35620
> URL: https://issues.apache.org/jira/browse/FLINK-35620
> Project: Flink
> Issue Type: Bug
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Affects Versions: 1.19.0
> Reporter: Vicky Papavasileiou
> Priority: Major
>
> After PR [https://github.com/apache/flink/pull/24795] was merged, adding
> support for nested arrays, the Parquet writer produces wrong Parquet files
> that cannot be read correctly. Note that the readers (both Flink and
> Iceberg) do not throw an exception but return `null` for the nested field.
> The error is in how the field `max_definition_level` is populated for
> nested fields.
> Consider Avro schema:
> {code:java}
> {
>   "namespace": "com.test",
>   "type": "record",
>   "name": "RecordData",
>   "fields": [
>     {
>       "name": "Field1",
>       "type": {
>         "type": "array",
>         "items": {
>           "type": "record",
>           "name": "NestedField2",
>           "fields": [
>             { "name": "NestedField3", "type": "double" }
>           ]
>         }
>       }
>     }
>   ]
> }
> {code}
>
> Consider the excerpt below of a parquet file produced by Flink for the above
> schema:
> {code:java}
> Column(SegmentStartTime) ############
> name: NestedField3
> path: Field1.list.element.NestedField3
> max_definition_level: 1
> max_repetition_level: 1
> physical_type: DOUBLE
> logical_type: None
> converted_type (legacy): NONE
> compression: SNAPPY (space_saved: 7%)
> {code}
>
> The `max_definition_level` should be 4 but is 1.
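As background for the expected value, a column's max definition level is the count of non-required (optional or repeated) nodes on the path from the root to the leaf; required nodes add nothing. The sketch below is a minimal, self-contained illustration of that counting rule, not Flink's actual writer code; the `Repetition` enum, the method name, and the assumed mapping of each node on `Field1.list.element.NestedField3` to optional/repeated (which yields the reported expected level of 4 under the standard 3-level LIST encoding with a nullable leaf) are all hypothetical.

```java
import java.util.List;

public class DefLevel {
    public enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    // Max definition level = number of OPTIONAL or REPEATED nodes on the
    // path from the schema root to the leaf; REQUIRED nodes add no level.
    public static int maxDefinitionLevel(List<Repetition> pathRootToLeaf) {
        int level = 0;
        for (Repetition r : pathRootToLeaf) {
            if (r != Repetition.REQUIRED) {
                level++;
            }
        }
        return level;
    }

    public static void main(String[] args) {
        // Hypothetical shape of Field1.list.element.NestedField3 under the
        // standard 3-level LIST encoding, assuming a nullable leaf:
        //   optional group Field1 (LIST)
        //     repeated group list
        //       optional group element
        //         optional double NestedField3
        int level = maxDefinitionLevel(List.of(
                Repetition.OPTIONAL,    // Field1
                Repetition.REPEATED,    // list
                Repetition.OPTIONAL,    // element
                Repetition.OPTIONAL )); // NestedField3
        System.out.println(level); // 4
    }
}
```

Under this assumed mapping the level is 4, matching the expected value above; a writer that reports 1 has effectively counted only one non-required node on the path.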
--
This message was sent by Atlassian Jira
(v8.20.10#820010)