[ 
https://issues.apache.org/jira/browse/FLINK-29527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Shun updated FLINK-29527:
-----------------------------
    Description: 
Currently, since the improvement [FLINK-23715], Flink uses a collection named 
`unknownFieldsIndices` to track nonexistent fields. It is kept inside 
the `ParquetVectorizedInputFormat` and applied to all Parquet files under 
the given path.

However, some fields may be missing only from some of the historical Parquet 
files while existing in the latest ones. Because `unknownFieldsIndices` is 
applied to every file, Flink always skips these fields, even though they exist 
in the later files.

As a result, the values of these fields come back empty whenever the fields 
are missing from some of the historical Parquet files, even for files that 
contain them.

  was:
Currently, from the improvement [[FLINK-23715], Flink use a collection named 
`unknownFieldsIndices` to track the nonexistent fields, and it is kept inside 
the `ParquetVectorizedInputFormat`, and applied to all parquet files under 
given path.

However, some fields may only be nonexistent in some of the historical parquet 
files, while exist in latest ones. And based on `unknownFieldsIndices`, flink 
will always skip these fields, even thought they are existing in the later 
parquets.

As a result, the value of these fields will become empty when they are 
nonexistent in some historical parquet files.


> Make unknownFieldsIndices work for single ParquetReader
> -------------------------------------------------------
>
>                 Key: FLINK-29527
>                 URL: https://issues.apache.org/jira/browse/FLINK-29527
>             Project: Flink
>          Issue Type: Bug
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.16.0
>            Reporter: Sun Shun
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, since the improvement [FLINK-23715], Flink uses a collection named 
> `unknownFieldsIndices` to track nonexistent fields. It is kept inside 
> the `ParquetVectorizedInputFormat` and applied to all Parquet files under 
> the given path.
> However, some fields may be missing only from some of the historical 
> Parquet files while existing in the latest ones. Because 
> `unknownFieldsIndices` is applied to every file, Flink always skips these 
> fields, even though they exist in the later files.
> As a result, the values of these fields come back empty whenever the fields 
> are missing from some of the historical Parquet files, even for files that 
> contain them.
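
The behavior described above can be illustrated with a small standalone sketch. It does not use Flink or Parquet classes. Each "file" is modeled as a list of column names, and the class `UnknownFieldsSketch` and the methods `readFile`, `readAllShared`, and `readAllPerReader` are hypothetical names chosen for illustration. Only the name `unknownFieldsIndices` comes from the issue. The sketch contrasts one set shared across all files with a fresh set per reader, which is what the issue title suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the issue; not Flink's actual implementation.
class UnknownFieldsSketch {

    // "Reads" one file: records projected fields missing from the file,
    // then emits null for every index currently marked as unknown.
    static List<String> readFile(
            List<String> fileColumns,
            List<String> projectedFields,
            Set<Integer> unknownFieldsIndices) {
        for (int i = 0; i < projectedFields.size(); i++) {
            if (!fileColumns.contains(projectedFields.get(i))) {
                unknownFieldsIndices.add(i);
            }
        }
        List<String> row = new ArrayList<>();
        for (int i = 0; i < projectedFields.size(); i++) {
            row.add(unknownFieldsIndices.contains(i)
                    ? null
                    : projectedFields.get(i) + "-value");
        }
        return row;
    }

    // One set shared by all files: an index marked unknown by an old file
    // stays unknown for every later file.
    static List<List<String>> readAllShared(
            List<List<String>> files, List<String> projectedFields) {
        Set<Integer> shared = new HashSet<>();
        List<List<String>> rows = new ArrayList<>();
        for (List<String> file : files) {
            rows.add(readFile(file, projectedFields, shared));
        }
        return rows;
    }

    // A fresh set per file (per reader): each file is judged on its own schema.
    static List<List<String>> readAllPerReader(
            List<List<String>> files, List<String> projectedFields) {
        List<List<String>> rows = new ArrayList<>();
        for (List<String> file : files) {
            rows.add(readFile(file, projectedFields, new HashSet<>()));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> projected = Arrays.asList("id", "city");
        List<List<String>> files = Arrays.asList(
                Arrays.asList("id"),           // historical file without "city"
                Arrays.asList("id", "city"));  // later file with "city"
        System.out.println("shared:     " + readAllShared(files, projected));
        System.out.println("per-reader: " + readAllPerReader(files, projected));
    }
}
```

With the shared set, the second file's `city` column is reported as null even though the file contains it. With a per-reader set, only the historical file yields null for `city`.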



--
This message was sent by Atlassian Jira
(v8.20.10#820010)