[ 
https://issues.apache.org/jira/browse/FLINK-29579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17615782#comment-17615782
 ] 

Tiansu Yu edited comment on FLINK-29579 at 10/11/22 1:20 PM:
-------------------------------------------------------------

Thanks. This motivates me to take a look at the Parquet IO section of the latest stable docs 
([https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/datastream/formats/parquet/])

I notice that in the last section, on reading Avro reflect records, the example 
uses a class called 
[AvroParquetReaders|https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/index.html?org/apache/flink/formats/parquet/avro/AvroParquetReaders.html],
 but the class itself is annotated with @Experimental. Is it expected that the 
Parquet IO API will change again in the future? 

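For reference, the usage shown on that docs page looks roughly like the sketch below. The `Datum` POJO and the file path are placeholders of mine, not from this issue; only `AvroParquetReaders.forReflectRecord` and the `FileSource` builder come from the Flink 1.15 documentation.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReflectRecordExample {

    // Placeholder POJO; Avro reflection maps Parquet columns onto its public fields.
    public static class Datum {
        public String name;
        public int value;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // forReflectRecord builds a StreamFormat that reads Parquet rows as
        // plain POJOs via Avro's reflect data model.
        FileSource<Datum> source =
                FileSource.forRecordStreamFormat(
                                AvroParquetReaders.forReflectRecord(Datum.class),
                                new Path("/path/to/parquet"))  // placeholder path
                        .build();

        DataStream<Datum> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "parquet-source");
        stream.print();
        env.execute();
    }
}
```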


> Flink parquet reader cannot read fully optional elements in a repeated list
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-29579
>                 URL: https://issues.apache.org/jira/browse/FLINK-29579
>             Project: Flink
>          Issue Type: Bug
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.13.2
>            Reporter: Tiansu Yu
>            Priority: Major
>              Labels: SchemaValidation, parquet, parquetReader
>
> While trying to read a parquet file containing the following field as part of 
> the schema, 
> {code:java}
>  optional group attribute_values (LIST) {
>     repeated group list {
>       optional group element {
>         optional binary attribute_key_id (STRING);
>         optional binary attribute_value_id (STRING);
>         optional int32 pos;
>       }
>     }
>   } {code}
>  I encountered the following problem 
> {code:java}
> Exception in thread "main" java.lang.UnsupportedOperationException: List field [optional binary attribute_key_id (STRING)] in List [attribute_values] has to be required. 
> 	at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.convertGroupElementToArrayTypeInfo(ParquetSchemaConverter.java:338)
> 	at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.convertParquetTypeToTypeInfo(ParquetSchemaConverter.java:271)
> 	at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.convertFields(ParquetSchemaConverter.java:81)
> 	at org.apache.flink.formats.parquet.utils.ParquetSchemaConverter.fromParquetType(ParquetSchemaConverter.java:61)
> 	at org.apache.flink.formats.parquet.ParquetInputFormat.<init>(ParquetInputFormat.java:120)
> 	at org.apache.flink.formats.parquet.ParquetRowInputFormat.<init>(ParquetRowInputFormat.java:39)
>  {code}
> The main code that raises the problem goes as follows:
> {code:java}
> private static ObjectArrayTypeInfo convertGroupElementToArrayTypeInfo(
>         GroupType arrayFieldType, GroupType elementType) {
>     for (Type type : elementType.getFields()) {
>         if (!type.isRepetition(Type.Repetition.REQUIRED)) {
>             throw new UnsupportedOperationException(
>                     String.format(
>                             "List field [%s] in List [%s] has to be required. ",
>                             type.toString(), arrayFieldType.getName()));
>         }
>     }
>     return ObjectArrayTypeInfo.getInfoFor(convertParquetTypeToTypeInfo(elementType));
> } {code}
> I am not very familiar with the internals of Parquet schemas, but the problem 
> looks to me like Flink being too restrictive about repetition types inside 
> certain nested fields. I would love to hear some feedback on this 
> (improvements, corrections, workarounds).
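The check quoted above can be reproduced with a small standalone model. The class, enum, and method names below are hypothetical; only the loop mirrors the logic in `ParquetSchemaConverter.convertGroupElementToArrayTypeInfo`. Note that the Parquet LIST logical-type rules do allow the element group's fields to be optional, which is why a schema like the one reported trips this check.

```java
import java.util.List;

public class ListRepetitionCheck {

    // Minimal stand-ins (hypothetical) for Parquet's Type.Repetition and fields.
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    record Field(String name, Repetition repetition) {}

    // Mirrors the converter's loop: every field of the list's element group
    // must be REQUIRED, otherwise an UnsupportedOperationException is thrown.
    static void checkElementFields(String listName, List<Field> elementFields) {
        for (Field f : elementFields) {
            if (f.repetition() != Repetition.REQUIRED) {
                throw new UnsupportedOperationException(
                        String.format(
                                "List field [%s] in List [%s] has to be required.",
                                f.name(), listName));
            }
        }
    }

    public static void main(String[] args) {
        // The element fields from the reported schema are all OPTIONAL,
        // which the Parquet LIST spec permits but this check rejects.
        List<Field> element = List.of(
                new Field("attribute_key_id", Repetition.OPTIONAL),
                new Field("attribute_value_id", Repetition.OPTIONAL),
                new Field("pos", Repetition.OPTIONAL));
        try {
            checkElementFields("attribute_values", element);
            System.out.println("accepted");
        } catch (UnsupportedOperationException e) {
            // → List field [attribute_key_id] in List [attribute_values] has to be required.
            System.out.println(e.getMessage());
        }
    }
}
```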



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
