[ https://issues.apache.org/jira/browse/FLINK-26301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497488#comment-17497488 ]

Dawid Wysakowicz commented on FLINK-26301:
------------------------------------------

But this is 1. not documented and 2. not possible in all cases.
# In parquet-mr you have the chance to configure projection and pass an Avro 
read schema (see e.g. {{AvroReadSupport#setAvroReadSchema}}, 
{{AvroReadSupport#setRequestedProjection}}, etc.). One cannot do that in 
Flink's format.
# It works somewhat as of now if the parquet file has been created e.g. with 
{{AvroParquetWriters}}, because that writer encodes the Avro schema in the 
Parquet metadata. If the Parquet files were created in a different way and this 
schema is not present, it is nearly impossible to provide a matching Avro 
schema. The automatic Parquet -> Avro converters are nearly unusable, 
especially when it comes to namespaces and such.
# I am not sure it is a good idea to expose {{ReflectData}}. I've stumbled 
upon yet another problem: it is not easy to control the optionality of fields 
in {{ReflectData}} (see e.g. {{ReflectData.AllowNull}}). I am afraid this 
might cause confusion and is not easily addressable by users.
I'd suggest explaining in the documentation when it is advised to use this format.
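
To make point 1 concrete, this is roughly what the projection/read-schema hooks look like in plain parquet-mr (a sketch; the {{User}} record and its field are made up for illustration):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroReadSupport;

public class ProjectionExample {
    public static void main(String[] args) {
        // Hypothetical projection schema listing only the columns we want to read
        Schema projection = SchemaBuilder.record("User").fields()
                .requiredString("name")
                .endRecord();

        Configuration conf = new Configuration();
        // parquet-mr lets you request a column projection ...
        AvroReadSupport.setRequestedProjection(conf, projection);
        // ... and/or set an explicit Avro read schema
        AvroReadSupport.setAvroReadSchema(conf, projection);

        // The Configuration would then be handed to an AvroParquetReader;
        // Flink's AvroParquet format exposes no equivalent hook.
        System.out.println(conf.get("parquet.avro.projection"));
    }
}
```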

Lastly, having all the limitations in mind, I'd recommend marking it 
{{Experimental}} rather than {{PublicEvolving}}.
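
To illustrate the nullability problem from point 3: with plain Avro, the only lever is choosing between the default {{ReflectData}} and {{ReflectData.AllowNull}}, which flips optionality for all fields at once (a sketch with a made-up POJO):

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class NullabilityExample {
    // Hypothetical POJO used only for illustration
    public static class User {
        public String name;
    }

    public static void main(String[] args) {
        // Default ReflectData: reflected fields are non-nullable
        Schema strict = ReflectData.get().getSchema(User.class);
        // ReflectData.AllowNull: every field becomes a union with null
        Schema lenient = ReflectData.AllowNull.get().getSchema(User.class);

        System.out.println(strict.getField("name").schema().getType());   // STRING
        System.out.println(lenient.getField("name").schema().getType());  // UNION
    }
}
```

There is no per-field control, which is why exposing {{ReflectData}} to users can be confusing.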

> Test AvroParquet format
> -----------------------
>
>                 Key: FLINK-26301
>                 URL: https://issues.apache.org/jira/browse/FLINK-26301
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>            Reporter: Jing Ge
>            Assignee: Dawid Wysakowicz
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>
> The following scenarios are worthwhile to test
>  * Start a simple job with a none/at-least-once/exactly-once delivery guarantee, 
> read Avro Generic/Specific/Reflect records, and write them to an arbitrary 
> sink.
>  * Start the above job with bounded/unbounded data.
>  * Start the above job with streaming/batch execution mode.
>  
> This format works with FileSource [2] and can only be used with the 
> DataStream API. Normal parquet files can be used as test files. The schema 
> introduced at [1] could be used.
>  
> References:
> [1] [https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/formats/parquet/]
> [2] [https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/]
>  
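
A minimal job for the scenario above might look like this (a sketch based on the docs in [1]/[2]; the schema, path, and job name are placeholders):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AvroParquetJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical schema matching the test files
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
                + "[{\"name\":\"name\",\"type\":\"string\"}]}");

        // Read Parquet files as Avro GenericRecords via FileSource (DataStream only)
        FileSource<GenericRecord> source = FileSource
                .forRecordStreamFormat(AvroParquetReaders.forGenericRecord(schema),
                        new Path("/tmp/parquet-input"))  // placeholder path
                .build();

        DataStream<GenericRecord> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "avro-parquet");
        stream.print();  // arbitrary sink for testing

        env.execute("AvroParquet format test");
    }
}
```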



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
