[
https://issues.apache.org/jira/browse/FLINK-26301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497488#comment-17497488
]
Dawid Wysakowicz commented on FLINK-26301:
------------------------------------------
But this is (1) not documented and (2) not possible in all cases.
# In parquet-mr you have the chance to configure projection and to pass an Avro
read schema (see e.g. {{AvroReadSupport#setAvroReadSchema}},
{{AvroReadSupport#setRequestedProjection}}, etc.). One cannot do that in
Flink's format.
# It works to some extent as of now if the parquet file has been created e.g.
with {{AvroParquetWriters}}, because that writer encodes the Avro schema in the
parquet metadata. If the parquet files were created in a different way and this
schema is not present, it is nearly impossible to provide a matching Avro
schema by hand. The automatic parquet -> avro schema converters are nearly
unusable, especially when it comes to namespaces and the like.
# I am not sure it is a good idea to expose {{ReflectData}}. I've stumbled
upon yet another problem: it is not easy to control the optionality
(nullability) of fields in {{ReflectData}} (see e.g. {{ReflectData#ALLOW_NULL}}).
I am afraid this might cause confusion and is not easily addressable by users.
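To illustrate point 1: a minimal sketch, assuming the {{parquet-avro}} and Hadoop client dependencies are on the classpath, of the projection/read-schema hooks that parquet-mr exposes (the {{User}} record and {{id}} field here are hypothetical):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroReadSupport;

public class ParquetProjection {
    public static Configuration projectedConf() {
        Configuration conf = new Configuration();
        // Hypothetical projection schema: read only the "id" column
        // instead of the full record.
        Schema projection = SchemaBuilder.record("User").fields()
                .requiredLong("id")
                .endRecord();
        // parquet-mr lets callers push down a column projection and
        // force a specific Avro read schema via the Hadoop Configuration:
        AvroReadSupport.setRequestedProjection(conf, projection);
        AvroReadSupport.setAvroReadSchema(conf, projection);
        return conf;
    }

    public static void main(String[] args) {
        // Both calls store the schema under parquet.avro.* configuration keys.
        System.out.println(projectedConf().get("parquet.avro.projection"));
    }
}
```

Flink's {{AvroParquetReaders}} exposes no equivalent of these hooks, which is the gap described above.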
I'd suggest explaining when it is advised to use this format.
Lastly, with all these limitations in mind, I'd recommend marking it
{{Experimental}} rather than {{PublicEvolving}}.
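To illustrate point 3: a minimal sketch, assuming only the Avro library, of how field optionality differs between the default {{ReflectData}} and its {{AllowNull}} variant (the {{User}} class here is hypothetical):

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class ReflectNullability {
    // Hypothetical POJO used only for schema inference.
    public static class User {
        public String name;
        public long id;
    }

    public static Schema strictSchema() {
        // Default ReflectData: reference-typed fields are NOT nullable.
        return ReflectData.get().getSchema(User.class);
    }

    public static Schema lenientSchema() {
        // AllowNull variant: reference-typed fields become ["null", ...] unions.
        return ReflectData.AllowNull.get().getSchema(User.class);
    }

    public static void main(String[] args) {
        System.out.println(strictSchema().getField("name").schema());
        System.out.println(lenientSchema().getField("name").schema());
    }
}
```

The two variants infer incompatible schemas from the same class, so records written with one and read with the other may fail schema resolution, which is the confusion described above.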
> Test AvroParquet format
> -----------------------
>
> Key: FLINK-26301
> URL: https://issues.apache.org/jira/browse/FLINK-26301
> Project: Flink
> Issue Type: Improvement
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Reporter: Jing Ge
> Assignee: Dawid Wysakowicz
> Priority: Blocker
> Labels: release-testing
> Fix For: 1.15.0
>
>
> The following scenarios are worthwhile to test
> * Start a simple job with None/At-least-once/Exactly-once delivery guarantee,
> read Avro Generic/Specific/Reflect records, and write them to an arbitrary
> sink.
> * Start the above job with bounded/unbounded data.
> * Start the above job with streaming/batch execution mode.
>
> This format works with FileSource [2] and can only be used with DataStream.
> Normal parquet files can be used as test files. The schema introduced in [1]
> could be used.
>
> References:
> [1] [https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/formats/parquet/]
> [2] [https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/]
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)