[
https://issues.apache.org/jira/browse/FLINK-24921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444592#comment-17444592
]
Etienne Chauchot edited comment on FLINK-24921 at 11/17/21, 10:31 AM:
----------------------------------------------------------------------
I took a deeper look at the code. [~arvid], you're right (thanks): the Parquet
format is used to support the Hive format, hence the parametrized type. And
indeed, it cannot easily be removed. But the factory I see in the parquet
format package (ParquetFileFormatFactory) is for the Table API. So
_FileSourceSplit_ will surface in the Parquet API when using the DataStream
API, unless I'm missing something.
While documenting the DataStream connectors from the user's point of view, I
looked into how to use _ParquetColumnarRowInputFormat_ by reading the tests.
They either refer explicitly to _FileSourceSplit_ or make raw use of the
parametrized _ParquetColumnarRowInputFormat_ class, so I thought we could
improve the API.
For now, I'll document as is.
was (Author: echauchot):
I took a deeper look at the code. [~arvid], you're right (thanks): the Parquet
format is used to support the Hive format, hence the parametrized type. And
indeed, it cannot easily be removed. But the factory I see in the parquet
format package (ParquetFileFormatFactory) is for the Table API.
While documenting the DataStream connectors from the user's point of view, I
looked into how to use _ParquetColumnarRowInputFormat_ by reading the tests.
They either refer explicitly to _FileSourceSplit_ or make raw use of the
parametrized _ParquetColumnarRowInputFormat_ class, so I thought we could
improve the API.
For now, I'll document as is.
> FileSourceSplit should not be visible in the user API in
> ParquetColumnarRowInputFormat
> --------------------------------------------------------------------------------------
>
> Key: FLINK-24921
> URL: https://issues.apache.org/jira/browse/FLINK-24921
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem
> Reporter: Etienne Chauchot
> Assignee: Etienne Chauchot
> Priority: Major
>
> _FileSourceSplit_ is an internal class that should not be visible in the user
> API, as it is
> [here|https://github.com/apache/flink/blob/6f2d8fe3007464343c5312e27612be448b415148/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/ParquetColumnarRowInputFormatTest.java#L235].
> The fact that _FileSourceSplit_ surfaces in the API also pushes users toward
> raw use of the parametrized class, as seen
> [here|https://github.com/apache/flink/blob/6f2d8fe3007464343c5312e27612be448b415148/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/ParquetColumnarRowInputFormatTest.java#L407].
> It would be better to make the Parquet format a non-parametrized class, as is
> done for the Hive connector:
> _class HiveBulkFormatAdapter_
> _implements BulkFormat<RowData, HiveSourceSplit>_
> rather than
> _class ParquetColumnarRowInputFormat<SplitT extends FileSourceSplit>_
> _extends ParquetVectorizedInputFormat<RowData, SplitT>_
>
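To make the contrast in the description concrete, here is a minimal Java sketch of the two declaration styles. It is only an illustration under stated assumptions: FileSplit, Format, GenericFormat and FixedFormat are hypothetical stand-ins invented for this example, not the real Flink FileSourceSplit, BulkFormat, ParquetColumnarRowInputFormat or HiveBulkFormatAdapter, and the read method is a placeholder rather than any actual Flink API.

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: every type below is a hypothetical stand-in,
// not a real Flink class or interface.
class SplitTypeSketch {

    // Stand-in for an internal split class.
    static class FileSplit {
        final String path;

        FileSplit(String path) {
            this.path = path;
        }
    }

    // Stand-in for a format interface that is generic in the split type.
    interface Format<T, SplitT extends FileSplit> {
        List<T> read(SplitT split);
    }

    // Parametrized style: the split type is a type parameter of the class,
    // so callers must either name it or fall back to the raw type.
    static class GenericFormat<SplitT extends FileSplit> implements Format<String, SplitT> {
        @Override
        public List<String> read(SplitT split) {
            return Collections.singletonList("row from " + split.path);
        }
    }

    // Non-parametrized style: the split type is fixed in the declaration,
    // so callers never have to mention it when declaring the format.
    static class FixedFormat implements Format<String, FileSplit> {
        @Override
        public List<String> read(FileSplit split) {
            return Collections.singletonList("row from " + split.path);
        }
    }

    public static void main(String[] args) {
        FileSplit split = new FileSplit("data.parquet");

        // Option 1: the internal split type appears in user code.
        GenericFormat<FileSplit> explicit = new GenericFormat<>();
        System.out.println(explicit.read(split));

        // Option 2: raw use of the parametrized class (compiles with warnings).
        @SuppressWarnings({"rawtypes", "unchecked"})
        List<String> fromRaw = new GenericFormat().read(split);
        System.out.println(fromRaw);

        // Non-parametrized class: no type argument to spell out.
        FixedFormat fixed = new FixedFormat();
        System.out.println(fixed.read(split));
    }
}
```

In this sketch all three calls return the same rows. The only difference is what the caller has to write to declare the format, which is the usability concern raised in the issue.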
--
This message was sent by Atlassian Jira
(v8.20.1#820001)