[jira] [Comment Edited] (FLINK-24921) FileSourceSplit should not be visible in the user API in ParquetColumnarRowInputFormat

Etienne Chauchot (Jira) Wed, 17 Nov 2021 02:32:06 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-24921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444592#comment-17444592
 ]


Etienne Chauchot edited comment on FLINK-24921 at 11/17/21, 10:31 AM:
----------------------------------------------------------------------

I took a deeper look at the code. [~arvid], you're right (thanks): the Parquet
format is used to support the Hive format, hence the parametrized type, and
indeed it cannot easily be removed. But the factory I see in the Parquet format
package (ParquetFileFormatFactory) is for the Table API, so _FileSourceSplit_
will still surface in the Parquet API when using the DataStream API, unless I'm
missing something.

While documenting the DataStream connectors from the user's point of view, I
looked for how to use _ParquetColumnarRowInputFormat_ by reading the tests.
They either refer explicitly to _FileSourceSplit_ or make raw use of the
parametrized _ParquetColumnarRowInputFormat_ class, so I thought we could
improve the API.
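
To make the trade-off concrete, here is a minimal, self-contained sketch. All
names in it (FileSplit, HiveSplit, BulkFormat, ParametrizedFormat,
FixedSplitFormat, SplitVisibilityDemo) are hypothetical stand-ins that only
mimic the shape of the classes discussed here; none of it is Flink code. It
shows how a split type parameter ends up in the user's declaration, why the
parameter is useful for a Hive-like split subtype, and how a format that fixes
the split type in its implements clause hides it.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an internal split class (not Flink's FileSourceSplit).
class FileSplit {
    final String path;

    FileSplit(String path) {
        this.path = path;
    }
}

// Hypothetical stand-in for a connector-specific split subtype.
class HiveSplit extends FileSplit {
    HiveSplit(String path) {
        super(path);
    }
}

// Simplified stand-in for a bulk format: reads records of type T from a split.
interface BulkFormat<T, SplitT extends FileSplit> {
    List<T> read(SplitT split);
}

// Parametrized shape: the split type is a type parameter of the class itself,
// so every user declaration must name it (or fall back to a raw type).
class ParametrizedFormat<SplitT extends FileSplit> implements BulkFormat<String, SplitT> {
    @Override
    public List<String> read(SplitT split) {
        return Collections.singletonList("row from " + split.path);
    }
}

// Non-parametrized shape: the split type is fixed in the implements clause,
// so it does not appear where the user declares the format.
class FixedSplitFormat implements BulkFormat<String, FileSplit> {
    @Override
    public List<String> read(FileSplit split) {
        return Collections.singletonList("row from " + split.path);
    }
}

class SplitVisibilityDemo {
    // The user has to spell out the split type in the declaration.
    static List<String> readParametrized(String path) {
        ParametrizedFormat<FileSplit> format = new ParametrizedFormat<>();
        return format.read(new FileSplit(path));
    }

    // The same class reused with a subtype split: this is what the type
    // parameter buys, and why it is hard to remove.
    static List<String> readParametrizedForHive(String path) {
        ParametrizedFormat<HiveSplit> format = new ParametrizedFormat<>();
        return format.read(new HiveSplit(path));
    }

    // No split type in the declaration. Creating the split here stands in
    // for what a framework would normally do on the user's behalf.
    static List<String> readFixed(String path) {
        FixedSplitFormat format = new FixedSplitFormat();
        return format.read(new FileSplit(path));
    }

    public static void main(String[] args) {
        System.out.println(readParametrized("a.parquet"));
        System.out.println(readParametrizedForHive("b.parquet"));
        System.out.println(readFixed("c.parquet"));
    }
}
```

In this sketch the parametrized variant can serve both split types, while the
fixed variant keeps the split type out of the declaration at the cost of that
reuse. Whether a similar split is feasible for the real classes is exactly the
open question above.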

For now, I'll document as is.


> FileSourceSplit should not be visible in the user API in 
> ParquetColumnarRowInputFormat
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-24921
>                 URL: https://issues.apache.org/jira/browse/FLINK-24921
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Major
>
> _FileSourceSplit_ is an internal class that should not be visible in the user 
> API like 
> [here|https://github.com/apache/flink/blob/6f2d8fe3007464343c5312e27612be448b415148/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/ParquetColumnarRowInputFormatTest.java#L235].
>  Because _FileSourceSplit_ surfaces in the API, users are also pushed toward 
> raw use of the parametrized class, as 
> [here|https://github.com/apache/flink/blob/6f2d8fe3007464343c5312e27612be448b415148/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/ParquetColumnarRowInputFormatTest.java#L407]
> It could be better to make the Parquet format a non-parametrized class, as is 
> done for the Hive connector:
> _class HiveBulkFormatAdapter_
> _implements BulkFormat<RowData, HiveSourceSplit>_
> rather than
> _class ParquetColumnarRowInputFormat<SplitT extends FileSourceSplit>_
> _extends ParquetVectorizedInputFormat<RowData, SplitT>_
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
