[GitHub] [flink] fapaul commented on a change in pull request #17501: [Draft][FLINK-21406][RecordFormat] build AvroParquetRecordFormat for the new FileSource

GitBox Thu, 21 Oct 2021 02:14:10 -0700


fapaul commented on a change in pull request #17501:
URL: https://github.com/apache/flink/pull/17501#discussion_r733480109




##########
File path: flink-formats/flink-avro/pom.xml
##########
@@ -26,7 +26,7 @@ under the License.
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-formats</artifactId>
                <version>1.15-SNAPSHOT</version>
-               <relativePath>..</relativePath>
+               <relativePath>../pom.xml</relativePath>

Review comment:
       > where is the reference to tell us that BulkFormat support streaming? 
Afaik, all javadocs about BulkFormat are only talking about batch, please refer 
to the javadoc of BulkFormat itself and the javadoc of FileSource.
   
   In general, all formats should support both batch and streaming execution. 
For an example showing that `BulkFormat`s also apply to streaming execution, 
take a look at this docstring [1]. It mentions checkpoints and how the last 
offset/position is tracked. Checkpointing is not supported in batch execution, 
so that passage only makes sense for streaming.
   The difference between `BulkFormat` and `FileRecordFormat` is how the 
underlying reader interacts with the filesystem. `BulkFormat`s usually read 
data in batches, e.g. the Parquet reader always reads whole blocks/row groups, 
whereas a `FileRecordFormat` usually reads the file line by line.
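   
   To make the distinction concrete, here is a minimal sketch of the two read 
styles and of the offset each one would store in a checkpoint. Everything in 
it is hypothetical and only for illustration: `ReadGranularitySketch`, 
`BlockReader`, `RecordReader` and `checkpointOffset` are not Flink or Parquet 
classes, and an in-memory list stands in for the file.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   // Hypothetical illustration only; none of these types exist in Flink or Parquet.
   public class ReadGranularitySketch {

       /** Bulk-style reader: fetches one block (think: row group) per call. */
       public static final class BlockReader {
           private final List<String> records;
           private final int blockSize;
           private int blockStart; // index of the first record of the next block

           public BlockReader(List<String> records, int blockSize, int startOffset) {
               this.records = records;
               this.blockSize = blockSize;
               this.blockStart = startOffset;
           }

           /** Returns the next block, or an empty list at the end of the input. */
           public List<String> nextBlock() {
               int end = Math.min(blockStart + blockSize, records.size());
               List<String> block = new ArrayList<>(records.subList(blockStart, end));
               blockStart = end;
               return block;
           }

           /** Position to store in a checkpoint: the start of the next unread block. */
           public int checkpointOffset() {
               return blockStart;
           }
       }

       /** Record-style reader: hands out a single record per call. */
       public static final class RecordReader {
           private final List<String> records;
           private int next; // index of the next unread record

           public RecordReader(List<String> records, int startOffset) {
               this.records = records;
               this.next = startOffset;
           }

           /** Returns the next record, or null at the end of the input. */
           public String nextRecord() {
               return next < records.size() ? records.get(next++) : null;
           }

           /** Position to store in a checkpoint: the next unread record. */
           public int checkpointOffset() {
               return next;
           }
       }

       public static void main(String[] args) {
           List<String> file = Arrays.asList("a", "b", "c", "d", "e");

           BlockReader bulk = new BlockReader(file, 2, 0);
           System.out.println("block: " + bulk.nextBlock());

           // Simulate recovery: a new reader resumes from the checkpointed offset.
           BlockReader restored = new BlockReader(file, 2, bulk.checkpointOffset());
           System.out.println("block after restore: " + restored.nextBlock());

           RecordReader single = new RecordReader(file, 0);
           System.out.println("record: " + single.nextRecord());
       }
   }
   ```
   
   Both readers can resume from a stored position, which is what makes either 
style usable in a streaming job. They differ only in how much data one call 
pulls from the file.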
   
   After looking through the `AvroParquetReader` I think your assumption is 
right. We cannot implement a bulk format here because the reader does not 
expose any information about the underlying block/rowgroup structure.
   
   I am still a bit unsure about the newly introduced `RecordFormat`. So far 
you have only mentioned that we use the `StreamFormat` to support compression, 
but I think the right way to support compression for ParquetAvro would be to 
configure the reader with a codecFactory.
   
   ```java
   AvroParquetReader.<GenericRecord>builder(new ParquetInputFile(stream, fileLen))
           .withCodecFactory(...)
   ```
   [1] https://github.com/apache/flink/blob/34de7d1038f1078980cc539273b724ce7c85696a/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/reader/BulkFormat.java#L56



