fapaul commented on a change in pull request #17501:
URL: https://github.com/apache/flink/pull/17501#discussion_r733480109
##########
File path: flink-formats/flink-avro/pom.xml
##########
@@ -26,7 +26,7 @@ under the License.
<groupId>org.apache.flink</groupId>
<artifactId>flink-formats</artifactId>
<version>1.15-SNAPSHOT</version>
- <relativePath>..</relativePath>
+ <relativePath>../pom.xml</relativePath>
Review comment:
> where is the reference to tell us that BulkFormat support streaming?
> Afaik, all javadocs about BulkFormat are only talking about batch, please refer
> to the javadoc of BulkFormat itself and the javadoc of FileSource.
In general, all formats should support both batch and streaming execution. For
an example showing that `BulkFormat`s also apply to streaming execution, take a
look at this docstring [1]. It mentions checkpoints and how the last
offset/position is tracked. Checkpointing is not supported in batch execution,
so that part of the contract only makes sense for streaming.
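
To make the point about position tracking concrete, here is a minimal sketch in
plain Java. It is not Flink code: `CheckpointedReaderSketch` and `Position` are
hypothetical names, and the "file" is just an in-memory list. It only shows the
idea from the docstring, assuming a checkpoint stores the start of the current
batch plus the number of records already emitted from that batch.

```java
import java.util.List;

/** Hypothetical illustration only; none of these types are Flink classes. */
class CheckpointedReaderSketch {

    /** What a checkpoint would store: batch start plus records emitted since then. */
    static final class Position {
        final int batchStart;
        final int recordsAfterStart;

        Position(int batchStart, int recordsAfterStart) {
            this.batchStart = batchStart;
            this.recordsAfterStart = recordsAfterStart;
        }
    }

    private final List<String> records; // stands in for the file content
    private final int batchSize;
    private int batchStart;
    private int emittedInBatch;

    /** Starts reading at the given position; use Position(0, 0) for a fresh read. */
    CheckpointedReaderSketch(List<String> records, int batchSize, Position restoreFrom) {
        this.records = records;
        this.batchSize = batchSize;
        this.batchStart = restoreFrom.batchStart;
        this.emittedInBatch = restoreFrom.recordsAfterStart;
    }

    /** Returns the next record, or null at the end of the input. */
    String next() {
        if (emittedInBatch == batchSize) {
            // Current batch is exhausted: move on to the next batch boundary.
            batchStart += batchSize;
            emittedInBatch = 0;
        }
        int index = batchStart + emittedInBatch;
        if (index >= records.size()) {
            return null;
        }
        emittedInBatch++;
        return records.get(index);
    }

    /** Position to store in a checkpoint so a restored reader resumes here. */
    Position snapshot() {
        return new Position(batchStart, emittedInBatch);
    }
}
```

A reader restored from `snapshot()` continues with the record that follows the
last one emitted before the checkpoint. A pure batch job never needs this,
because it restarts from the beginning of the split on failure.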
The difference between `BulkFormat` and `FileRecordFormat` is how the
underlying reader interacts with the filesystem. A `BulkFormat` usually reads
data in batches, e.g. the Parquet reader always reads blocks/row groups,
whereas a `FileRecordFormat` usually reads the file line by line.
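
As a rough illustration of that difference in read granularity (again plain
Java with hypothetical names, not the actual Flink interfaces), the two styles
could look like this, with an `Iterator` standing in for the underlying file:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration only; not the Flink BulkFormat/FileRecordFormat API. */
class ReadGranularitySketch {

    /** Record-at-a-time style: one record per call, null at the end of the input. */
    static <T> T readRecord(Iterator<T> source) {
        return source.hasNext() ? source.next() : null;
    }

    /** Batch style: up to batchSize records per call, empty list at the end of the input. */
    static <T> List<T> readBatch(Iterator<T> source, int batchSize) {
        List<T> batch = new ArrayList<>(batchSize);
        // Check the batch size first so no record is consumed and then dropped.
        while (batch.size() < batchSize && source.hasNext()) {
            batch.add(source.next());
        }
        return batch;
    }
}
```

The batch style only pays off when the reader knows where a batch starts and
ends in the file, which is exactly the block/row group information discussed
in this thread.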
After looking through the `AvroParquetReader` I think your assumption is
right. We cannot implement a bulk format here because the reader does not
expose any information about the underlying block/rowgroup structure.
I am still a bit unsure about the newly introduced `RecordFormat`. You have
only mentioned that we use the `StreamFormat` to support compression, but I
think the right way to support compression for ParquetAvro would be to
configure the reader with a codec factory.
```java
AvroParquetReader.<GenericRecord>builder(new ParquetInputFile(stream, fileLen))
        .withCodecFactory(...)
```
[1]
https://github.com/apache/flink/blob/34de7d1038f1078980cc539273b724ce7c85696a/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/reader/BulkFormat.java#L56