[GitHub] [flink] tsreaper edited a comment on pull request #17520: [FLINK-24565][avro] Port avro file format factory to BulkReaderFormatFactory

GitBox Thu, 21 Oct 2021 00:32:22 -0700


tsreaper edited a comment on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-948333809

@JingGe

> "record" is the abstract concept, it does not mean the record in avro.

Are you suggesting an avro `StreamFormat` which produces one avro block at a
time, instead of one Flink row data? If yes, we'll need another operator
after the source to break each block into several row data. Why not leave all
this work inside the source?
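
To illustrate what "leaving this work inside the source" could look like, here is a minimal sketch. `BlockFlatteningIterator` is a hypothetical name and is not part of Flink or avro. It assumes a block can be modelled as a `List` of already decoded records, and it turns a stream of such blocks into a stream of single records, so no extra operator is needed downstream.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper (not a Flink API): flattens decoded blocks into single records. */
final class BlockFlatteningIterator<T> implements Iterator<T> {
    private final Iterator<List<T>> blocks;
    // Iterator over the block currently being drained; starts out empty.
    private Iterator<T> current = Collections.emptyIterator();

    BlockFlatteningIterator(Iterator<List<T>> blocks) {
        this.blocks = blocks;
    }

    @Override
    public boolean hasNext() {
        // Skip exhausted (or empty) blocks until a record is found or input ends.
        while (!current.hasNext() && blocks.hasNext()) {
            current = blocks.next().iterator();
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

The caller only ever sees records, which is the same contract a row-producing source would expose.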

> "overfulfil the quota a little bit" has the context of "last read". This
> has nothing to do with the inflation.

I guess by "overfulfil the quota a little bit" you mean the number of bytes
read from the stream. That is true, but my concern is that
`StreamFormatAdapter.Reader` stores all the results of a batch in memory at
the same time (see [this
code](https://github.com/apache/flink/blob/99c2a415e9eeefafacf70762b6f54070f7911ceb/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/impl/StreamFormatAdapter.java#L202)
and also my reply to @slinkydeveloper). This might cause an OOM for a highly
compressed file, whose decoded records can take far more memory than the
bytes read.

One way to work around this is to create a `StreamFormatAdapter.Reader`
which uses iterators, but I guess this is another topic.
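
The difference between the two approaches can be sketched roughly as follows. This is not the actual `StreamFormatAdapter` code. `BatchReadSketch`, `readEagerly` and `readLazily` are hypothetical names, and the `decoder` function stands in for whatever turns the i-th entry of a batch into a record.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/** Hypothetical sketch (not Flink code) contrasting eager and lazy batch reads. */
final class BatchReadSketch {

    /** Eager: every record of the batch is decoded and held in memory at once. */
    static <T> List<T> readEagerly(int recordCount, IntFunction<T> decoder) {
        List<T> out = new ArrayList<>(recordCount);
        for (int i = 0; i < recordCount; i++) {
            out.add(decoder.apply(i));
        }
        return out;
    }

    /** Lazy: a record is decoded only when the caller asks for it. */
    static <T> Iterator<T> readLazily(int recordCount, IntFunction<T> decoder) {
        return new Iterator<T>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < recordCount;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return decoder.apply(next++);
            }
        };
    }
}
```

With the lazy variant, the number of decoded records alive at any time is decided by the consumer rather than by the batch size, which is what would keep a highly compressed file from blowing up the heap.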

> That is exactly a good reason to extend the decompression logic in the
> StreamFormatAdapter to fulfil the avro requirement. Software goes robust in
> this way.

Avro is not a compression algorithm or anything like that. It is a
row-oriented file format, and you can think of it as a normal file type like
.xls or .png. This is why it exists as a separate module in `flink-formats`,
instead of staying with the compressors in `StandardDeCompression` (you don't
want to put an xls resolver in that file, do you?). If you really did this,
you'd essentially be moving the whole avro module into
`StandardDeCompression`.

