tsreaper commented on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-947576179


   @slinkydeveloper 
   
   > The biggest concern is that StreamFormatAdapter.Reader#readBatch stores 
all results in a batch in heap memory.
   
   See [this 
code](https://github.com/apache/flink/blob/99c2a415e9eeefafacf70762b6f54070f7911ceb/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/impl/StreamFormatAdapter.java#L202).
 `StreamFormatAdapter.Reader#readBatch` stores all results in the current batch 
in an `ArrayList`.
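
   To make the concern concrete, here is a minimal sketch (not the actual Flink code; `readBatch` and `BatchMaterializer` are illustrative names) of the ArrayList-based batching pattern: every record of the batch is copied onto the heap before the batch is handed to the caller.

   ```java
   import java.util.ArrayList;
   import java.util.Iterator;
   import java.util.List;

   // Simplified model of StreamFormatAdapter.Reader#readBatch:
   // the whole batch is materialized in a heap-resident ArrayList.
   class BatchMaterializer {
       // Drains up to maxRecords from the source into a list.
       // Every record stays on the heap until the batch is released.
       static <T> List<T> readBatch(Iterator<T> source, int maxRecords) {
           final ArrayList<T> batch = new ArrayList<>();
           while (batch.size() < maxRecords && source.hasNext()) {
               batch.add(source.next());
           }
           return batch;
       }
   }
   ```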
   
   On the other hand, this PR implements a `BulkFormat.Reader` that stores only 
the iterator, not the actual results.
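
   The iterator-based approach can be sketched like this (illustrative names, not the PR's exact code): the reader holds only the underlying iterator and pulls one record per call, so no per-batch `ArrayList` is ever built.

   ```java
   import java.util.Iterator;

   // Sketch of a pull-based reader in the spirit of BulkFormat.Reader:
   // it wraps an iterator (e.g. backed by an Avro DataFileStream) and
   // serves one record per read() call instead of materializing a batch.
   class IteratorBackedReader<T> {
       private final Iterator<T> records;

       IteratorBackedReader(Iterator<T> records) {
           this.records = records;
       }

       // Returns the next record, or null at end of input.
       T read() {
           return records.hasNext() ? records.next() : null;
       }
   }
   ```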
   
   > This is bad because avro is a format which supports compression.
   > But Avro has own batch and compression logical.
   
   See [this 
code](https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java#L213).
 This PR uses the Avro reader as if all records were read from the stream one by 
one. That is not what actually happens, because Avro wraps the decompression 
logic inside its own reader (it reads a whole block at a time and decompresses 
it in memory). We're reading records one by one from the **reader**, not from 
the **stream**. `StreamFormatAdapter.Reader` does not know anything about the 
Avro reader; its only concern is the number of bytes read from the **raw** 
stream.
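
   A toy model of that behavior (this is not Avro's actual API; the block-as-list representation is an assumption for illustration): the reader consumes the raw input one whole block at a time, expands the block into memory, and then hands records out one by one, so the consumer never observes block boundaries.

   ```java
   import java.util.ArrayDeque;
   import java.util.Deque;
   import java.util.Iterator;
   import java.util.List;

   // Illustrative model of block-wise reading: each element of rawBlocks
   // stands in for one compressed block. hasNext() "decompresses" a whole
   // block into memory at once; next() serves records from that buffer.
   class BlockedReader<T> implements Iterator<T> {
       private final Iterator<List<T>> rawBlocks;
       private final Deque<T> currentBlock = new ArrayDeque<>();

       BlockedReader(Iterator<List<T>> rawBlocks) {
           this.rawBlocks = rawBlocks;
       }

       @Override
       public boolean hasNext() {
           while (currentBlock.isEmpty() && rawBlocks.hasNext()) {
               // load and expand the next whole block in one step
               currentBlock.addAll(rawBlocks.next());
           }
           return !currentBlock.isEmpty();
       }

       @Override
       public T next() {
           hasNext(); // ensure a block is buffered
           return currentBlock.poll();
       }
   }
   ```

   The consumer iterates record by record, while the raw input advances a block at a time, which is why counting bytes read from the raw stream says nothing about how many records are already buffered in memory.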
   
   One might argue that the Avro reader also stores decompression results in 
memory and is not doing anything fancy. Yes, it does. But by deserializing those 
bytes and converting them into Java objects we're doubling the memory cost, and 
Java object overhead can make it even worse.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
