tsreaper commented on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-947576179


   @slinkydeveloper 
   
   > The biggest concern is that StreamFormatAdapter.Reader#readBatch stores 
all results in a batch in heap memory.
   
   See [this 
code](https://github.com/apache/flink/blob/99c2a415e9eeefafacf70762b6f54070f7911ceb/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/impl/StreamFormatAdapter.java#L202).
 `StreamFormatAdapter.Reader#readBatch` stores all results in the current batch 
in an `ArrayList`.
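
   To make the concern concrete, here is a minimal sketch (not the actual Flink code; `readBatch` and `BatchMaterializer` are illustrative names) of the ArrayList-based batching pattern: every record of the batch is copied onto the heap before the batch is handed to the caller.

   ```java
   import java.util.ArrayList;
   import java.util.Iterator;
   import java.util.List;

   // Simplified model of StreamFormatAdapter.Reader#readBatch:
   // the whole batch is materialized in a heap-resident ArrayList.
   class BatchMaterializer {
       // Drains up to maxRecords from the source into a list.
       // Every record stays on the heap until the batch is released.
       static <T> List<T> readBatch(Iterator<T> source, int maxRecords) {
           final ArrayList<T> batch = new ArrayList<>();
           while (batch.size() < maxRecords && source.hasNext()) {
               batch.add(source.next());
           }
           return batch;
       }
   }
   ```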
   
   On the other hand, this PR implements a `BulkFormat.Reader` that stores only 
the iterator, not the actual results.
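
   The iterator-based approach can be sketched like this (illustrative names, not the PR's exact code): the reader holds only the underlying iterator and pulls one record per call, so no per-batch `ArrayList` is ever built.

   ```java
   import java.util.Iterator;

   // Sketch of a pull-based reader in the spirit of BulkFormat.Reader:
   // it wraps an iterator (e.g. backed by an Avro DataFileStream) and
   // serves one record per read() call instead of materializing a batch.
   class IteratorBackedReader<T> {
       private final Iterator<T> records;

       IteratorBackedReader(Iterator<T> records) {
           this.records = records;
       }

       // Returns the next record, or null at end of input.
       T read() {
           return records.hasNext() ? records.next() : null;
       }
   }
   ```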
   
   > This is bad because avro is a format which supports compression.
   > But Avro has own batch and compression logical.
   
   See [this 
code](https://github.com/apache/avro/blob/42822886c28ea74a744abb7e7a80a942c540faa5/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java#L213).
 This PR uses the Avro reader as if all records were read from the stream one by 
one. That is not what actually happens, because Avro wraps the decompression 
logic inside its own reader (it reads a whole block at a time and decompresses 
it in memory). We're reading records one by one from the **reader**, not from 
the **stream**. `StreamFormatAdapter.Reader` does not know anything about the 
Avro reader; its only concern is the number of bytes read from the **raw** 
stream.
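
   A toy model of that behavior (this is not Avro's actual API; the block-as-list representation is an assumption for illustration): the reader consumes the raw input one whole block at a time, expands the block into memory, and then hands records out one by one, so the consumer never observes block boundaries.

   ```java
   import java.util.ArrayDeque;
   import java.util.Deque;
   import java.util.Iterator;
   import java.util.List;

   // Illustrative model of block-wise reading: each element of rawBlocks
   // stands in for one compressed block. hasNext() "decompresses" a whole
   // block into memory at once; next() serves records from that buffer.
   class BlockedReader<T> implements Iterator<T> {
       private final Iterator<List<T>> rawBlocks;
       private final Deque<T> currentBlock = new ArrayDeque<>();

       BlockedReader(Iterator<List<T>> rawBlocks) {
           this.rawBlocks = rawBlocks;
       }

       @Override
       public boolean hasNext() {
           while (currentBlock.isEmpty() && rawBlocks.hasNext()) {
               // load and expand the next whole block in one step
               currentBlock.addAll(rawBlocks.next());
           }
           return !currentBlock.isEmpty();
       }

       @Override
       public T next() {
           hasNext(); // ensure a block is buffered
           return currentBlock.poll();
       }
   }
   ```

   The consumer iterates record by record, while the raw input advances a block at a time, which is why counting bytes read from the raw stream says nothing about how many records are already buffered in memory.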
   
   One might argue that the Avro reader also stores decompression results in 
memory and is not doing anything fancy. Yes, it does. But by deserializing those 
bytes and converting them into Java objects we're doubling the memory cost, and 
Java object overhead can make it even worse.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
