[GitHub] [flink] tsreaper commented on pull request #17520: [FLINK-24565][avro] Port avro file format factory to BulkReaderFormatFactory

GitBox Tue, 19 Oct 2021 20:07:26 -0700


tsreaper commented on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-947280429



   @slinkydeveloper There are three reasons why I did not choose `StreamFormat`.
   1. The biggest concern is that `StreamFormatAdapter.Reader#readBatch` stores 
all results of a batch in heap memory. This is bad because avro is a format 
which supports compression, so you can never know how much data will be stuffed 
into heap memory after inflation.
   2. `StreamFormatAdapter` cuts batches by counting the number of bytes read 
from the file stream. If the sync size of avro is 2 MB, it will read 2 MB from 
the file in one go and produce a batch containing no records. However, this 
only happens at the beginning of reading a file, so this might be OK.
   3. Both the orc and parquet formats implement `BulkFormat` instead of 
`StreamFormat`. If `StreamFormat` is the preferred choice, why was it not used 
for them?
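
   To make points 1 and 2 concrete, here is a rough, self-contained sketch of 
what "cutting batches by bytes read from the file" implies. This is not Flink's 
actual `StreamFormatAdapter` code: `ByteCountedBatching`, `Block`, `Batch` and 
`readBatches` are hypothetical names, and the block sizes and compression ratio 
are made-up numbers chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model only; this is NOT Flink's StreamFormatAdapter implementation.
class ByteCountedBatching {

    /** Hypothetical chunk of a file: its size on disk and what it inflates to. */
    static final class Block {
        final long bytesOnDisk;
        final int recordCount;
        final long bytesPerRecord; // assumed size of one decoded record on the heap

        Block(long bytesOnDisk, int recordCount, long bytesPerRecord) {
            this.bytesOnDisk = bytesOnDisk;
            this.recordCount = recordCount;
            this.bytesPerRecord = bytesPerRecord;
        }
    }

    /** One simulated batch: how much was read from the file vs. held on the heap. */
    static final class Batch {
        final int records;
        final long fileBytesRead;
        final long heapBytes;

        Batch(int records, long fileBytesRead, long heapBytes) {
            this.records = records;
            this.fileBytesRead = fileBytesRead;
            this.heapBytes = heapBytes;
        }

        @Override
        public String toString() {
            return "Batch{records=" + records + ", fileBytesRead=" + fileBytesRead
                    + ", heapBytes=" + heapBytes + "}";
        }
    }

    /** Closes a batch once the bytes consumed from the file reach batchSizeBytes. */
    static List<Batch> readBatches(List<Block> blocks, long batchSizeBytes) {
        List<Batch> batches = new ArrayList<>();
        int records = 0;
        long fileBytes = 0;
        long heapBytes = 0;
        for (Block b : blocks) {
            fileBytes += b.bytesOnDisk;
            records += b.recordCount;
            // Decoded records are all buffered, so heap use follows the
            // inflated size, not the number of bytes read from the file.
            heapBytes += (long) b.recordCount * b.bytesPerRecord;
            if (fileBytes >= batchSizeBytes) {
                batches.add(new Batch(records, fileBytes, heapBytes));
                records = 0;
                fileBytes = 0;
                heapBytes = 0;
            }
        }
        if (fileBytes > 0) {
            batches.add(new Batch(records, fileBytes, heapBytes));
        }
        return batches;
    }

    public static void main(String[] args) {
        long mib = 1L << 20;
        List<Block> file = Arrays.asList(
                new Block(2 * mib, 0, 0),         // leading region that yields no records
                new Block(mib / 2, 100_000, 100), // compressed block, inflates roughly 20x
                new Block(mib / 2, 100_000, 100));
        for (Batch batch : readBatches(file, mib)) {
            System.out.println(batch);
        }
    }
}
```

   In this model the first batch is cut after 2 MiB of file bytes with zero 
records, and the second batch holds far more decoded data on the heap than the 
1 MiB that was read from the file. The byte threshold bounds file I/O per 
batch, not heap usage.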


