tsreaper edited a comment on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-948202218


   @JingGe 
   
   > For point 1, the uncompressed data size should be controlled by 
`StreamFormat.FETCH_IO_SIZE`. It might not be very precise to control the heap 
size, since the last read might overfulfil the quota a little bit, but it is 
acceptable.
   
   This is not the case. For example, xz typically compresses data to roughly 
15% of its original size (search for "xz compression ratio" if you want to 
confirm). Note that Avro data can be represented both as JSON and in a compact 
binary form, so you can expect about a 6x inflation after decompressing the 
data. It gets worse once the records are deserialized, because Java objects 
always carry extra overhead. This is far more than "overfulfil the quota a 
little bit".
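
   To make the arithmetic concrete, here is a rough sketch. The class name, 
the 1 MiB fetch size and the 0.15 ratio are assumptions for illustration only, 
not values taken from this PR.

```java
public class InflationEstimate {

    /**
     * Estimates how many bytes come out when {@code compressedBytes} are
     * decompressed, given compressedSize / uncompressedSize = compressionRatio.
     */
    static long uncompressedBytes(long compressedBytes, double compressionRatio) {
        return Math.round(compressedBytes / compressionRatio);
    }

    public static void main(String[] args) {
        long fetchSize = 1L << 20; // hypothetical fetch size: 1 MiB of compressed input
        double ratio = 0.15;       // assumed xz ratio, as discussed above
        long inflated = uncompressedBytes(fetchSize, ratio);
        System.out.println(fetchSize + " compressed bytes -> about " + inflated
                + " uncompressed bytes, before any Java object overhead");
    }
}
```

   So a limit applied to the compressed stream bounds the heap only up to the 
inverse of the compression ratio, and object overhead comes on top of that.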
   
   > `StreamFormatAdapter` has built-in compressors support. Does this PR 
implementation have the same support too?
   
   If you take a look at the implementation of `StreamFormatAdapter`, you'll 
find that it supports decompression by calling 
`StandardDeCompressors#getDecompressorForFileName`, which picks the 
decompressor from the file extension. Avro files usually end with `.avro`, so 
there will be no match.
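
   A simplified stand-in for that kind of extension lookup is sketched below. 
This is not Flink's code: the class, the method and the suffix set are 
hypothetical and only illustrate why a `.avro` file name finds no decompressor.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

public class SuffixLookupSketch {

    // Illustrative suffix set; the real registry may contain different entries.
    static final Set<String> KNOWN_SUFFIXES =
            new HashSet<>(Arrays.asList("deflate", "gz", "gzip", "bz2", "xz"));

    /** Returns the matching suffix, or empty if the extension is unknown. */
    static Optional<String> codecForFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return KNOWN_SUFFIXES.contains(ext) ? Optional.of(ext) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(codecForFileName("events.json.xz")); // matches "xz"
        System.out.println(codecForFileName("events.avro"));    // no match
    }
}
```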
   
   Also, Avro files are compressed block by block. An Avro file has its own 
magic number, a specific header and sync markers between blocks, none of which 
a standard xz or bzip2 decompressor can understand. You have to use the Avro 
reader to interpret the file, and the Avro reader takes care of everything, 
including decompression.
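
   As a small illustration of the mismatch: an Avro container file starts with 
the bytes `Obj` followed by `0x01`, while an xz stream starts with 
`FD 37 7A 58 5A 00`. The helper below is a hypothetical sketch that only 
compares leading bytes; it is not how Flink or Avro detect formats.

```java
import java.util.Arrays;

public class MagicBytesSketch {

    // Leading bytes of an Avro object container file: 'O', 'b', 'j', 1.
    static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};

    // Leading bytes of an xz stream: FD '7' 'z' 'X' 'Z' 00.
    static final byte[] XZ_MAGIC = {(byte) 0xFD, '7', 'z', 'X', 'Z', 0};

    /** Returns true if {@code data} begins with the given magic bytes. */
    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    public static void main(String[] args) {
        // First bytes of a made-up Avro file; the rest of the header is omitted.
        byte[] head = {'O', 'b', 'j', 1, 0, 0};
        System.out.println("looks like Avro: " + startsWith(head, AVRO_MAGIC));
        System.out.println("looks like xz:   " + startsWith(head, XZ_MAGIC));
    }
}
```

   An xz decompressor pointed at the start of such a file would reject it, even 
when the blocks inside were written with the xz codec.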
   
   > For point 2, `StreamFormat` defines a way to read each record.
   
   The problem is that you simply cannot read one record at a time from an 
Avro file stream. Avro readers read one **block** at a time from the file 
stream and keep the inflated raw bytes in memory. For the detailed code, see 
my reply to @slinkydeveloper.
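
   The effect can be mimicked with a toy model that uses only the JDK. This is 
not Avro's file format or its reader: the class, the newline-separated 
"records" and the use of `java.util.zip` are all assumptions, chosen to show 
that a reader which inflates per block holds the whole uncompressed block in 
memory before it hands out the first record.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BlockReadSketch {

    /** Compresses newline-separated records into one toy "block". */
    static byte[] writeBlock(List<String> records) {
        byte[] raw = String.join("\n", records).getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates the entire block into memory, as a block-oriented reader would. */
    static byte[] inflateBlock(byte[] block) {
        Inflater inflater = new Inflater();
        inflater.setInput(block);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or invalid block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Records become available only after the whole block has been inflated. */
    static List<String> readRecords(byte[] block) {
        String raw = new String(inflateBlock(block), StandardCharsets.UTF_8);
        return new ArrayList<>(Arrays.asList(raw.split("\n")));
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add("{\"id\": " + i + ", \"name\": \"user\"}");
        }
        byte[] block = writeBlock(records);
        System.out.println("bytes read from the stream: " + block.length);
        System.out.println("bytes held after inflating: " + inflateBlock(block).length);
        System.out.println("first record: " + readRecords(block).get(0));
    }
}
```

   In this model, a size limit on the bytes pulled from the stream says little 
about how much memory the inflated block occupies.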

