tsreaper edited a comment on pull request #17520: URL: https://github.com/apache/flink/pull/17520#issuecomment-948202218
@JingGe

> For point 1, the uncompressed data size should be controlled by `StreamFormat.FETCH_IO_SIZE`. It might not be very precise to control the heap size, since the last read might overfulfil the quota a little bit, but it is acceptable.

This is not the case. For example, xz compression comes with a compression ratio of ~15% (google "xz compression ratio" if you want to confirm). Note that Avro can be represented both in JSON and in compact binary form, so you may expect a 6x inflation after uncompressing the data. It gets worse because Java objects always come with extra overhead, and this is not "overfulfil the quota a little bit".

> `StreamFormatAdapter` has built-in compressors support. Does this PR implementation have the same support too?

If you take a look at the implementation of `StreamFormatAdapter`, you'll find that it supports decompression by calling `StandardDeCompression#getDecompressorForFileName`, which determines the decompressor from the file extension. Avro files often end with `.avro`, so there will be no match. Also, Avro files are compressed by block. They contain their own magic numbers, specific headers and block splitters, which cannot be understood by the standard xz or bzip2 decompressors. You have to use the Avro reader to interpret the file, and the Avro reader handles all the work such as decompression.

> For point 2, `StreamFormat` defines a way to read each record.

The problem is that you just cannot read one record at a time from an Avro file stream. Avro readers read one **block** at a time from the file stream and store the inflated raw bytes in memory. For detailed code see my reply to @slinkydeveloper.
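To illustrate the file-extension point, here is a minimal sketch of an extension-based decompressor lookup. It is not Flink's actual code: the class `ExtensionLookupSketch`, the method `decompressorFor`, and the extension table are all hypothetical and only mimic the general idea of choosing a decompressor from the file name.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class ExtensionLookupSketch {

    // Hypothetical extension table for illustration only; not Flink's real registry.
    private static final Map<String, String> DECOMPRESSORS = Map.of(
            "gz", "gzip",
            "bz2", "bzip2",
            "xz", "xz");

    /** Returns the decompressor name registered for the file's extension, if any. */
    public static Optional<String> decompressorFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension to match on
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(DECOMPRESSORS.get(extension));
    }

    public static void main(String[] args) {
        System.out.println(decompressorFor("events.json.xz")); // prints Optional[xz]
        System.out.println(decompressorFor("events.avro"));    // prints Optional.empty
    }
}
```

Under these assumptions, a file named `events.avro` never reaches a decompressor, even if its blocks are xz-compressed internally, because the codec is recorded inside the Avro container rather than in the file name.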
