JingGe edited a comment on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-948313648
@tsreaper

> > For point 1, the uncompressed data size should be controlled by `StreamFormat.FETCH_IO_SIZE`. It might not be very precise to control the heap size, since the last read might overfulfil the quota a little bit, but it is acceptable.
>
> This is not the case. For example, xz compression comes with a compression ratio of ~15% (google "xz compression ratio" if you want to confirm). Note that avro can be represented both in JSON and in compact binary form, so you may expect a 6x inflation after uncompressing the data. It gets worse because Java objects always carry extra overhead, so this is not "overfulfil the quota a little bit".

Don't get me wrong, I am talking about the architecture design. "Overfulfil the quota a little bit" was said in the context of the last read; it has nothing to do with inflation. Speaking of inflation, the concrete implementation should use `StreamFormat.FETCH_IO_SIZE` to control the heap memory used for the uncompressed data. As an example, `AbstractLZ77CompressorInputStream#getSize()` returns the uncompressed size of the stream. This is not a reason to choose BulkFormat over StreamFormat.

> > `StreamFormatAdapter` has built-in compressor support. Does this PR implementation have the same support too?
>
> If you take a look at the implementation of `StreamFormatAdapter` you'll find that it supports decompression by calling `StandardDeCompressors#getDecompressorForFileName`, which determines the decompressor by the file extension. Avro files often end with `.avro`, so there will be no match.
>
> Also, avro files are compressed by blocks. Avro files contain their own magic numbers, specific headers and block splitters, which cannot be understood by the standard xz or bzip2 decompressors. You have to use the avro reader to interpret the file, and the avro reader will deal with all the work like decompression.

That's right. That is exactly a good reason to extend the decompression logic in the `StreamFormatAdapter` to fulfil the avro requirement. Software becomes robust in this way.

> > For point 2, `StreamFormat` defines a way to read each record.
>
> The problem is that you just cannot read one record at a time from an avro file stream. Avro readers read one **block** at a time from the file stream and store the inflated raw bytes in memory. For detailed code see my reply to @slinkydeveloper.

Again, I am talking about the architecture design. "Record" is an abstract concept; it does not mean an Avro record. You can control the granularity of the abstract "record". This is the beauty of OOD.
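To make the `FETCH_IO_SIZE` point concrete: a fetcher can keep reading abstract records until the byte quota is reached, accepting that the last read overshoots by at most one record or block. A minimal sketch; `QuotaBoundedFetcher`, `RecordReader` and `FETCH_IO_SIZE_BYTES` are illustrative names, not actual Flink API:

```java
import java.io.IOException;
import java.util.List;

/**
 * Minimal sketch of the fetch-size quota idea. All names here are
 * illustrative, not Flink API.
 */
class QuotaBoundedFetcher<T> {

    interface RecordReader<R> {
        /** Returns the next record, or null at end of input. */
        R read() throws IOException;

        /** Bytes consumed from the underlying stream so far. */
        long bytesConsumed();
    }

    private static final long FETCH_IO_SIZE_BYTES = 1024 * 1024; // assumed 1 MiB quota

    /** Reads records until the quota is reached; the last read may overshoot slightly. */
    void fetchOneBatch(RecordReader<T> reader, List<T> batch) throws IOException {
        long start = reader.bytesConsumed();
        T record;
        while ((record = reader.read()) != null) {
            batch.add(record);
            if (reader.bytesConsumed() - start >= FETCH_IO_SIZE_BYTES) {
                break; // quota reached; overshoot is bounded by one record/block
            }
        }
    }
}
```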
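On extending the decompression logic: the extension-based lookup could simply treat `.avro` as a pass-through and leave inflation to the avro reader itself. An illustrative sketch, not the actual `StandardDeCompressors` code (only `GZIPInputStream` is real JDK API here):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/**
 * Illustrative extension-based decompressor lookup, in the spirit of
 * StandardDeCompressors#getDecompressorForFileName. The registry itself
 * is hypothetical.
 */
final class DecompressorRegistry {

    interface Wrapper {
        InputStream wrap(InputStream in) throws IOException;
    }

    private static final Map<String, Wrapper> BY_EXTENSION = new HashMap<>();

    static {
        BY_EXTENSION.put("gz", GZIPInputStream::new);
        // Avro files decompress their own blocks, so the stream is passed
        // through untouched and the avro reader handles inflation itself.
        BY_EXTENSION.put("avro", in -> in);
    }

    static InputStream forFileName(String fileName, InputStream in) throws IOException {
        int dot = fileName.lastIndexOf('.');
        Wrapper wrapper = dot < 0 ? null : BY_EXTENSION.get(fileName.substring(dot + 1));
        return wrapper == null ? in : wrapper.wrap(in);
    }
}
```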
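And on the abstract "record": a reader can expose a plain per-record `read()` while Avro's `DataFileStream` inflates one block at a time underneath. A minimal sketch assuming Avro's `GenericRecord` API; the `AvroRecordReader` class itself is hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

/**
 * Sketch of a record-at-a-time reader in the StreamFormat spirit. Avro's
 * DataFileStream buffers and inflates one block internally, but the caller
 * still sees a per-record read() interface.
 */
final class AvroRecordReader implements AutoCloseable {

    private final DataFileStream<GenericRecord> stream;

    AvroRecordReader(InputStream in) throws IOException {
        // DataFileStream reads the avro header and decompresses blocks itself.
        this.stream = new DataFileStream<>(in, new GenericDatumReader<GenericRecord>());
    }

    /** Returns the next record, or null at end of file. */
    GenericRecord read() throws IOException {
        return stream.hasNext() ? stream.next() : null;
    }

    @Override
    public void close() throws IOException {
        stream.close();
    }
}
```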
