JingGe edited a comment on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-948313648
@tsreaper

> > For point 1, the uncompressed data size should be controlled by `StreamFormat.FETCH_IO_SIZE`. It might not be very precise to control the heap size, since the last read might overfulfil the quota a little bit, but it is acceptable.
>
> This is not the case. For example, xz compression comes with a compression ratio of ~15% (google "xz compression ratio" if you want to confirm). Note that avro can be represented both in JSON and in compact binary form, so you may expect a 6x inflation after uncompressing the data. It gets worse because Java objects always carry extra overhead, so this is not "overfulfil the quota a little bit".

Don't get me wrong, I am talking about the architecture design. "Overfulfil the quota a little bit" was said in the context of the last read; it has nothing to do with inflation. Speaking of inflation, the concrete implementation should use `StreamFormat.FETCH_IO_SIZE` to control the heap memory used for the uncompressed data. As an example, `AbstractLZ77CompressorInputStream#getSize()` returns the uncompressed size of the stream. This is not a reason to choose BulkFormat over StreamFormat.

> > `StreamFormatAdapter` has built-in compressor support. Does this PR implementation have the same support too?
>
> If you take a look at the implementation of `StreamFormatAdapter` you'll find that it supports decompression by calling `StandardDeCompressors#getDecompressorForFileName`, which determines the decompressor by the file extension. Avro files often end with `.avro`, so there will be no match.
>
> Also, avro files are compressed by blocks. Avro files contain their own magic numbers, specific headers and block splitters, which cannot be understood by the standard xz or bzip2 decompressors. You have to use the avro reader to interpret the file, and the avro reader will deal with all the work like decompression.

That's right. That is exactly a good reason to extend the decompression logic in the `StreamFormatAdapter` to fulfil the avro requirement. Software becomes robust in this way.

> > For point 2, `StreamFormat` defines a way to read each record.
>
> The problem is that you just cannot read one record at a time from an avro file stream. Avro readers read one **block** at a time from the file stream and store the inflated raw bytes in memory. For detailed code see my reply to @slinkydeveloper.

Again, I am talking about the architecture design. "Record" is an abstract concept; it does not mean an Avro record. You can control the granularity of the abstract "record". This is the beauty of OOD.
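To make the `FETCH_IO_SIZE` point concrete: a fetcher can keep reading abstract records until the byte quota is reached, accepting that the last read overshoots by at most one record or block. A minimal sketch; `QuotaBoundedFetcher`, `RecordReader` and `FETCH_IO_SIZE_BYTES` are illustrative names, not actual Flink API:

```java
import java.io.IOException;
import java.util.List;

/**
 * Minimal sketch of the fetch-size quota idea. All names here are
 * illustrative, not Flink API.
 */
class QuotaBoundedFetcher<T> {

    interface RecordReader<R> {
        /** Returns the next record, or null at end of input. */
        R read() throws IOException;

        /** Bytes consumed from the underlying stream so far. */
        long bytesConsumed();
    }

    private static final long FETCH_IO_SIZE_BYTES = 1024 * 1024; // assumed 1 MiB quota

    /** Reads records until the quota is reached; the last read may overshoot slightly. */
    void fetchOneBatch(RecordReader<T> reader, List<T> batch) throws IOException {
        long start = reader.bytesConsumed();
        T record;
        while ((record = reader.read()) != null) {
            batch.add(record);
            if (reader.bytesConsumed() - start >= FETCH_IO_SIZE_BYTES) {
                break; // quota reached; overshoot is bounded by one record/block
            }
        }
    }
}
```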
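On extending the decompression logic: the extension-based lookup could simply treat `.avro` as a pass-through and leave inflation to the avro reader itself. An illustrative sketch, not the actual `StandardDeCompressors` code (only `GZIPInputStream` is real JDK API here):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/**
 * Illustrative extension-based decompressor lookup, in the spirit of
 * StandardDeCompressors#getDecompressorForFileName. The registry itself
 * is hypothetical.
 */
final class DecompressorRegistry {

    interface Wrapper {
        InputStream wrap(InputStream in) throws IOException;
    }

    private static final Map<String, Wrapper> BY_EXTENSION = new HashMap<>();

    static {
        BY_EXTENSION.put("gz", GZIPInputStream::new);
        // Avro files decompress their own blocks, so the stream is passed
        // through untouched and the avro reader handles inflation itself.
        BY_EXTENSION.put("avro", in -> in);
    }

    static InputStream forFileName(String fileName, InputStream in) throws IOException {
        int dot = fileName.lastIndexOf('.');
        Wrapper wrapper = dot < 0 ? null : BY_EXTENSION.get(fileName.substring(dot + 1));
        return wrapper == null ? in : wrapper.wrap(in);
    }
}
```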
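And on the abstract "record": a reader can expose a plain per-record `read()` while Avro's `DataFileStream` inflates one block at a time underneath. A minimal sketch assuming Avro's `GenericRecord` API; the `AvroRecordReader` class itself is hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

/**
 * Sketch of a record-at-a-time reader in the StreamFormat spirit. Avro's
 * DataFileStream buffers and inflates one block internally, but the caller
 * still sees a per-record read() interface.
 */
final class AvroRecordReader implements AutoCloseable {

    private final DataFileStream<GenericRecord> stream;

    AvroRecordReader(InputStream in) throws IOException {
        // DataFileStream reads the avro header and decompresses blocks itself.
        this.stream = new DataFileStream<>(in, new GenericDatumReader<GenericRecord>());
    }

    /** Returns the next record, or null at end of file. */
    GenericRecord read() throws IOException {
        return stream.hasNext() ? stream.next() : null;
    }

    @Override
    public void close() throws IOException {
        stream.close();
    }
}
```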
