JingGe commented on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-947822402


   > @slinkydeveloper There are four reasons why I did not choose 
`StreamFormat`.
   > 
   > 1. The biggest concern is that `StreamFormatAdapter.Reader#readBatch` 
stores all results in a batch in heap memory. This is bad because avro is a 
format which supports compression. You'll never know how much data will be 
stuffed into heap memory after inflation.
   > 2. `StreamFormat`, from its concept, is for a stream of bytes where each 
record is shipped independently. Avro is a file format which organizes the 
records in its own blocks, so they do not match from the concept. I would say 
csv format will be more suitable for `StreamFormat`.
   > 3. `StreamFormatAdapter` cuts batches by counting number of bytes read 
from the file stream. If the sync size of avro is 2MB it will read 2M bytes 
from file in one go and produce a batch containing no records. However this 
only happens at the beginning of reading a file so this might be OK.
   > 4. Both orc and parquet formats have implemented `BulkFormat` instead of 
`StreamFormat`, so why not `StreamFormat` for them?
   
   The consideration behind your solution was great! Thanks for your 
contribution. Let's discuss and understand the design together. Correct me if I 
am wrong.
   
   For point 1, the uncompressed data size should be controlled by 
`StreamFormat.FETCH_IO_SIZE`. It might not be very precise to control the heap 
size, since the last read might overfulfil the quota a little bit, but it is 
acceptable. Speaking of compression, StreamFormatAdapter has built-in 
compressors support. Does this PR implementation have the same support too?
   
   For point 2, StreamFormat defines a way to read each record. The granularity 
of each record could be controlled by the generic type 
`StreamFormat.Reader<T>`. There is plenty room to play if single avro record is 
too small in this case.
   
   For point 4, it is a good question, we should deep dive into the code. Might 
It make sense to refactor the orc and parquet formats to StreamFormat too?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to