JingGe edited a comment on pull request #17520: URL: https://github.com/apache/flink/pull/17520#issuecomment-962947461
@tsreaper

> No, I haven't. But I can come up with one problem with this: some records may be large, for example JSON strings containing tens of thousands of characters (this is not rare in the production jobs I've seen so far). If we only control the **number** of records there is still a risk of overwhelming the memory. The other way is to control the actual size of each record, which requires a method to estimate the number of bytes in each record.

To make the discussion easier, we are talking about the benchmark data, whose records have almost the same size. For real cases, we can control the number of records dynamically by controlling the bytes read from the input stream, e.g. in each batchRead(), read 5 records when the records are big and 50 records when they are small.

> > how you controlled the `StreamFormat.FETCH_IO_SIZE`?
>
> The number of bytes read from the file is controlled by `StreamFormatAdapter.TrackingFsDataInputStream`. It is controlled by `source.file.stream.io-fetch-size`, whose default value is 1MB. However, there is no use in tuning this value because the Avro reader (I mean the reader from the Avro library) will read the whole block from the file. If the file size is 2MB it will consume 2MB of bytes and, according to the current logic of `StreamFormatAdapter`, deserialize all records from that block at once. I've tried changing that config option in the benchmark and it proves me right.

If you take a close look at the implementation of `TrackingFsDataInputStream`, you will see how it uses `StreamFormat.FETCH_IO_SIZE` to control how many records will be read/deserialized from the Avro block in each batchRead(). Anyway, the benchmark results tell us the truth. Thanks again for sharing them. We will do a deeper dive later to figure out why using `StreamFormat` has these memory issues.
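To illustrate the idea of capping a batch by a byte budget rather than by a record count, here is a minimal, self-contained sketch. It is not Flink's actual `StreamFormatAdapter` code: a counting stream wrapper plays the role of `TrackingFsDataInputStream`, and the batch loop stops deserializing once the bytes consumed in the current batch reach a `FETCH_IO_SIZE`-like threshold. The names `CountingInputStream`, `readBatch`, and `FETCH_SIZE_BYTES`, as well as the length-prefixed record format, are illustrative assumptions only.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: cap a batch by bytes consumed, not by record count. */
public class ByteBudgetBatchReader {

    /** Plays the role of TrackingFsDataInputStream: counts every byte read. */
    static final class CountingInputStream extends FilterInputStream {
        private long bytesRead;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                bytesRead++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                bytesRead += n;
            }
            return n;
        }

        long getBytesRead() {
            return bytesRead;
        }
    }

    /** Hypothetical per-batch byte budget, analogous to StreamFormat.FETCH_IO_SIZE (default 1MB). */
    static final long FETCH_SIZE_BYTES = 1024 * 1024;

    /**
     * Reads records until the byte budget for this batch is exhausted or the stream ends.
     * Records are assumed to be length-prefixed UTF-8 strings; a real format (e.g. Avro) differs.
     */
    static List<String> readBatch(CountingInputStream in) throws IOException {
        List<String> batch = new ArrayList<>();
        long batchStart = in.getBytesRead();
        DataInputStream data = new DataInputStream(in);
        while (in.getBytesRead() - batchStart < FETCH_SIZE_BYTES) {
            String record;
            try {
                record = data.readUTF(); // one "record"
            } catch (EOFException e) {
                break; // end of input
            }
            batch.add(record);
        }
        return batch;
    }
}
```

With a fixed byte budget like this, a batch of large records naturally contains fewer records and a batch of small records contains more, which is the dynamic record-count behavior described above.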
