tsreaper edited a comment on pull request #17520: URL: https://github.com/apache/flink/pull/17520#issuecomment-961764970
@JingGe

> Have you tried to control the number of records each batchRead() will fetch instead of fetch all records of the current block in one shot?

No I haven't, but I can think of two problems with that approach:

1. Some records may be large, for example JSON strings containing tens of thousands of characters (this is not rare in the production jobs I've seen so far). If we only control the **number** of records there is still a risk of overwhelming the memory. The alternative is to control the actual byte size of each batch, which requires a way to estimate the number of bytes in each record (a rough sketch of such a size-capped batch loop is at the end of this comment).
2. The reader must be kept open until the whole block is deserialized. If we only deserialize a portion of a block in each batch, we still need that block pool to prevent the reader from being closed too early.

> how you controlled the `StreamFormat.FETCH_IO_SIZE`?

The number of bytes read from the file is tracked by `StreamFormatAdapter.TrackingFsDataInputStream` and controlled by `source.file.stream.io-fetch-size`, whose default value is 1MB. However, there is no use in tuning this value, because the Avro reader (I mean the reader from the Avro library) reads the whole block from the file. If the file size is 2MB it will consume 2MB of bytes and, according to the current logic of `StreamFormatAdapter`, deserialize all records from that block at once. I've tried changing that config option in the benchmark and the results confirm this.
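For illustration, here is a minimal, hypothetical sketch of what a size-capped batch loop could look like. It is not the code in this PR and not an existing Flink or Avro API: `SizeCappedBatcher`, `estimateRecordBytes` and `batchByteBudget` are names made up for this sketch, and the per-record size estimation is exactly the missing piece mentioned in point 1 above.

```java
// Hypothetical sketch only -- not part of this PR or of Flink's StreamFormat API.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.ToLongFunction;

class SizeCappedBatcher<T> {
    private final long batchByteBudget; // stop the batch once this many estimated bytes are collected

    SizeCappedBatcher(long batchByteBudget) {
        this.batchByteBudget = batchByteBudget;
    }

    /**
     * Drains records from the current block until the byte budget is reached.
     * The caller must keep the underlying reader (and its block) open until
     * the iterator is exhausted -- which is problem 2 above.
     */
    List<T> nextBatch(Iterator<T> blockRecords, ToLongFunction<T> estimateRecordBytes) {
        List<T> batch = new ArrayList<>();
        long bytesSoFar = 0L;
        while (bytesSoFar < batchByteBudget && blockRecords.hasNext()) {
            T record = blockRecords.next();
            // Per-record size estimation is the hard part: a fixed record count
            // does not protect against huge records such as long JSON strings.
            bytesSoFar += estimateRecordBytes.applyAsLong(record);
            batch.add(record);
        }
        return batch;
    }
}
```

Whether the estimate comes from the decoder or from a heuristic is an open question; the point is only that a byte budget, not a record count, is what keeps memory bounded.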
