pitrou commented on issue #41604:
URL: https://github.com/apache/arrow/issues/41604#issuecomment-2111974970
The bottom line is that the fixed `CompressedInputStream::kChunkSize` is
inadequate for some input file backends (such as S3), where per-request
latency makes small reads expensive.
I think we should add a new `InputStream` method that advertises a preferred
chunk size:
```c++
/// \brief Return the preferred chunk size for reading at least `nbytes`
///
/// Different file backends have different performance characteristics
/// (especially on the latency / bandwidth spectrum).
/// This method informs the caller of a well-performing physical read size
/// for the given logical read size.
///
/// Implementations of this method are free to ignore the input `nbytes`
/// when computing the return value. The return value may be smaller than,
/// larger than, or equal to the input value.
///
/// This method should be deterministic: multiple calls on the same object
/// with the same input argument will return the same value. Therefore,
/// calling it once on a given file should be sufficient.
///
/// There are two ways for callers to use this method:
/// 1) callers which support readahead into an internal buffer will
/// use the return value as their internal buffer's size;
/// 2) callers which require an exact read size will use the return value as
///    an advisory chunk size when reading.
///
/// \param[in] nbytes the logical number of bytes desired by the caller
/// \return an advisory physical chunk size, in bytes
virtual int64_t preferred_read_size(int64_t nbytes) const;
```
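For illustration, here is a minimal standalone sketch (a toy hierarchy, not Arrow's actual classes) of how a plausible default and a high-latency backend such as S3 might implement this; the class names and size constants below are assumptions:
```c++
#include <algorithm>
#include <cstdint>

class InputStream {
 public:
  virtual ~InputStream() = default;

  // Plausible default: honor the caller's logical size, with a small floor
  // so that tiny reads are not issued one at a time.
  virtual int64_t preferred_read_size(int64_t nbytes) const {
    constexpr int64_t kDefaultChunkSize = 64 * 1024;  // hypothetical value
    return std::max(nbytes, kDefaultChunkSize);
  }
};

// An S3-like backend, where every request pays a high fixed latency, would
// advertise a much larger floor and effectively ignore small `nbytes`.
class S3LikeInputStream : public InputStream {
 public:
  int64_t preferred_read_size(int64_t nbytes) const override {
    constexpr int64_t kMinRequestSize = 8 * 1024 * 1024;  // hypothetical value
    return std::max(nbytes, kMinRequestSize);
  }
};
```
Both implementations are pure functions of `nbytes`, which satisfies the determinism requirement above.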
Then the `CompressedInputStream` implementation can call
`preferred_read_size` on its underlying input stream to decide its compressed
buffer size, for example as sketched below.
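A hedged sketch of that call site, assuming the proposed method has been added to `arrow::io::InputStream`; the helper `MakeCompressedBuffer` and its signature are hypothetical, not the actual Arrow implementation:
```c++
#include <cstdint>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/memory_pool.h"
#include "arrow/result.h"

// Size the compressed read buffer by asking the underlying stream for a
// well-performing physical read size, seeded with the current fixed default
// (e.g. CompressedInputStream::kChunkSize).
arrow::Result<std::unique_ptr<arrow::ResizableBuffer>> MakeCompressedBuffer(
    const arrow::io::InputStream& raw, int64_t default_chunk_size,
    arrow::MemoryPool* pool) {
  const int64_t buffer_size = raw.preferred_read_size(default_chunk_size);
  return arrow::AllocateResizableBuffer(buffer_size, pool);
}
```
Since `preferred_read_size` is deterministic, a single call at construction time is enough to size the buffer for the stream's lifetime.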
@felipecrv What do you think?