wjzhou-ep commented on issue #41604: URL: https://github.com/apache/arrow/issues/41604#issuecomment-2111457796
@mapleFU Thank you for providing the alternative. The `buffer_size` parameter in `pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)` does what I was after. My points are:

1. It is not obvious from the documentation that, once the input is a `gz` file instead of a plain `csv` file, the `block_size` in `ReadOptions` no longer affects the read size. It is not easy to figure out, either: the symptom is just very slow reading, and I only realized what was happening when I inspected the traffic and saw 65 KB requests being issued. (A sketch of the workaround follows below.)
2. The default of 65 KB might be a little too small for today's computers; for comparison, when not reading a gz file, the default block size for CSV streaming is 1 MB: https://github.com/apache/arrow/blob/657c4faf21700c0899703a4759bde76235c38199/cpp/src/arrow/csv/options.h#L149
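
For anyone hitting the same slowdown, here is a minimal sketch of the workaround, assuming a hypothetical bucket, region, and path, and relying on the default `compression='detect'` to handle the gzip layer:

```python
import pyarrow.csv as csv
import pyarrow.fs as fs

# Hypothetical bucket/region; credentials are picked up from the environment.
pyarrow_s3fs = fs.S3FileSystem(region="us-east-1")
path_src = "my-bucket/data/big.csv.gz"

# buffer_size wraps the raw S3 stream in a buffered reader *beneath* the
# gzip decompressor, so each S3 GET fetches up to ~10 MB of compressed
# bytes instead of the 64 KiB default.
stream = pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)

# block_size only controls the size of the decompressed chunks handed to
# the parser; it does not change how much is read from S3 per request.
table = csv.read_csv(stream, read_options=csv.ReadOptions(block_size=1 << 20))
```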
