wjzhou commented on issue #41604: URL: https://github.com/apache/arrow/issues/41604#issuecomment-2111562353
Setting `buffer_size=10_000_000` explicitly in `pyarrow_s3fs.open_input_stream` works; the default is just `None`:
https://github.com/apache/arrow/blob/657c4faf21700c0899703a4759bde76235c38199/python/pyarrow/_fs.pyx#L744-L746

Actually, after thinking about it, I believe the real problem is that when `CompressedInputStream` is in use, it converts a single upper-layer `Read(1_000_000, out)` into multiple lower-layer calls of `Read(64 * 1024, out)`, e.g.:

```cpp
Result<int64_t> Read(int64_t nbytes, void* out) {
  ...
  while (nbytes - total_read > 0 && decompressor_has_data) {
    ...
    ARROW_ASSIGN_OR_RAISE(decompressor_has_data, RefillDecompressed());
    ...
  }
}

Result<bool> RefillDecompressed() {
  ...
  RETURN_NOT_OK(EnsureCompressedData());
  ...
}

Status EnsureCompressedData() {
  ...
  raw_->Read(kChunkSize, compressed_for_non_zero_copy_->mutable_data_as<void>()));
  // where kChunkSize is 64 KiB
  ...
}
```

https://github.com/apache/arrow/blob/657c4faf21700c0899703a4759bde76235c38199/cpp/src/arrow/io/compressed.cc#L388C2-L408C4

> Or maybe try again with a input stream wrapped with a buffer_size:

Once the reason for the problem is known, adding a `BufferedInputStream` is a good solution. But for a new pyarrow user it is hard to know that `pa.BufferedInputStream` is needed. For example, the following code is perfectly fine as long as `path_src` does not end with `.gz` etc.; it issues a single read:

```python
path_src = "file.csv"
with pyarrow_s3fs.open_input_stream(path_src) as f:
    buff = f.read()
```

But if `path_src = "file.csv.gz"`, we need to add the `BufferedInputStream` wrapper to prevent the 64 KiB ranged reads (see the sketches at the end of this comment).
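For reference, here is a minimal sketch of the explicit `buffer_size` workaround; the region, bucket/path, and the `10_000_000` value are placeholders for illustration only:

```python
import pyarrow.fs as fs

# Hypothetical filesystem and path, for illustration only.
pyarrow_s3fs = fs.S3FileSystem(region="us-east-1")
path_src = "my-bucket/file.csv.gz"

# Passing buffer_size wraps the raw S3 stream in a BufferedInputStream
# *before* the CompressedInputStream, so decompression no longer issues
# 64 KiB ranged reads against S3.
with pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000) as f:
    buff = f.read()  # decompressed contents
```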

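And a sketch of doing the wrapping manually for the `.gz` case, which roughly reproduces what `open_input_stream` does internally when `buffer_size` is set (same hypothetical filesystem and path as above):

```python
import pyarrow as pa
import pyarrow.fs as fs

# Same hypothetical filesystem and path as in the previous sketch.
pyarrow_s3fs = fs.S3FileSystem(region="us-east-1")
path_src = "my-bucket/file.csv.gz"

# Open the raw (still compressed) bytes without automatic decompression.
raw = pyarrow_s3fs.open_input_stream(path_src, compression=None)

# Buffer the raw stream first, then decompress on top of the buffer, so
# CompressedInputStream's 64 KiB chunk reads are served from memory, not S3.
buffered = pa.BufferedInputStream(raw, 10_000_000)
with pa.CompressedInputStream(buffered, "gzip") as f:
    buff = f.read()  # decompressed contents
```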