wjzhou commented on issue #41604: URL: https://github.com/apache/arrow/issues/41604#issuecomment-2111562353
Setting `buffer_size=10_000_000` explicitly in `pyarrow_s3fs.open_input_stream` works; the default is just `None`:
https://github.com/apache/arrow/blob/657c4faf21700c0899703a4759bde76235c38199/python/pyarrow/_fs.pyx#L744-L746

Actually, after thinking about it, I believe the real problem is that when `CompressedInputStream` is in use, it converts a single upper-layer `Read(1_000_000, out)` into multiple lower-layer calls of `Read(64 * 1024, out)`, e.g.:

```cpp
Result<int64_t> Read(int64_t nbytes, void* out) {
  ...
  while (nbytes - total_read > 0 && decompressor_has_data) {
    ...
    ARROW_ASSIGN_OR_RAISE(decompressor_has_data, RefillDecompressed());
    ...
  }
}

Result<bool> RefillDecompressed() {
  ...
  RETURN_NOT_OK(EnsureCompressedData());
  ...
}

Status EnsureCompressedData() {
  ...
  raw_->Read(kChunkSize, compressed_for_non_zero_copy_->mutable_data_as<void>()));
  // where kChunkSize is 64 KiB
  ...
}
```

https://github.com/apache/arrow/blob/657c4faf21700c0899703a4759bde76235c38199/cpp/src/arrow/io/compressed.cc#L388C2-L408C4

> Or maybe try again with a input stream wrapped with a buffer_size:

Once the reason for the problem is known, adding a `BufferedInputStream` is a good solution. But for a new pyarrow user it is hard to know that `pa.BufferedInputStream` is needed. For example, the following code is perfectly fine as long as `path_src` does not end with `.gz` etc.; it issues a single read:

```python
path_src = "file.csv"
with pyarrow_s3fs.open_input_stream(path_src) as f:
    buff = f.read()
```

But if `path_src = "file.csv.gz"`, we need to add the `BufferedInputStream` wrapper to prevent the 64 KiB ranged reads (see the sketches at the end of this comment).
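For reference, here is a minimal sketch of the explicit `buffer_size` workaround; the region, bucket/path, and the `10_000_000` value are placeholders for illustration only:

```python
import pyarrow.fs as fs

# Hypothetical filesystem and path, for illustration only.
pyarrow_s3fs = fs.S3FileSystem(region="us-east-1")
path_src = "my-bucket/file.csv.gz"

# Passing buffer_size wraps the raw S3 stream in a BufferedInputStream
# *before* the CompressedInputStream, so decompression no longer issues
# 64 KiB ranged reads against S3.
with pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000) as f:
    buff = f.read()  # decompressed contents
```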

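And a sketch of doing the wrapping manually for the `.gz` case, which roughly reproduces what `open_input_stream` does internally when `buffer_size` is set (same hypothetical filesystem and path as above):

```python
import pyarrow as pa
import pyarrow.fs as fs

# Same hypothetical filesystem and path as in the previous sketch.
pyarrow_s3fs = fs.S3FileSystem(region="us-east-1")
path_src = "my-bucket/file.csv.gz"

# Open the raw (still compressed) bytes without automatic decompression.
raw = pyarrow_s3fs.open_input_stream(path_src, compression=None)

# Buffer the raw stream first, then decompress on top of the buffer, so
# CompressedInputStream's 64 KiB chunk reads are served from memory, not S3.
buffered = pa.BufferedInputStream(raw, 10_000_000)
with pa.CompressedInputStream(buffered, "gzip") as f:
    buff = f.read()  # decompressed contents
```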