wjzhou-ep commented on issue #41604: URL: https://github.com/apache/arrow/issues/41604#issuecomment-2111457796
@mapleFU Thank you for providing the alternative. The `buffer_size` parameter in `pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)` does what I was after. My points are:

1. It is not obvious from the documentation that, once the input is a `gz` file instead of a plain `csv` file, the `block_size` in `ReadOptions` no longer affects the read size. It is not easy to figure out, either: the symptom is just very slow reading, and I only realized what was happening when I inspected the traffic and saw 65 KB requests being issued. (A sketch of the workaround follows below.)
2. The default of 65 KB might be a little too small for today's computers; for comparison, when not reading a gz file, the default block size for CSV streaming is 1 MB: https://github.com/apache/arrow/blob/657c4faf21700c0899703a4759bde76235c38199/cpp/src/arrow/csv/options.h#L149
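
For anyone hitting the same slowdown, here is a minimal sketch of the workaround, assuming a hypothetical bucket, region, and path, and relying on the default `compression='detect'` to handle the gzip layer:

```python
import pyarrow.csv as csv
import pyarrow.fs as fs

# Hypothetical bucket/region; credentials are picked up from the environment.
pyarrow_s3fs = fs.S3FileSystem(region="us-east-1")
path_src = "my-bucket/data/big.csv.gz"

# buffer_size wraps the raw S3 stream in a buffered reader *beneath* the
# gzip decompressor, so each S3 GET fetches up to ~10 MB of compressed
# bytes instead of the 64 KiB default.
stream = pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)

# block_size only controls the size of the decompressed chunks handed to
# the parser; it does not change how much is read from S3 per request.
table = csv.read_csv(stream, read_options=csv.ReadOptions(block_size=1 << 20))
```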
