[GitHub] [arrow] jorisvandenbossche commented on issue #36765: [Python][Dataset][Parquet] Enable Pre-Buffering by default for Parquet s3 datasets

via GitHub Mon, 25 Sep 2023 03:19:22 -0700


jorisvandenbossche commented on issue #36765:
URL: https://github.com/apache/arrow/issues/36765#issuecomment-1733375154


   I discussed this issue with some people at PyData Amsterdam and planned to 
open a PR that simply switches the default once I was back, so here it is: 
https://github.com/apache/arrow/pull/37854
   
   That PR only changes the default for Python (`pyarrow.dataset`), but should 
we also change the default in C++? 
   From a basic check, it seems the R code already enables it by default (this 
was changed a while ago in https://github.com/apache/arrow/pull/11386). 
   
   I noticed that the R PR also set `cache_options` to `LazyDefaults`. Is that 
also something we want to change on the Python/C++ side? (The current default 
is `CacheOptions::Defaults()`.)
   
   Another useful reference for the above discussion is 
https://github.com/apache/arrow/issues/28218 
(https://issues.apache.org/jira/browse/ARROW-12428), where @lidavidm ran 
benchmarks with `pre_buffer` enabled and disabled. Those benchmarks were the 
reason for exposing the `pre_buffer` option in `pyarrow.parquet` with a default 
of True (https://github.com/apache/arrow/pull/10074).


