lidavidm commented on code in PR #37854:
URL: https://github.com/apache/arrow/pull/37854#discussion_r1347392116
##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -666,7 +666,7 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
Disabled by default.
buffer_size : int, default 8192
Size of buffered stream, if enabled. Default is 8KB.
- pre_buffer : bool, default False
+ pre_buffer : bool, default True
If enabled, pre-buffer the raw Parquet data instead of issuing one
read per column chunk. This can improve performance on high-latency
filesystems.
Review Comment:
Should we also extend the docstring to say that users should disable this
if they care more about memory usage than throughput? It may also be worth
making clear that "high-latency filesystems" mostly means object stores
such as S3 and GCS.