Re: [PR] GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files [arrow]

via GitHub Thu, 05 Oct 2023 06:08:42 -0700


lidavidm commented on code in PR #37854:
URL: https://github.com/apache/arrow/pull/37854#discussion_r1347392116



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -666,7 +666,7 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         Disabled by default.
     buffer_size : int, default 8192
         Size of buffered stream, if enabled. Default is 8KB.
-    pre_buffer : bool, default False
+    pre_buffer : bool, default True
         If enabled, pre-buffer the raw Parquet data instead of issuing one
         read per column chunk. This can improve performance on high-latency
         filesystems.

Review Comment:
   Possibly we should also have the docstring mention that you should
disable this if you care more about memory usage than throughput? (Also,
possibly make it clear that "high-latency filesystems" most likely means
object stores like S3, GCS, etc.)
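
To illustrate the trade-off for the docstring discussion, here is a rough pure-Python model of what "pre-buffering instead of one read per column chunk" implies. This is only a sketch and not Arrow's actual implementation: the `coalesce_ranges` helper, the `hole_size_limit` parameter as used here, and the chunk layout are all made up for illustration.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) of one column chunk in the file


def coalesce_ranges(ranges: List[Range], hole_size_limit: int) -> List[Range]:
    """Merge byte ranges whose gap is at most hole_size_limit bytes.

    Hypothetical helper: a simplified model of read coalescing,
    not the code path that pre_buffer actually uses.
    """
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end <= hole_size_limit:
                # Close enough: extend the previous read to cover this chunk,
                # including any unused bytes in the gap between them.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


# Hypothetical layout: four column chunks of one row group.
chunks = [(0, 1000), (1000, 2000), (3100, 500), (50_000, 4000)]

# Without pre-buffering: one request per column chunk.
per_chunk_reads = len(chunks)
requested_bytes = sum(length for _, length in chunks)

# With pre-buffering (modelled): fewer, larger requests held in memory.
buffered = coalesce_ranges(chunks, hole_size_limit=8192)
buffered_reads = len(buffered)
buffered_bytes = sum(length for _, length in buffered)

print(per_chunk_reads, requested_bytes)  # 4 7500
print(buffered_reads, buffered_bytes)    # 2 7600
```

In this model the number of round trips drops from 4 to 2, which is what matters on an object store, while the reader ends up holding slightly more bytes than it asked for, and holds them all up front. That is the memory-versus-throughput point the docstring could spell out next to `pre_buffer`.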



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
