lidavidm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r412987199



##########
File path: cpp/src/parquet/properties.h
##########
@@ -56,10 +60,32 @@ class PARQUET_EXPORT ReaderProperties {
 
   bool is_buffered_stream_enabled() const { return buffered_stream_enabled_; }
 
+  bool is_coalesced_stream_enabled() const { return coalesced_stream_enabled_; }
+
   void enable_buffered_stream() { buffered_stream_enabled_ = true; }
 
   void disable_buffered_stream() { buffered_stream_enabled_ = false; }
 
+  /// Enable read coalescing.
+  ///
+  /// When enabled, readers can optionally call
+  /// ParquetFileReader.PreBuffer to cache the necessary slices of the
+  /// file in memory before deserialization. Arrow readers
+  /// automatically do this. This is intended to increase performance
+  /// when reading from high-latency filesystems (e.g. Amazon S3).
+  ///
+  /// When this is enabled, it is NOT SAFE to concurrently create
+  /// RecordBatchReaders from the same file.

Review comment:
       I've updated the wording - it's not unsafe (in that it won't crash), but a read will fail if you try to read a row group/column chunk that wasn't cached, because there's one shared cache that isn't specific to a particular reader. Looking through the code, we could make this work better by avoiding a PreBuffer method and instead threading the cache all the way from the Arrow readers down to the Parquet implementation. I think that's possible, but it would require passing the cache between several classes/methods - do you think that's worth it?
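
       For context, here's a minimal sketch of the call pattern I have in mind. `enable_coalesced_stream()` is my assumed counterpart to the `is_coalesced_stream_enabled()` getter above, and the I/O-context/cache-option arguments to `PreBuffer` have shifted between revisions, so treat the exact signatures as approximate:

```cpp
#include <memory>
#include <string>
#include <vector>

#include "arrow/io/caching.h"     // arrow::io::CacheOptions
#include "arrow/io/file.h"        // arrow::io::ReadableFile
#include "arrow/io/interfaces.h"  // arrow::io::IOContext
#include "parquet/file_reader.h"
#include "parquet/properties.h"

void ReadWithCoalescing(const std::string& path) {
  parquet::ReaderProperties props;
  props.enable_coalesced_stream();  // assumed toggle paired with the getter above

  // Open the file and a Parquet reader with coalescing enabled.
  auto infile = arrow::io::ReadableFile::Open(path).ValueOrDie();
  auto reader = parquet::ParquetFileReader::Open(infile, props);

  // Cache only the slices we intend to read. Reading a row group or
  // column chunk outside this set fails, because the single shared
  // cache was never populated for those byte ranges.
  std::vector<int> row_groups = {0};
  std::vector<int> columns = {0, 1};
  reader->PreBuffer(row_groups, columns, ::arrow::io::IOContext(),
                    ::arrow::io::CacheOptions::Defaults());

  // Subsequent reads of row group 0, columns 0 and 1 are now served
  // from memory instead of issuing new high-latency filesystem reads.
  auto row_group = reader->RowGroup(0);
  auto column_chunk_0 = row_group->Column(0);
  // ... deserialize column chunks 0 and 1 as usual ...
}
```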



