wesm commented on a change in pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#discussion_r412465641
########## File path: cpp/src/parquet/properties.h ##########
@@ -56,10 +60,32 @@ class PARQUET_EXPORT ReaderProperties {
   bool is_buffered_stream_enabled() const { return buffered_stream_enabled_; }

+  bool is_coalesced_stream_enabled() const { return coalesced_stream_enabled_; }
+
   void enable_buffered_stream() { buffered_stream_enabled_ = true; }

   void disable_buffered_stream() { buffered_stream_enabled_ = false; }

+  /// Enable read coalescing.
+  ///
+  /// When enabled, readers can optionally call
+  /// ParquetFileReader.PreBuffer to cache the necessary slices of the
+  /// file in-memory before deserialization. Arrow readers
+  /// automatically do this. This is intended to increase performance
+  /// when reading from high-latency filesystems (e.g. Amazon S3).
+  ///
+  /// When this is enabled, it is NOT SAFE to concurrently create
+  /// RecordBatchReaders from the same file.

Review comment:
   Could it be made safe?



########## File path: cpp/src/parquet/file_reader.h ##########
@@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader {
   // Returns the file metadata. Only one instance is ever created
   std::shared_ptr<FileMetaData> metadata() const;

+  /// Pre-buffer the specified column indices in all row groups.
+  ///
+  /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is set;
+  /// otherwise this is a no-op. The reader internally maintains a cache which is
+  /// overwritten each time this is called. Intended to increase performance on
+  /// high-latency filesystems (e.g. Amazon S3).
+  void PreBuffer(const std::vector<int>& row_groups,
+                 const std::vector<int>& column_indices);

Review comment:
   My biggest question is whether this detail should be exposed here or handled automatically under the hood. I don't have a strong opinion.



########## File path: cpp/src/parquet/arrow/reader.cc ##########
@@ -803,6 +809,9 @@ Status FileReaderImpl::ReadRowGroups(const std::vector<int>& row_groups,
     return Status::Invalid("Invalid column index");
   }

+  // PARQUET-1698/PARQUET-1820: pre-buffer row groups/column chunks if enabled
+  parquet_reader()->PreBuffer(row_groups, indices);

Review comment:
   Per comments below, it seems slightly odd that this would be a no-op sometimes. It seems like it would be better to have
   ```
   if (properties_.prebuffering_enabled()) {
     parquet_reader()->PreBuffer(...);
   }
   ```



########## File path: cpp/src/parquet/properties.h ##########
@@ -56,10 +60,32 @@ class PARQUET_EXPORT ReaderProperties {
   bool is_buffered_stream_enabled() const { return buffered_stream_enabled_; }

+  bool is_coalesced_stream_enabled() const { return coalesced_stream_enabled_; }

Review comment:
   So rather than having this flag here, what if you put a prebuffer option in `ArrowReaderProperties`? Then `FileReader::PreBuffer` will always prebuffer if it's called.



########## File path: cpp/src/parquet/file_reader.h ##########
@@ -117,6 +117,15 @@ class PARQUET_EXPORT ParquetFileReader {
   // Returns the file metadata. Only one instance is ever created
   std::shared_ptr<FileMetaData> metadata() const;

+  /// Pre-buffer the specified column indices in all row groups.
+  ///
+  /// Only has an effect if ReaderProperties.is_coalesced_stream_enabled is set;

Review comment:
   Does this detail need to be exposed? The first time `PreBuffer` is called, could the cached reader be created?
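   For reference, a minimal caller-side sketch of the API as proposed in this PR. It assumes a matching `enable_coalesced_stream()` setter exists on `ReaderProperties` next to the `is_coalesced_stream_enabled()` accessor shown above (only the accessor appears in these hunks), and the row-group/column indices are purely illustrative:
   ```cpp
   #include <memory>
   #include <string>
   #include <vector>

   #include "arrow/io/file.h"
   #include "arrow/memory_pool.h"
   #include "arrow/result.h"
   #include "arrow/status.h"
   #include "arrow/table.h"
   #include "parquet/arrow/reader.h"
   #include "parquet/file_reader.h"
   #include "parquet/properties.h"

   arrow::Status ReadWithPreBuffer(const std::string& path,
                                   std::shared_ptr<arrow::Table>* out) {
     ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

     // Opt in to read coalescing. The setter name is assumed; only the
     // is_coalesced_stream_enabled() getter appears in this diff.
     parquet::ReaderProperties props = parquet::default_reader_properties();
     props.enable_coalesced_stream();

     std::unique_ptr<parquet::ParquetFileReader> reader =
         parquet::ParquetFileReader::Open(input, props);

     // Cache row group 0, columns 0 and 1 before deserialization. Without the
     // property above this call is currently a no-op, which is what the
     // comments here are questioning.
     reader->PreBuffer({0}, {0, 1});

     std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
     ARROW_RETURN_NOT_OK(parquet::arrow::FileReader::Make(
         arrow::default_memory_pool(), std::move(reader), &arrow_reader));
     return arrow_reader->ReadRowGroups({0}, {0, 1}, out);
   }
   ```
   Under the alternative floated above, the opt-in would move to a `pre_buffer`-style option on `ArrowReaderProperties`, and `PreBuffer()` would always populate the cache when called instead of silently becoming a no-op.
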
########## File path: cpp/src/arrow/filesystem/s3fs_benchmark.cc ##########
@@ -141,16 +142,31 @@ class MinioFixture : public benchmark::Fixture {
   }

   /// Make an object with Parquet data.
+  /// Appends integer columns to the beginning (to act as indices).
   Status MakeParquetObject(const std::string& path, int num_columns, int num_rows) {
-    std::vector<std::shared_ptr<ChunkedArray>> columns(num_columns);
-    std::vector<std::shared_ptr<Field>> fields(num_columns);
-    for (int i = 0; i < num_columns; ++i) {
+    std::vector<std::shared_ptr<ChunkedArray>> columns;
+    std::vector<std::shared_ptr<Field>> fields;
+
+    {
+      arrow::random::RandomArrayGenerator generator(0);
+      std::shared_ptr<Array> values = generator.Int64(num_rows, 0, 1e10, 0);
+      columns.push_back(std::make_shared<ChunkedArray>(values));
+      fields.push_back(::arrow::field("timestamp", values->type()));
+    }
+    {
+      arrow::random::RandomArrayGenerator generator(1);
+      std::shared_ptr<Array> values = generator.Int32(num_rows, 0, 1e9, 0);
+      columns.push_back(std::make_shared<ChunkedArray>(values));
+      fields.push_back(::arrow::field("val", values->type()));
+    }
+
+    for (int i = 0; i < num_columns; i++) {
       arrow::random::RandomArrayGenerator generator(i);
       std::shared_ptr<Array> values = generator.Float64(num_rows, -1.e10, 1e10, 0);
       std::stringstream ss;
       ss << "col" << i;
-      columns[i] = std::make_shared<ChunkedArray>(values);
-      fields[i] = ::arrow::field(ss.str(), values->type());
+      columns.push_back(std::make_shared<ChunkedArray>(values));
+      fields.push_back(::arrow::field(ss.str(), values->type()));
     }

Review comment:
   Aside: this stretch of code seems to be crying out for a simpler table builder API.
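   Not the dedicated builder API the aside asks for, but for comparison, a sketch of how the same table assembly can be condensed today with the existing `arrow::Table::Make` and `arrow::schema` helpers plus a small local lambda (the function name `MakeBenchmarkTable` and the single shared random seed are illustrative, not taken from this PR):
   ```cpp
   #include <memory>
   #include <sstream>
   #include <string>
   #include <vector>

   #include "arrow/array.h"
   #include "arrow/table.h"
   #include "arrow/testing/random.h"
   #include "arrow/type.h"

   std::shared_ptr<arrow::Table> MakeBenchmarkTable(int num_columns, int num_rows) {
     arrow::random::RandomArrayGenerator generator(/*seed=*/0);

     std::vector<std::shared_ptr<arrow::Field>> fields;
     std::vector<std::shared_ptr<arrow::ChunkedArray>> columns;

     // Keep fields and columns in sync through one helper instead of
     // maintaining the two vectors by hand at every call site.
     auto append = [&](const std::string& name, std::shared_ptr<arrow::Array> values) {
       fields.push_back(arrow::field(name, values->type()));
       columns.push_back(std::make_shared<arrow::ChunkedArray>(std::move(values)));
     };

     // Leading "index" columns, mirroring the fixture above.
     append("timestamp", generator.Int64(num_rows, 0, 1e10, /*null_probability=*/0));
     append("val", generator.Int32(num_rows, 0, 1e9, /*null_probability=*/0));

     // Payload columns.
     for (int i = 0; i < num_columns; ++i) {
       std::stringstream ss;
       ss << "col" << i;
       append(ss.str(), generator.Float64(num_rows, -1e10, 1e10, /*null_probability=*/0));
     }

     return arrow::Table::Make(arrow::schema(std::move(fields)), columns, num_rows);
   }
   ```
   A builder API would mostly fold this parallel `fields`/`columns` bookkeeping and the final `Table::Make` call into a single object.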