[GitHub] [arrow] westonpace commented on a diff in pull request #34461: GH-34460: [C++][Parquet] Split arrow::FileReader::ReadRowGroups() for flexible async IO

via GitHub Mon, 06 Mar 2023 13:21:46 -0800


westonpace commented on code in PR #34461:
URL: https://github.com/apache/arrow/pull/34461#discussion_r1127045179



##########
cpp/src/parquet/arrow/reader.h:
##########
@@ -249,6 +249,13 @@ class PARQUET_EXPORT FileReader {
 
   virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out) = 0;
 
+  virtual ::arrow::Status WillNeedRowGroups(const std::vector<int>& row_groups,
+                                            const std::vector<int>& column_indices) = 0;

Review Comment:
   https://github.com/apache/arrow/pull/14723 adds a filesystem method for 
"read many".  I would like to see that filesystem method support plugging and 
splitting in the same way that `ReadRangeCache` does today.  `ReadRangeCache` 
would then only be needed if you need true "caching", and I think we could use 
the filesystem method here instead of the `ReadRangeCache`.
   
   This would allow local filesystems to rely on the OS for plugging & 
splitting and would allow remote filesystems like S3 to adapt the algorithm to 
their needs.  The filesystem method is also async and reliably returns a 
future, so `WillNeedRowGroups` could in turn return a future (I agree that 
would be desirable).
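
For readers unfamiliar with the terms, here is a minimal, standard-library-only sketch of what "plugging & splitting" a set of read ranges could look like. It reads "plugging" as merging ranges separated by a small hole and "splitting" as cutting oversized ranges; that reading is an assumption. `ByteRange` and `PlugAndSplit` are hypothetical names, and the two limits only loosely mirror the `hole_size_limit` and `range_size_limit` cache options. This is not the Arrow implementation and not the API added by the linked PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical byte range for illustration; not arrow::io::ReadRange.
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Sketch only: merge ranges whose gap is at most hole_size_limit ("plugging"),
// then cut every merged range into pieces of at most range_size_limit
// ("splitting"). A real filesystem could pick a different strategy.
std::vector<ByteRange> PlugAndSplit(std::vector<ByteRange> ranges,
                                    int64_t hole_size_limit,
                                    int64_t range_size_limit) {
  assert(hole_size_limit >= 0 && range_size_limit > 0);
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) {
              return a.offset < b.offset;
            });

  // Pass 1: plug small holes between neighbouring (or overlapping) ranges.
  std::vector<ByteRange> merged;
  for (const ByteRange& r : ranges) {
    if (r.length <= 0) continue;  // ignore empty requests
    if (!merged.empty()) {
      ByteRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      if (r.offset - last_end <= hole_size_limit) {
        const int64_t new_end = std::max(last_end, r.offset + r.length);
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }

  // Pass 2: split anything larger than the per-request size limit.
  std::vector<ByteRange> out;
  for (const ByteRange& r : merged) {
    const int64_t end = r.offset + r.length;
    for (int64_t pos = r.offset; pos < end; pos += range_size_limit) {
      out.push_back({pos, std::min(range_size_limit, end - pos)});
    }
  }
  return out;
}
```

The point of the suggestion above is that the two limits would be chosen by the filesystem rather than by the Parquet reader: a local filesystem might skip both passes and leave the work to the OS, while an S3-like filesystem would likely tolerate larger holes and issue larger requests.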



