[GitHub] [arrow] westonpace commented on a change in pull request #12123: ARROW-15300: [C++] Update Skyhook for async dataset interfaces

GitBox Wed, 12 Jan 2022 10:09:22 -0800


westonpace commented on a change in pull request #12123:
URL: https://github.com/apache/arrow/pull/12123#discussion_r783311353




##########
File path: cpp/src/skyhook/client/file_skyhook.cc
##########
@@ -95,10 +43,13 @@ class SkyhookFileFormat::Impl {
     return arrow::Status::OK();
   }
 
-  arrow::Result<arrow::dataset::ScanTaskIterator> ScanFile(
+  arrow::Result<arrow::RecordBatchGenerator> ScanBatchesAsync(
+      const std::shared_ptr<const SkyhookFileFormat>& format,
       const std::shared_ptr<arrow::dataset::ScanOptions>& options,
       const std::shared_ptr<arrow::dataset::FileFragment>& file) const {
-    /// Make sure client-side filtering and projection is turned off.
+    // TODO: investigate "true" async version
+
+    // Make sure client-side filtering and projection is turned off.
     file->apply_compute = false;

Review comment:
       This flag was only ever checked in the synchronous path.  Now that it is 
gone this flag is somewhat ineffective.  A more accurate thing to do would be 
to attach the filter as a guarantee to all batches produced by this format.
   
   Skipping scanner projection might still be nice I suppose but its more niche 
and a little trickier on the async / exec plan path.  For example in 
`AsyncScanner::ScanBatchesUnorderedAsync` if the user is using Skyhook then 
instead of...
   
   ```
     auto exprs = scan_options_->projection.call()->arguments;
     auto names = checked_cast<const compute::MakeStructOptions*>(
                      scan_options_->projection.call()->options.get())
                      ->field_names;
   ```
   
   ...we would need the same names but the exprs should all be field_refs onto 
the names themselves.  For example, if the user asked for 
`expr=call("multiply", {field_ref("a"), literal(2)});` and `name="a_times_two"` 
then we would pass that expr/name down to Skyhook and in the augmented_project 
node we could pass `expr=field_ref("a_times_two")` and `name="a_times_two"`.
   
   I think we can defer this to a follow-up PR if needed.  The issue already 
existed.  It will only affect users that are using custom projection (which 
isn't well documented yet and we haven't seen much of).
   
   Also, the double-filter-if-running-async behavior already existed and is 
harmless except for a minor hit to performance.

##########
File path: cpp/src/skyhook/client/file_skyhook.cc
##########
@@ -95,10 +43,13 @@ class SkyhookFileFormat::Impl {
     return arrow::Status::OK();
   }
 
-  arrow::Result<arrow::dataset::ScanTaskIterator> ScanFile(
+  arrow::Result<arrow::RecordBatchGenerator> ScanBatchesAsync(
+      const std::shared_ptr<const SkyhookFileFormat>& format,
       const std::shared_ptr<arrow::dataset::ScanOptions>& options,
       const std::shared_ptr<arrow::dataset::FileFragment>& file) const {
-    /// Make sure client-side filtering and projection is turned off.
+    // TODO: investigate "true" async version

Review comment:
       Can you be more specific what a true async version would do?  Since this 
is basically making a network request and waiting for it to finish I don't know 
that we need to do much more than you've already done here.  The only thing I 
could think of would be maybe transferring to the CPU thread pool before 
deserializing the table.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] westonpace commented on a change in pull request #12123: ARROW-15300: [C++] Update Skyhook for async dataset interfaces

Reply via email to