westonpace commented on a change in pull request #12123: URL: https://github.com/apache/arrow/pull/12123#discussion_r783311353
########## File path: cpp/src/skyhook/client/file_skyhook.cc ########## @@ -95,10 +43,13 @@ class SkyhookFileFormat::Impl { return arrow::Status::OK(); } - arrow::Result<arrow::dataset::ScanTaskIterator> ScanFile( + arrow::Result<arrow::RecordBatchGenerator> ScanBatchesAsync( + const std::shared_ptr<const SkyhookFileFormat>& format, const std::shared_ptr<arrow::dataset::ScanOptions>& options, const std::shared_ptr<arrow::dataset::FileFragment>& file) const { - /// Make sure client-side filtering and projection is turned off. + // TODO: investigate "true" async version + + // Make sure client-side filtering and projection is turned off. file->apply_compute = false; Review comment: This flag was only ever checked in the synchronous path. Now that it is gone this flag is somewhat ineffective. A more accurate thing to do would be to attach the filter as a guarantee to all batches produced by this format. Skipping scanner projection might still be nice I suppose but its more niche and a little trickier on the async / exec plan path. For example in `AsyncScanner::ScanBatchesUnorderedAsync` if the user is using Skyhook then instead of... ``` auto exprs = scan_options_->projection.call()->arguments; auto names = checked_cast<const compute::MakeStructOptions*>( scan_options_->projection.call()->options.get()) ->field_names; ``` ...we would need the same names but the exprs should all be field_refs onto the names themselves. For example, if the user asked for `expr=call("multiply", {field_ref("a"), literal(2)});` and `name="a_times_two"` then we would pass that expr/name down to Skyhook and in the augmented_project node we could pass `expr=field_ref("a_times_two")` and `name="a_times_two"`. I think we can defer this to a follow-up PR if needed. The issue already existed. It will only affect users that are using custom projection (which isn't well documented yet and we haven't seen much of). Also, the double-filter-if-running-async behavior already existed and is harmless except for a minor hit to performance. ########## File path: cpp/src/skyhook/client/file_skyhook.cc ########## @@ -95,10 +43,13 @@ class SkyhookFileFormat::Impl { return arrow::Status::OK(); } - arrow::Result<arrow::dataset::ScanTaskIterator> ScanFile( + arrow::Result<arrow::RecordBatchGenerator> ScanBatchesAsync( + const std::shared_ptr<const SkyhookFileFormat>& format, const std::shared_ptr<arrow::dataset::ScanOptions>& options, const std::shared_ptr<arrow::dataset::FileFragment>& file) const { - /// Make sure client-side filtering and projection is turned off. + // TODO: investigate "true" async version Review comment: Can you be more specific what a true async version would do? Since this is basically making a network request and waiting for it to finish I don't know that we need to do much more than you've already done here. The only thing I could think of would be maybe transferring to the CPU thread pool before deserializing the table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org