westonpace commented on a change in pull request #10913:
URL: https://github.com/apache/arrow/pull/10913#discussion_r699822754
##########
File path: cpp/src/arrow/dataset/scanner_internal.h
##########
@@ -185,6 +185,10 @@ inline Result<ScanTaskIterator> GetScanTaskIterator(
auto fn = [options](std::shared_ptr<Fragment> fragment) ->
Result<ScanTaskIterator> {
ARROW_ASSIGN_OR_RAISE(auto scan_task_it, fragment->Scan(options));
+ if (options->skip_compute) {
Review comment:
For this PR I think I recommend that we leave this alone. It should be
ok to leave `skip_compute` in the sync scanner (as it is eventually going away).
Long term there is work underway to move the internals of the scanner out
into something called the ExecPlan. The ExecPlan is a relational algebra query
engine (some might say query engine backend). It runs the data through a
series of operators (scan, project, filter, join, etc.)
If a fragment provides its own filtering capabilities then I believe the
appropriate way to handle this will be for the fragment to attach a "guarantee"
to the record batch.
In addition, fragments may not be able to handle the entire filter. For
example, imagine the SQL query `SELECT a.id, a.name from A a INNER JOIN B b ON
a.id = b.a_id WHERE a.value < b.value AND b.type = 'x'` and imagine that the
data for `a` comes from an SQL database somewhere and the data for `b` comes
from a skyhook scan. There are a lot of details that still need to be worked
out (and a fair amount of discussion on the mailing list how this might happen)
but I don't think it would be correct for Skyhook to simply tell the query
engine to "not filter" because the "a.value < b.value" portion of the filter
will need to be handled by the query engine. Instead, skyhook would attach the
guarantee "b.type = 'x'" to the returned batches.
So, long story short, let's tackle this problem a later day.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]