[ https://issues.apache.org/jira/browse/ARROW-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531752#comment-17531752 ]
Weston Pace commented on ARROW-16409:
-------------------------------------
That is an important behavior. In R, which has already abandoned the scanner,
we can't do row-count queries. However, my hope would be to add that to our
query options so that a scan node with an empty list of columns will do a
metadata-only query.
> [C++][Python][R] Deprecate "scanner" (but keep "scan node") from public API
> ---------------------------------------------------------------------------
>
> Key: ARROW-16409
> URL: https://issues.apache.org/jira/browse/ARROW-16409
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
> Fix For: 9.0.0
>
>
> The scanner, in its original form, was something of a prototype query engine.
> It handled complex projection (beyond just casting) and filtering. Over
> time, features have been moved out of the scanner and into the execution
> engine, to the point that the scanner is now just a tool for scanning multiple
> files simultaneously to feed as input to an exec plan (i.e. "scan node").
> The concept of a "scanner" should mostly be removed from our public API
> surface. Those working directly with the execution engine will still need to
> know about the scan node but that should be about it.
> For example, in Python we have pages [like
> this|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html]
> and code like this:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> scanner = dataset.scanner(columns=['x'])
> ds.write_dataset(scanner, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> Over time I think this will lead to confusion. It's already a little
> convoluted. For example, a call to {{dataset.to_table(...)}} creates a
> {{Scanner}} and calls {{ToTable}} with {{ScanOptions}}. This method then
> creates an {{ExecPlan}} and, in order to do so, must create a {{ScanNode}}.
> The {{ScanNode}} consumes some (but not all) of the options in {{ScanOptions}}
> while the {{ExecPlan}} consumes the rest.
> The {{Scanner}} (if one continues to exist) should be an internal detail not
> visible to users. The previous code could either change to use a new term
> {{query}}:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> query = dataset.query(columns=['x'])
> ds.write_dataset(query, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> Or we could use the record batch reader concept:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> record_batch_reader = dataset.to_reader(columns=['x'])
> ds.write_dataset(record_batch_reader, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> I would like to make some changes to the scanner in 9.0.0 and would hope to
> address this then, so I'm happy to hear opinions / thoughts.