Weston Pace created ARROW-16409:
-----------------------------------
Summary: [C++][Python][R] Deprecate "scanner" (but keep "scan
node") from public API
Key: ARROW-16409
URL: https://issues.apache.org/jira/browse/ARROW-16409
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
Assignee: Weston Pace
Fix For: 9.0.0
The scanner, in its original form, was something of a prototype query engine.
It handled complex projection (beyond just casting) and filtering. Over time
features have been moved out of the scanner and into the execution engine to
the point that the scanner now is just a tool for scanning multiple files
simultaneously to feed as input to an exec plan (i.e. "scan node").
The concept of a "scanner" should mostly be removed from our public API
surface. Those working directly with the execution engine will still need to
know about the scan node but that should be about it.
For example, in python we have pages [like
this|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html]
and code like this:
{noformat}
dataset = ds.dataset('/tmp/my_dataset', format='parquet')
scanner = dataset.scanner(columns=['x'])
ds.write_dataset(scanner, '/tmp/my_new_dataset', format='parquet')
{noformat}
Over time I think this will lead to confusion. It's already a little
convoluted. For example, a call to {{dataset.to_table(...)}} creates a
{{Scanner}} and calls {{ToTable}} with {{ScanOptions}}. This method then
creates an {{ExecPlan}} and, in order to do so, must create a {{ScanNode}}.
The {{ScanNode}} consumes some (but not all) of the options in {{ScanOption}}
while the {{ExecPlan}} consumes the rest.
The {{Scanner}} (if one continues to exist) should be an internal detail not
visible to users. The previous code could either change to use a new term
{{query}}:
{noformat}
dataset = ds.dataset('/tmp/my_dataset', format='parquet')
query = dataset.query(columns=['x'])
ds.write_dataset(query, '/tmp/my_new_dataset', format='parquet')
{noformat}
Or we could use the record batch reader concept:
{noformat}
dataset = ds.dataset('/tmp/my_dataset', format='parquet')
record_batch_reader = dataset.to_reader(columns=['x'])
ds.write_dataset(record_batch_reader, '/tmp/my_new_dataset', format='parquet')
{noformat}
I would like to make some changes to the scanner in 9.0.0 and would hope to
address this then so I'm happy to hear opinions / thoughts.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)