[jira] [Commented] (ARROW-16409) [C++][Python][R] Deprecate "scanner" (but keep "scan node") from public API

Weston Pace (Jira) Wed, 04 May 2022 07:44:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531752#comment-17531752
 ]


Weston Pace commented on ARROW-16409:
-------------------------------------

That is an important behavior.  In R, which has already abandoned the scanner, 
we can't do row-count queries.  However, I hope to add that to our query 
options so that a scan node with an empty list of columns will do a 
metadata-only query.

> [C++][Python][R] Deprecate "scanner" (but keep "scan node") from public API
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-16409
>                 URL: https://issues.apache.org/jira/browse/ARROW-16409
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 9.0.0
>
>
> The scanner, in its original form, was something of a prototype query engine. 
> It handled complex projection (beyond just casting) and filtering.  Over 
> time features have been moved out of the scanner and into the execution 
> engine to the point that the scanner now is just a tool for scanning multiple 
> files simultaneously to feed as input to an exec plan (i.e. "scan node").
> The concept of a "scanner" should mostly be removed from our public API 
> surface.  Those working directly with the execution engine will still need to 
> know about the scan node but that should be about it.
> For example, in Python we have pages [like 
> this|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html]
>  and code like this:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> scanner = dataset.scanner(columns=['x'])
> ds.write_dataset(scanner, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> Over time I think this will lead to confusion.  It's already a little 
> convoluted.  For example, a call to {{dataset.to_table(...)}} creates a 
> {{Scanner}} and calls {{ToTable}} with {{ScanOptions}}.  This method then 
> creates an {{ExecPlan}} and, in order to do so, must create a {{ScanNode}}.  
> The {{ScanNode}} consumes some (but not all) of the options in {{ScanOptions}} 
> while the {{ExecPlan}} consumes the rest.
> The {{Scanner}} (if one continues to exist) should be an internal detail not 
> visible to users.  The previous code could either change to use a new term 
> {{query}}:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> query = dataset.query(columns=['x'])
> ds.write_dataset(query, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> Or we could use the record batch reader concept:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> record_batch_reader = dataset.to_reader(columns=['x'])
> ds.write_dataset(record_batch_reader, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> I would like to make some changes to the scanner in 9.0.0 and hope to 
> address this then, so I'm happy to hear opinions / thoughts.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
