[ https://issues.apache.org/jira/browse/ARROW-12311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318001#comment-17318001 ]
Joris Van den Bossche edited comment on ARROW-12311 at 4/9/21, 1:52 PM:
------------------------------------------------------------------------
Agreed on option 3: if we can more easily expose the Scanner object in Python,
that seems like the appropriate place to expose interesting fields (like the
projected schema).
It's a bit of a pity that {{Dataset.scan()}} already creates a Scanner ànd starts
the scan; otherwise that method could have returned a Scanner instead.
Basically it's {{Dataset._scanner(**kwargs)}} that we want to expose from the
Dataset, right? (Without the user having to call
{{Scanner.from_dataset(dataset, **kwargs)}}.)
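To make the proposed shape concrete, here is a minimal pure-Python sketch. The {{Dataset}} and {{Scanner}} classes below are toy stand-ins written for illustration, not the pyarrow classes, and the public {{scanner()}} method name and the {{columns}} keyword are assumptions. The idea is that the dataset hands back a Scanner without starting the scan, and the Scanner is where the projected schema is exposed.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Scanner:
    """Toy scanner: holds the scan settings and reads nothing until scan() is called."""

    dataset: Dataset
    columns: list[str]

    @classmethod
    def from_dataset(cls, dataset: Dataset, columns: Optional[list[str]] = None) -> Scanner:
        # Default to all columns when no projection is given.
        cols = list(columns) if columns is not None else list(dataset.schema)
        unknown = [name for name in cols if name not in dataset.schema]
        if unknown:
            raise KeyError(f"unknown columns: {unknown}")
        return cls(dataset, cols)

    @property
    def projected_schema(self) -> dict[str, str]:
        # Available before any data is read.
        return {name: self.dataset.schema[name] for name in self.columns}

    def scan(self) -> Iterator[dict]:
        # The scan only starts here, lazily.
        for row in self.dataset.rows:
            yield {name: row[name] for name in self.columns}


@dataclass
class Dataset:
    """Toy dataset: a schema (name -> type string) plus in-memory rows."""

    schema: dict[str, str]
    rows: list[dict]

    def scanner(self, **kwargs) -> Scanner:
        # Hypothetical public counterpart of the private _scanner(**kwargs):
        # saves the user from calling Scanner.from_dataset(dataset, **kwargs).
        return Scanner.from_dataset(self, **kwargs)
```

With this shape, {{dataset.scanner(columns=["b"]).projected_schema}} can be inspected first, and the caller decides separately whether and when to run {{scan()}}.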
> [Python][R] Expose (hide?) ScanOptions
> --------------------------------------
>
> Key: ARROW-12311
> URL: https://issues.apache.org/jira/browse/ARROW-12311
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python, R
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
> Fix For: 5.0.0
>
>
> Currently R completely hides the `ScanOptions` class.
> In Python the class is exposed but the documentation prefers `dataset.scan`
> (which hides both the scanner and the scan options).
> However, there is some useful information in the `ScanOptions`.
> Specifically, the projected schema (which is a product of the dataset schema
> and the projection expression and not easily recreated) and the materialized
> fields (the list of fields referenced by either the filter or the projection)
> which might be useful for reporting purposes.
> Currently R uses the projected schema to convert a list of column names into
> a partition schema. Python does not rely on either field.
>
> Options:
> - Keep the status quo
> - Expose the ScanOptions object (which itself is exposed via the Scanner)
> - Expose the interesting fields via the Scanner
>
> Currently the C++ design is halfway between the latter two (both the projected
> schema and the options are exposed). My preference would be the third option.
> It raises a further question about how to expose the scanner itself in Python:
> should the user be using ScannerBuilder? Should they use NewScan? Should they
> use the scanner directly at all, or should it be hidden?
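As a rough illustration of the two fields named in the description, the sketch below derives a projected schema and a list of materialized fields from a toy schema. It is not Arrow code: the schema is a plain dict, the projection is simplified to a mapping of output name to source column, the filter is reduced to the list of field names it references, and both function names are hypothetical.

```python
def projected_schema(dataset_schema: dict[str, str], projection: dict[str, str]) -> dict[str, str]:
    """Output schema: each projected name takes the type of its source column."""
    return {out_name: dataset_schema[source] for out_name, source in projection.items()}


def materialized_fields(
    dataset_schema: dict[str, str],
    projection: dict[str, str],
    filter_fields: list[str],
) -> list[str]:
    """Fields referenced by either the filter or the projection, in schema order."""
    referenced = set(projection.values()) | set(filter_fields)
    return [name for name in dataset_schema if name in referenced]


schema = {"id": "int64", "name": "string", "year": "int32"}
projection = {"label": "name"}   # select "name", renamed to "label"
filter_fields = ["year"]         # e.g. a filter on "year"

print(projected_schema(schema, projection))
print(materialized_fields(schema, projection, filter_fields))
```

Even in this simplified form, the projected schema depends on both the dataset schema and the projection, and the materialized fields include a column ("year") that never appears in the output.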