[ https://issues.apache.org/jira/browse/ARROW-12311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491199#comment-17491199 ]
David Li commented on ARROW-12311:
----------------------------------
* I think the projection behavior is the same, but this is a "user friendly"
way of putting it. I suppose before you could have the projection be something
completely arbitrary, if you implemented your own kernel returning a Struct
type, but I don't think that's useful to support. However, it seems we only
support selecting/renaming/casting fields, not anything more complex? (Oh wait,
"scan node" will become something distinct from the "scanner"? I'm a little
confused here now.)
* Do we want to error on reordered fields? That could matter if we allow
indices for projection.
* For multiple formats, we should perhaps consider the proposal in ARROW-11981:
"Dataset could be simplified to a concrete class containing a set of compatibly
typed/formatted Fragments".
> [Python][R] Expose (hide?) ScanOptions
> --------------------------------------
>
> Key: ARROW-12311
> URL: https://issues.apache.org/jira/browse/ARROW-12311
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python, R
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
> Fix For: 8.0.0
>
>
> Currently R completely hides the `ScanOptions` class.
> In Python the class is exposed, but the documentation prefers `dataset.scan`
> (which hides both the scanner and the scan options).
> However, there is some useful information in the `ScanOptions`.
> Specifically, the projected schema (which is a product of the dataset schema
> and the projection expression and not easily recreated) and the materialized
> fields (the list of fields referenced by either the filter or the projection)
> which might be useful for reporting purposes.
> Currently R uses the projected schema to convert a list of column names into
> a partition schema. Python does not rely on either field.
>
> Options:
> - Keep the status quo
> - Expose the ScanOptions object (which itself is exposed via the Scanner)
> - Expose the interesting fields via the Scanner
>
> Currently the C++ design is halfway between the latter two (both the projected
> schema and the options are exposed). My preference would be the third option.
> This raises a further question about how to expose the scanner itself in
> Python. Should the user be using ScannerBuilder? Should they use NewScan?
> Should they use the scanner directly at all, or should it be hidden?
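
To make the two fields discussed in the description concrete, here is a minimal sketch in plain Python. It is a toy model, not the Arrow API: the names `projected_schema` and `materialized_fields`, the tuple-based schema, and the dict-based projection are all assumptions made for illustration. It only covers the select/rename/cast style of projection mentioned in the comment above.

```python
from typing import Dict, List, Optional, Tuple

# Toy stand-ins, not Arrow types: a schema is an ordered list of (name, type).
Schema = List[Tuple[str, str]]
# Hypothetical projection model: output name -> (source field, optional cast type).
Projection = Dict[str, Tuple[str, Optional[str]]]


def projected_schema(dataset_schema: Schema, projection: Projection) -> Schema:
    """Derive the output schema from the dataset schema and the projection."""
    types = dict(dataset_schema)
    out = []
    for out_name, (source, cast_to) in projection.items():
        if source not in types:
            raise KeyError(f"unknown field: {source}")
        # A cast overrides the source type; otherwise the type carries over.
        out.append((out_name, cast_to or types[source]))
    return out


def materialized_fields(
    dataset_schema: Schema, filter_fields: List[str], projection: Projection
) -> List[str]:
    """Fields that must be read: those referenced by the filter or the projection."""
    needed = set(filter_fields) | {source for source, _ in projection.values()}
    # Report them in dataset-schema order.
    return [name for name, _ in dataset_schema if name in needed]


dataset_schema = [("id", "int64"), ("price", "float64"), ("region", "string")]
projection = {"id": ("id", None), "cost": ("price", "float32")}  # rename + cast
filter_fields = ["region"]  # the filter references a column that is not projected

print(projected_schema(dataset_schema, projection))
print(materialized_fields(dataset_schema, filter_fields, projection))
```

In this model the projected schema contains only `id` and `cost`, while the materialized fields also include `region`, because the filter needs it even though it is not part of the output. That gap is why the two fields carry different information for reporting purposes.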