[jira] [Commented] (ARROW-12311) [Python][R] Expose (hide?) ScanOptions

David Li (Jira) Fri, 11 Feb 2022 14:49:04 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-12311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491199#comment-17491199
 ]


David Li commented on ARROW-12311:
----------------------------------

* I think the projection behavior is the same, but this is a "user friendly" 
way of putting it. I suppose that previously the projection could be something 
completely arbitrary, if you implemented your own kernel returning a Struct 
type, but I don't think that's useful to support. However, it seems we only 
support selecting/renaming/casting fields, and nothing more complex? (Oh wait, 
will the "scan node" become something distinct from the "scanner"? I'm a little 
confused here now.)
* Do we want to error on reordered fields? That could matter if we allow 
indices for projection.
* For multiple formats, we should perhaps consider the proposal in ARROW-11981: 
"Dataset could be simplified to a concrete class containing a set of compatibly 
typed/formatted Fragments". 
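
To make the first two bullets concrete, here is a rough sketch in plain Python. 
It is not the Arrow API: `ProjectedField`, `project_schema` and `is_reordered` 
are made-up names, and schemas are just lists of (name, type) pairs. It only 
assumes that a projection is limited to selecting, renaming and casting fields.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Schema = List[Tuple[str, str]]  # (field name, type name); hypothetical encoding


@dataclass(frozen=True)
class ProjectedField:
    source: str                    # field name in the dataset schema
    name: Optional[str] = None     # output name; None keeps the source name
    cast_to: Optional[str] = None  # output type; None keeps the source type


def project_schema(dataset_schema: Schema, projection: List[ProjectedField]) -> Schema:
    """Derive the output schema of a select/rename/cast-only projection."""
    types = dict(dataset_schema)
    projected = []
    for field in projection:
        if field.source not in types:
            raise KeyError(f"unknown field: {field.source}")
        projected.append((field.name or field.source,
                          field.cast_to or types[field.source]))
    return projected


def is_reordered(dataset_schema: Schema, projection: List[ProjectedField]) -> bool:
    """True if the projection lists fields out of dataset-schema order."""
    names = [name for name, _ in dataset_schema]
    positions = [names.index(field.source) for field in projection]
    return positions != sorted(positions)


dataset_schema = [("a", "int32"), ("b", "string"), ("c", "float64")]
projection = [ProjectedField("c", cast_to="float32"), ProjectedField("a", name="id")]
projected = project_schema(dataset_schema, projection)
reordered = is_reordered(dataset_schema, projection)
```

With index-based projection, a check like `is_reordered` is where an error 
could be raised if we decide reordered fields should not be accepted.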

> [Python][R] Expose (hide?) ScanOptions
> --------------------------------------
>
>                 Key: ARROW-12311
>                 URL: https://issues.apache.org/jira/browse/ARROW-12311
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python, R
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 8.0.0
>
>
> Currently R completely hides the `ScanOptions` class.
> In Python the class is exposed but the documentation prefers `dataset.scan` 
> (which hides both the scanner and the scan options).
> However, there is some useful information in the `ScanOptions`.  
> Specifically, the projected schema (which is a product of the dataset schema 
> and the projection expression and not easily recreated) and the materialized 
> fields (the list of fields referenced by either the filter or the projection) 
> which might be useful for reporting purposes.
> Currently R uses the projected schema to convert a list of column names into 
> a partition schema.  Python does not rely on either field.
>  
> Options:
>  - Keep the status quo
>  - Expose the ScanOptions object (which itself is exposed via the Scanner)
>  - Expose the interesting fields via the Scanner
>  
> Currently the C++ design is halfway between the latter two (both the 
> projected schema and the options are exposed).  My preference would be the 
> third option.  It raises a further question about how to expose the scanner 
> itself in Python.  Should the user be using ScannerBuilder?  Should they use 
> NewScan?  Should they use the scanner directly at all or should it be hidden?
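
As an illustration of the "materialized fields" described in the issue (the 
fields referenced by either the filter or the projection), here is a toy sketch 
in plain Python. It does not use pyarrow; the tuple encoding of expressions and 
the names `referenced_fields` and `materialized_fields` are hypothetical.

```python
def referenced_fields(expr):
    """Collect field names referenced by an expression.

    Hypothetical encoding: ("field", name), ("literal", value),
    or ("call", function_name, [argument expressions]).
    """
    kind = expr[0]
    if kind == "field":
        return {expr[1]}
    if kind == "literal":
        return set()
    if kind == "call":
        found = set()
        for arg in expr[2]:
            found |= referenced_fields(arg)
        return found
    raise ValueError(f"unknown expression kind: {kind}")


def materialized_fields(dataset_schema, filter_expr, projection_exprs):
    """Fields needed by the filter or the projection, in dataset-schema order."""
    needed = referenced_fields(filter_expr)
    for expr in projection_exprs.values():
        needed |= referenced_fields(expr)
    return [name for name, _ in dataset_schema if name in needed]


dataset_schema = [("a", "int32"), ("b", "string"), ("c", "float64")]
filter_expr = ("call", "greater", [("field", "c"), ("literal", 0)])
projection_exprs = {"a_plus_one": ("call", "add", [("field", "a"), ("literal", 1)])}
materialized = materialized_fields(dataset_schema, filter_expr, projection_exprs)
```

Here `b` is never read, so only `a` and `c` would need to be materialized.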



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
