[
https://issues.apache.org/jira/browse/ARROW-12311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492169#comment-17492169
]
Weston Pace commented on ARROW-12311:
-------------------------------------
> I suppose before you could have the projection be something completely
> arbitrary, if you implemented your own kernel returning a Struct type, but I
> don't think that's useful to support.
Agreed. This is a little less flexible, but pretty soon users will be able to
issue full-fledged queries through Ibis, etc., so they can still go that route
and use an actual project node.
> Oh wait - "scan node" will become something distinct from the "scanner"? I'm
> a little confused here now.
It's already confusing. {{ScanNode}} takes a {{ScanOptions}} and creates a
{{Scanner}} yet a {{Scanner}} takes a {{ScanOptions}} and creates a
{{ScanNode}} (?!). Scanner is both a high level "lightweight query plan
producer" and a low level "file and dataset reading" utility. The low level
utility can probably become more and more internal. Users will either interact
with the "scan node" via Ibis, dplyr, etc. or, if they don't want to use a
third-party query plan producer, use Scanner, which is more of a substitute
for those libraries with minimal functionality.
If we don't want to lose behavior in the high-level "convenience methods" (the
ones that belong in the "lightweight query plan producer") like "ToTable",
"ScanBatches", "Head", "Count", then we could have two "scan options" type
classes: a lower level ScanOptions that is consumed by the scan node and takes
the limited projection, and a higher level ScanOptions that is consumed by the
high-level methods and creates a project node as part of the plan. The
convenience methods would have to extract the lower level projection out of the
higher level dictionary of expressions, but we already have logic to do that
kind of thing today.
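As a rough illustration of that extraction step, here is a minimal Python sketch. It does not use the Arrow API: {{Field}}, {{Call}}, {{LowLevelScanOptions}} and {{HighLevelScanOptions}} are hypothetical stand-ins, and the expression model is a toy. It only shows how the list of columns the scan node must read could be derived from a dictionary of projection expressions.

```python
from dataclasses import dataclass
from typing import Dict, List


# Toy expression model. These classes are hypothetical and are NOT
# Arrow's Expression type; they exist only to make the sketch runnable.
@dataclass(frozen=True)
class Field:
    name: str


@dataclass(frozen=True)
class Call:
    function: str
    args: tuple


def referenced_fields(expr) -> List[str]:
    """Return the field names an expression refers to, in first-seen order."""
    if isinstance(expr, Field):
        return [expr.name]
    if isinstance(expr, Call):
        names: List[str] = []
        for arg in expr.args:
            for name in referenced_fields(arg):
                if name not in names:
                    names.append(name)
        return names
    return []  # anything else is treated as a literal


@dataclass
class LowLevelScanOptions:
    """Assumed shape of what the scan node consumes: plain column selection."""
    columns: List[str]


@dataclass
class HighLevelScanOptions:
    """Assumed shape of what the convenience methods consume."""
    projection: Dict[str, object]  # output name -> expression

    def to_low_level(self) -> LowLevelScanOptions:
        # Collect every column any projection expression needs. The
        # expressions themselves would be evaluated later by a project node.
        columns: List[str] = []
        for expr in self.projection.values():
            for name in referenced_fields(expr):
                if name not in columns:
                    columns.append(name)
        return LowLevelScanOptions(columns)


high = HighLevelScanOptions(
    projection={
        "total": Call("add", (Field("a"), Field("b"))),
        "a": Field("a"),
        "two": 2,
    }
)
low = high.to_low_level()
```

In this sketch {{low.columns}} contains only {{a}} and {{b}}: the literal contributes nothing and the repeated reference to {{a}} is deduplicated. A real version would presumably also fold in fields referenced by the filter.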
> Do we want to error on reordered fields? That could matter if we allow
> indices for projection.
Good point, but I think that would mean the schema has to be identical (or maybe
a subset from left to right) since there isn't any good way to have a "hole" in
the schema. I think we want to allow users some way to specify the dataset
schema (i.e. the one the plan is built against and the output of the scan
operator) and some kind of "schema column resolution behavior". For example:
* Error if non-matching - Error unless the schema matches exactly
* Error if not-subset - Allow missing columns at the end but error on any other
mismatch
* Resolve by name - Requires unique names. Find the output column position by
looking for a column in the master schema with the same name
* Resolve by id - Same as resolve by name but use the Parquet field ID
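To make the difference between these behaviors concrete, here is a minimal Python sketch over plain lists of column keys. Everything in it is an assumption for illustration: the {{Resolution}} enum and {{resolve_columns}} are hypothetical names, not part of Arrow, and the choice to ignore extra file columns when resolving by key is mine. Resolving by name and by id share one mode here; the caller passes names or field IDs as the keys.

```python
from enum import Enum
from typing import Hashable, List, Optional, Sequence


class Resolution(Enum):
    # Hypothetical names for the behaviors listed above.
    ERROR_IF_NON_MATCHING = "error_if_non_matching"
    ERROR_IF_NOT_SUBSET = "error_if_not_subset"
    RESOLVE_BY_KEY = "resolve_by_key"  # keys are names or field IDs


def resolve_columns(
    dataset_keys: Sequence[Hashable],
    file_keys: Sequence[Hashable],
    mode: Resolution,
) -> List[Optional[int]]:
    """For each dataset column, return the index of the matching file
    column, or None if the file does not have it."""
    dataset_keys = list(dataset_keys)
    file_keys = list(file_keys)

    if mode is Resolution.ERROR_IF_NON_MATCHING:
        if file_keys != dataset_keys:
            raise ValueError("file schema does not match dataset schema")
        return list(range(len(dataset_keys)))

    if mode is Resolution.ERROR_IF_NOT_SUBSET:
        # The file may only be missing columns at the end.
        n = len(file_keys)
        if file_keys != dataset_keys[:n]:
            raise ValueError("file schema is not a leading subset")
        return list(range(n)) + [None] * (len(dataset_keys) - n)

    if mode is Resolution.RESOLVE_BY_KEY:
        for keys in (dataset_keys, file_keys):
            if len(set(keys)) != len(keys):
                raise ValueError("keys must be unique")
        # Assumption: file columns absent from the dataset schema are ignored.
        positions = {key: i for i, key in enumerate(file_keys)}
        return [positions.get(key) for key in dataset_keys]

    raise ValueError(f"unknown mode: {mode!r}")


by_key = resolve_columns(["a", "b", "c"], ["c", "a"], Resolution.RESOLVE_BY_KEY)
prefix = resolve_columns(["a", "b", "c"], ["a", "b"], Resolution.ERROR_IF_NOT_SUBSET)
```

A {{None}} entry marks a column the file lacks, which a reader would presumably fill with nulls. Only the by-key mode tolerates reordered fields; the first two modes reject them, which is what makes index-based projection safe under those modes.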
> For multiple formats, we should perhaps consider the proposal in ARROW-11981
That's a great idea. I agree completely and it helps keep things simple.
> [Python][R] Expose (hide?) ScanOptions
> --------------------------------------
>
> Key: ARROW-12311
> URL: https://issues.apache.org/jira/browse/ARROW-12311
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python, R
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
> Fix For: 8.0.0
>
>
> Currently R completely hides the `ScanOptions` class.
> In Python the class is exposed but the documentation prefers `dataset.scan`
> (which hides both the scanner and the scan options).
> However, there is some useful information in the `ScanOptions`.
> Specifically, the projected schema (which is a product of the dataset schema
> and the projection expression and not easily recreated) and the materialized
> fields (the list of fields referenced by either the filter or the projection)
> which might be useful for reporting purposes.
> Currently R uses the projected schema to convert a list of column names into
> a partition schema. Python does not rely on either field.
>
> Options:
> - Keep the status quo
> - Expose the ScanOptions object (which itself is exposed via the Scanner)
> - Expose the interesting fields via the Scanner
>
> Currently the C++ design is halfway between the latter two (both the projected
> schema and the options are exposed). My preference would be the third option.
> It raises a further question about how to expose the scanner itself in
> Python. Should the user be using ScannerBuilder? Should they use NewScan?
> Should they use the scanner directly at all or should it be hidden?