[ https://issues.apache.org/jira/browse/ARROW-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alessandro Molina updated ARROW-13340:
--------------------------------------
Fix Version/s: 7.0.0 (was: 6.0.0)
> [C++][Dataset] Simplify ScanOptions after complexity has moved to ScanNode
> --------------------------------------------------------------------------
>
> Key: ARROW-13340
> URL: https://issues.apache.org/jira/browse/ARROW-13340
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Ben Kietzman
> Priority: Major
> Labels: dataset
> Fix For: 7.0.0
>
>
> ScanOptions currently has a number of constraints between members, which
> violates the contract of a public struct:
> - {{filter}} must be bound to {{dataset_schema}}
> - {{projection}} must be bound to {{dataset_schema}}
> - {{projected_schema}} must be {{schema<...fields>}}, where the type of
> projection is {{struct<...fields>}}
> These are currently required to support {{FilterAndProjectScanTask}}, but
> after ARROW-13328 this complexity can be removed and ScanOptions can be a
> pure struct argument to {{MakeScanNode}}. Specifically, it should be possible
> to:
> - remove the {{projected_schema}} field (ScanNode doesn't need to know the
> schemas of any subsequent nodes)
> - remove the {{projection}} field (ScanNode doesn't need to know how or if
> scanned batches will be projected)
> - provide a simple vector of {{FieldRef}} to indicate which fields should be
> materialized (MakeScanNode can validate that this includes every field
> referenced by {{filter}})
> - allow {{filter}} to be unbound (MakeScanNode can bind it to the dataset
> schema)
> {{dataset_schema}} also seems slightly redundant, since MakeScanNode already
> takes a Dataset as an argument, but it is currently used by CsvFileFormat to
> derive column types.
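
To make the proposal above concrete, here is a minimal sketch of what a constraint-free options struct and a validating factory could look like. It is not Arrow code: ToySchema, ToyFilter, ToyScanOptions, ToyScanNode and MakeToyScanNode are hypothetical stand-ins, field names are plain strings instead of FieldRef, and "binding" is reduced to setting a flag. It only illustrates the two checks described in the issue (the materialized fields must cover every field the filter references, and the factory rather than the caller binds the filter).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT Arrow's real types.
struct ToySchema {
  std::vector<std::string> field_names;
};

// Toy filter: records only the field names it references and whether it
// has been bound to a schema yet.
struct ToyFilter {
  std::vector<std::string> referenced_fields;
  bool bound = false;
};

// The simplified options as a plain struct: no projection, no
// projected_schema, and no invariants tying the members together.
struct ToyScanOptions {
  ToyFilter filter;                              // may be left unbound
  std::vector<std::string> materialized_fields;  // stand-in for vector<FieldRef>
};

struct ToyScanNode {
  ToyFilter filter;                  // bound by the factory
  std::vector<std::string> columns;  // fields that will be read
};

static bool Contains(const std::vector<std::string>& names,
                     const std::string& name) {
  return std::find(names.begin(), names.end(), name) != names.end();
}

// Hypothetical factory: validates the options against the dataset schema,
// binds the filter, and fills *out. Returns false and sets *error on failure.
bool MakeToyScanNode(const ToySchema& dataset_schema,
                     const ToyScanOptions& options, ToyScanNode* out,
                     std::string* error) {
  // Every requested field must exist in the dataset schema.
  for (const auto& name : options.materialized_fields) {
    if (!Contains(dataset_schema.field_names, name)) {
      *error = "unknown field: " + name;
      return false;
    }
  }
  // Every field the filter references must be materialized.
  for (const auto& name : options.filter.referenced_fields) {
    if (!Contains(options.materialized_fields, name)) {
      *error = "filter references a field that is not materialized: " + name;
      return false;
    }
  }
  out->filter = options.filter;
  out->filter.bound = true;  // binding happens here, not in the caller
  out->columns = options.materialized_fields;
  return true;
}
```

With this shape, a caller can fill in the options in any order and hand over an unbound filter; all cross-member checks live in the factory, so the struct itself carries no hidden contract.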