Joris Van den Bossche created ARROW-11001:
---------------------------------------------
Summary: [C++][Dataset] Enable column renaming (in physical schema
-> dataset schema) in Dataset scanning
Key: ARROW-11001
URL: https://issues.apache.org/jira/browse/ARROW-11001
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
Currently, we allow dropping/adding columns when scanning the actual sources of
a Dataset (e.g. if newer files in the dataset have additional columns), but we
should also provide a way to specify fields that are renamed in certain files
of the dataset.
While it _might_ be possible to also provide some convenience for this in the
discovery factories, it's probably best to start by looking at how this could
be added to the actual {{Dataset}} class and the lower-level construction APIs
(such as the main {{FileSystemDataset}} constructor from fragments, or
{{from_paths}}).
What I am thinking right now is that we would need an (optional) mapping of
"field ref/name in physical schema -> name in projected/dataset schema" for
each fragment of a dataset.
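To make the idea concrete, here is a minimal sketch in plain Python of how such a per-fragment mapping could be applied. It is an illustration only, not an existing Arrow API: {{project_fragment}}, the {{renames}} argument and the dict-of-lists stand-in for a record batch are all hypothetical names chosen for this example. It assumes the rename is applied first, and the existing add/drop behaviour second.

```python
from typing import Dict, List, Optional

# Hypothetical names throughout; none of this is an existing Arrow API.
# A dict of column name -> values stands in for a record batch.
Batch = Dict[str, list]


def project_fragment(
    batch: Batch,
    dataset_columns: List[str],
    renames: Optional[Dict[str, str]] = None,
) -> Batch:
    """Project one fragment's data onto the dataset schema.

    `renames` maps "name in physical schema" -> "name in dataset schema".
    """
    renames = renames or {}
    # Rename physical columns first; unmapped columns keep their name.
    renamed = {renames.get(name, name): values for name, values in batch.items()}
    num_rows = len(next(iter(batch.values()), []))
    # Then mimic the add/drop behaviour: columns missing from the fragment
    # are filled with nulls, columns not in the dataset schema are dropped.
    return {name: renamed.get(name, [None] * num_rows) for name in dataset_columns}


dataset_columns = ["id", "value"]
old_file = {"identifier": [1, 2], "value": [0.5, 1.5]}  # uses the old column name
new_file = {"id": [3], "value": [2.5], "extra": ["x"]}  # already uses "id"

old_projected = project_fragment(old_file, dataset_columns, {"identifier": "id"})
new_projected = project_fragment(new_file, dataset_columns)
```

In this sketch the mapping is passed in alongside the dataset's column list at projection time rather than being stored on the fragment, which is one possible way to keep fragments unaware of the dataset schema.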
However, that might not fully fit the current design, as the fragment
doesn't know about the dataset schema; it only sees that schema when a
projection is applied.
cc [~bkietz]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)