[jira] [Created] (ARROW-11001) [C++][Dataset] Enable column renaming (in physical schema -> dataset schema) in Dataset scanning

Joris Van den Bossche (Jira) Mon, 21 Dec 2020 09:01:37 -0800

Joris Van den Bossche created ARROW-11001:
---------------------------------------------


             Summary: [C++][Dataset] Enable column renaming (in physical schema 
-> dataset schema) in Dataset scanning
                 Key: ARROW-11001
                 URL: https://issues.apache.org/jira/browse/ARROW-11001
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche


Currently, we allow dropping/adding columns when scanning the actual sources of 
a Dataset (e.g. if newer files in the dataset have additional columns), but we 
should also provide a way to specify fields that are renamed in certain files 
of the dataset.

While it _might_ be possible to also provide some convenience for this in the 
discovery factories, it's probably best to start by looking at how this could be 
added to the actual {{Dataset}} class and the lower-level constructors 
(such as the main {{FileSystemDataset}} constructor from fragments, or 
{{from_paths}}). 

What I am thinking right now is that we would need an (optional) mapping of 
"field ref/name in physical schema -> name in projected/dataset schema" for 
each fragment of a dataset. 
However, that might not fully fit the current design, as the fragment 
doesn't know about the dataset schema; it only sees that schema when it is 
projected. 
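
To make the idea a bit more concrete, here is a rough sketch in plain Python (not the Arrow API). Everything in it is hypothetical: {{project_fragment}}, the {{renames}} argument and the dict-of-lists stand-in for a fragment's data are made up for illustration only. It assumes the per-fragment mapping is applied first, and that the usual drop/add-as-null projection onto the dataset schema happens afterwards.

```python
from typing import Dict, List, Optional

# Stand-in for a fragment's data: column name -> values (hypothetical model,
# not an Arrow type).
Columns = Dict[str, List[Optional[int]]]


def project_fragment(
    physical: Columns,
    dataset_schema: List[str],
    renames: Optional[Dict[str, str]] = None,
) -> Columns:
    """Project one fragment's columns onto the dataset schema.

    `renames` is the optional per-fragment mapping of
    "name in physical schema -> name in dataset schema".
    """
    renames = renames or {}
    # Step 1: apply the fragment's mapping; unmapped columns keep their name.
    renamed = {renames.get(name, name): values for name, values in physical.items()}
    num_rows = len(next(iter(physical.values()), []))
    # Step 2: drop columns not in the dataset schema and fill missing ones
    # with nulls (modelled here as None).
    return {name: renamed.get(name, [None] * num_rows) for name in dataset_schema}


dataset_schema = ["id", "value"]
old_file = {"id": [1, 2], "val": [10, 20]}           # older file, column named "val"
new_file = {"id": [3], "value": [30], "extra": [0]}  # newer file, extra column

without_mapping = project_fragment(old_file, dataset_schema)
with_mapping = project_fragment(old_file, dataset_schema, {"val": "value"})
unchanged = project_fragment(new_file, dataset_schema)
```

In this sketch the old file's "val" column comes back as all nulls under "value" unless the mapping is supplied. The mapping is passed in at projection time rather than stored on the fragment, which is one possible way around the fragment not knowing the dataset schema; whether that fits the actual design is an open question.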

cc [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
