[ 
https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16483:
--------------------------------
    Target Version/s:   (was: 2.1.0)

> Unifying struct fields and columns
> ----------------------------------
>
>                 Key: SPARK-16483
>                 URL: https://issues.apache.org/jira/browse/SPARK-16483
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Simeon Simeonov
>              Labels: sql
>
> This issue comes as a result of an exchange with Michael Armbrust outside of 
> the usual JIRA/dev list channels. 
> DataFrame provides a full set of manipulation operations for top-level 
> columns: they can be added, removed, modified and renamed. The same is not 
> true of fields inside structs. Yet, from a logical standpoint, Spark users 
> may very well want to perform the same operations on struct fields, 
> especially since automatic schema discovery from JSON input tends to create 
> deeply nested structs.
> Common use-cases include:
> - Remove and/or rename struct field(s) to adjust the schema
> - Fix a data quality issue with a struct field (update/rewrite)
> Doing this with the existing API requires manually calling 
> {{named_struct}} and listing all fields, including the ones we don't want to 
> manipulate. This leads to complex, fragile code that cannot survive schema 
> evolution.
> It would be far better if the various APIs that can now manipulate top-level 
> columns were extended to handle struct fields at arbitrary locations or, 
> alternatively, if we introduced new APIs for modifying any field in a 
> dataframe, whether it is a top-level one or one nested inside a struct.
> Purely for discussion purposes, here is the skeleton implementation of an 
> update() implicit that we've used to modify any existing field in a dataframe. 
> (Note that it depends on various other utilities and implicits that are not 
> included). https://gist.github.com/ssimeonov/f98dcfa03cd067157fa08aaa688b0f66
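
As a rough illustration of the manual {{named_struct}} approach described in the issue, the following Python sketch builds the SQL expression needed to drop one field and rename another inside a struct. The helper name rebuild_struct_expr, the column name address, and its fields are hypothetical and are not part of Spark or of the original issue; the only Spark feature assumed is the named_struct SQL function. Every field of the struct has to be listed by hand, which is the source of the fragility the issue describes.

```python
def rebuild_struct_expr(struct_col, fields, drop=(), rename=None):
    """Build a named_struct(...) SQL expression that rebuilds struct_col.

    fields: every field name currently in the struct, listed by hand.
    drop:   field names to leave out of the rebuilt struct.
    rename: mapping of old field name -> new field name.
    """
    rename = rename or {}
    parts = []
    for name in fields:
        if name in drop:
            continue  # omit this field from the rebuilt struct
        new_name = rename.get(name, name)
        # named_struct takes alternating (name, value) arguments
        parts.append(f"'{new_name}', {struct_col}.{name}")
    return f"named_struct({', '.join(parts)}) AS {struct_col}"


# Hypothetical struct column "address" with three fields.
expr = rebuild_struct_expr(
    "address",
    ["street", "city", "zip"],
    drop={"zip"},
    rename={"city": "town"},
)
print(expr)
```

The resulting string could, for instance, be passed to DataFrame.selectExpr alongside the remaining columns. If a field is later added to the struct, the hard-coded field list silently leaves it out of the rebuilt struct, so the code has to be revisited on every schema change.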



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
