[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas reopened SPARK-16483:
--------------------------------------

Though this ticket has been bulk-closed, I still think it's a valuable 
potential improvement to the DataFrame API that would make working with nested 
datasets much more seamless and natural. (SPARK-18084 and SPARK-18277 are two 
hiccups I hit myself that would be addressed by this ticket.)

I hope this will get another look after Spark 3.0 is out.

> Unifying struct fields and columns
> ----------------------------------
>
>                 Key: SPARK-16483
>                 URL: https://issues.apache.org/jira/browse/SPARK-16483
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Simeon Simeonov
>            Priority: Major
>              Labels: bulk-closed, sql
>
> This issue comes as a result of an exchange with Michael Armbrust outside of 
> the usual JIRA/dev list channels.
> DataFrame provides a full set of manipulation operations for top-level 
> columns: they can be added, removed, modified, and renamed. The same is not 
> yet true of fields inside structs, even though, from a logical standpoint, 
> Spark users may very well want to perform the same operations on struct 
> fields, especially since automatic schema discovery from JSON input tends to 
> create deeply nested structs.
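> (For reference, all of these operations exist today for top-level columns 
> only; a minimal sketch assuming a DataFrame {{df}} with numeric columns 
> {{a}} and {{b}}:)
> {code:java}
> import org.apache.spark.sql.functions._
> val df2 = df
>   .withColumn("c", col("a") + col("b"))    // add
>   .withColumnRenamed("b", "b2")            // rename
>   .withColumn("a", col("a").cast("long"))  // modify (replace in place)
>   .drop("c")                               // remove
> {code}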
> Common use-cases include:
>  - Remove and/or rename struct field(s) to adjust the schema
>  - Fix a data quality issue with a struct field (update/rewrite)
> Doing this with the existing API requires calling {{named_struct}} by hand 
> and listing every field, including the ones we don't want to touch. This 
> leads to complex, fragile code that cannot survive schema evolution.
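> For example, renaming a single field in a struct column today means 
> rebuilding the entire struct (a minimal sketch; the column {{s}} and its 
> fields {{x}}, {{y}}, {{z}} are illustrative):
> {code:java}
> import org.apache.spark.sql.functions._
> // Rename s.y to y2: every other field of s must be re-listed by hand,
> // so any schema change to s silently breaks this code.
> val fixed = df.withColumn("s", struct(
>   col("s.x").as("x"),
>   col("s.y").as("y2"), // the one field we actually wanted to change
>   col("s.z").as("z")
> ))
> {code}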
> It would be far better if the various APIs that can now manipulate top-level 
> columns were extended to handle struct fields at arbitrary locations or, 
> alternatively, if we introduced new APIs for modifying any field in a 
> dataframe, whether it is a top-level one or one nested inside a struct.
> Purely for discussion purposes (overloaded methods are not shown):
> {code:java}
> class Column(val expr: Expression) extends Logging {
>   // ...
>   // matches Dataset.schema semantics
>   def schema: StructType
>   // matches Dataset.select() semantics
>   // '* support allows multiple new fields to be added easily, saving
>   // cumbersome repeated withColumn() calls
>   def select(cols: Column*): Column
>   // matches Dataset.withColumn() semantics of add or replace
>   def withColumn(colName: String, col: Column): Column
>   // matches Dataset.drop() semantics
>   def drop(colName: String): Column
> }
> class Dataset[T] ... {
>   // ...
>   // Equivalent to sparkSession.createDataFrame(toDF.rdd, newSchema)
>   def cast(newSchema: StructType): DataFrame
> }
> {code}
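> Under this proposal, the struct-rebuilding example above collapses to 
> something like the following (illustrative only; none of these {{Column}} 
> methods exist today):
> {code:java}
> // Rename s.y to y2 without enumerating the untouched fields x and z.
> val fixed = df.withColumn("s",
>   col("s").withColumn("y2", col("s.y")).drop("y"))
> {code}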
> The benefit of the above API is that it unifies the manipulation of 
> top-level and nested columns. Adding {{schema}} and {{select()}} to 
> {{Column}} allows for nested field reordering, casting, etc., which matters 
> in data exchange scenarios where field position is significant. That is also 
> the reason to add {{cast}} to {{Dataset}}: it improves consistency and 
> readability (with method chaining). Another way to think of {{Dataset.cast}} 
> is as the Spark-schema equivalent of {{Dataset.as}}: {{as}} is to {{cast}} 
> as an encodable Scala type is to a {{StructType}} instance.
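> For comparison, the closest existing equivalent of the proposed 
> {{Dataset.cast}} (assuming a target {{newSchema: StructType}} compatible 
> with the data) breaks method chaining and round-trips through the RDD API:
> {code:java}
> val recast = spark.createDataFrame(df.rdd, newSchema)
> {code}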


