Github user ssimeonov commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21840#discussion_r204245778

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -1234,6 +1234,8 @@ class Column(val expr: Expression) extends Logging {
        */
      def over(): Column = over(Window.spec)

    +  def copy(field: String, value: Column): Column = withExpr(StructCopy(expr, field, value.expr))
    --- End diff --

Some things to consider about the API:

- How is custom metadata associated with the updated field?
- How can a field be deleted?
- How can a field be added?
- When a field is added, where does it go in the schema? The only logical place is at the end, but that may not be what's desired in some cases.

Simply for discussion purposes (overloaded methods are not shown):

```scala
class Column(val expr: Expression) extends Logging {
  // ...

  // matches Dataset.schema semantics; errors on non-struct columns
  def schema: StructType

  // matches Dataset.select() semantics; errors on non-struct columns
  // '* support allows multiple new fields to be added easily, saving cumbersome repeated withColumn() calls
  def select(cols: Column*): Column

  // matches Dataset.withColumn() semantics of add or replace
  def withColumn(colName: String, col: Column): Column

  // matches Dataset.drop() semantics
  def drop(colName: String): Column
}
```

The benefit of the above API is that it unifies manipulating top-level and nested columns, which I would argue is very desirable. The addition of `schema` and `select()` allows for nested field reordering, casting, etc., which is important in data exchange scenarios where field position matters.

/cc @rxin
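For illustration only, here is a minimal usage sketch. The proposed `Column.withColumn` / `drop` / `select` calls are shown in comments because they do not exist in Spark today; the DataFrame `df`, the struct column `address`, and its fields are made-up examples. The uncommented code is ordinary Spark 2.x and shows the manual struct rebuild that the proposal would avoid.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical example data: a struct column `address` with fields `street` and `zip`.
val df = Seq(("a1", "Main St", "02134")).toDF("id", "street", "zip")
  .select($"id", struct($"street", $"zip").as("address"))

// With the sketched API (hypothetical, not existing Spark), nested edits would mirror Dataset semantics:
// df.withColumn("address", $"address".withColumn("zip", lit("94105")))          // add or replace a field
// df.withColumn("address", $"address".drop("street"))                           // delete a field
// df.withColumn("address", $"address".select($"*", lit("USA").as("country")))   // add fields / reorder / cast

// Today, the same single-field update requires rebuilding the whole struct by hand:
val updated = df.withColumn(
  "address",
  struct($"address.street".as("street"), lit("94105").as("zip"))
)
updated.printSchema()
```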