[
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17935327#comment-17935327
]
Andrew Otto commented on SPARK-23890:
-------------------------------------
[TIL|https://phabricator.wikimedia.org/T209453#10632894] that Iceberg supports
adding columns to structs nested inside map and array values, e.g. by referencing
a map's value struct with the {{.value}} column name!
[https://iceberg.apache.org/docs/nightly/spark-ddl/#alter-table-add-column]
{code:sql}
-- create a map column of struct key and struct value
ALTER TABLE prod.db.sample
ADD COLUMN points map<struct<x: int>, struct<a: int>>;

-- add a field to the value struct in a map. Using keyword 'value' to access the map's value column.
ALTER TABLE prod.db.sample
ADD COLUMN points.value.b int;
{code}
And, that works via Spark!
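For reference, here is roughly how that looks from Spark. This is a minimal sketch: the catalog, table, and column names come from the Iceberg docs snippet above, and the session is assumed to already be configured with an Iceberg catalog named "prod".
{code:scala}
// Minimal sketch: issuing the Iceberg nested-column DDL through Spark SQL.
// Assumes a SparkSession already configured with an Iceberg catalog named "prod".
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-nested-add-column")
  .getOrCreate()

// Create a map column whose key and value are both structs.
spark.sql("ALTER TABLE prod.db.sample ADD COLUMN points map<struct<x: int>, struct<a: int>>")

// Add a field to the map's value struct, using the 'value' keyword.
spark.sql("ALTER TABLE prod.db.sample ADD COLUMN points.value.b int")
{code}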
I guess there is a bug though:
[apache/iceberg#2962|https://github.com/apache/iceberg/issues/2962]
> Support DDL for adding nested columns to struct types
> -----------------------------------------------------
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 3.0.0
> Reporter: Andrew Otto
> Priority: Major
> Labels: bulk-closed, pull-request-available
> Fix For: 3.0.0
>
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE
> CHANGE COLUMN commands to Hive. This restriction was loosened in
> [https://github.com/apache/spark/pull/12714] to allow for those commands if
> they only change the column comment.
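> For illustration (table and column names are made up), the allowed vs. blocked
> cases look roughly like this:
> {code:scala}
> // Illustrative sketch of the restriction described above; `mydb.events` and
> // `id` are made-up names, and `id` is assumed to already be BIGINT.
> val spark = org.apache.spark.sql.SparkSession.builder().enableHiveSupport().getOrCreate()
>
> // Allowed: only the comment changes, the type stays the same.
> spark.sql("ALTER TABLE mydb.events CHANGE COLUMN id id BIGINT COMMENT 'event id'")
>
> // Blocked: the type changes, so Spark refuses to send this to Hive.
> // spark.sql("ALTER TABLE mydb.events CHANGE COLUMN id id STRING")
> {code}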
> Wikimedia has been evolving Parquet backed Hive tables with data originally
> from JSON events by adding newly found columns to the Hive table schema, via
> a Spark job we call 'Refine'. We do this by recursively merging an input
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and
> then issuing an ALTER TABLE statement to add the columns. However, because
> we allow for nested data types in the incoming JSON data, we make extensive
> use of struct type fields. In order to add newly detected fields in a nested
> data type, we must alter the struct column and append the nested struct
> field. This requires a CHANGE COLUMN that alters the column type. In reality,
> the 'type' of the column is not changing; it is just a new field being
> added to the struct, but to SQL this looks like a type change.
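> A simplified sketch of that merge-and-alter step (this is not the actual
> Refine code; the names below are illustrative):
> {code:scala}
> // Simplified sketch of the schema-evolution step described above, not the
> // actual Refine implementation. Field and table names are illustrative.
> import org.apache.spark.sql.types._
>
> // Recursively merge two struct schemas: keep the table's fields as-is
> // (recursing into nested structs) and append fields only present in the
> // incoming data.
> def mergeSchemas(table: StructType, incoming: StructType): StructType = {
>   val merged = table.fields.map { tf =>
>     incoming.fields.find(_.name.equalsIgnoreCase(tf.name)) match {
>       case Some(inf) => (tf.dataType, inf.dataType) match {
>         case (t: StructType, i: StructType) => tf.copy(dataType = mergeSchemas(t, i))
>         case _                              => tf
>       }
>       case None => tf
>     }
>   }
>   val newFields = incoming.fields.filterNot(f =>
>     table.fields.exists(_.name.equalsIgnoreCase(f.name)))
>   StructType(merged ++ newFields)
> }
>
> // When a struct column gains a field, Hive only offers CHANGE COLUMN with the
> // full (merged) struct type, e.g.:
> //   ALTER TABLE mydb.events CHANGE COLUMN meta meta struct<domain:string,request_id:string>
> {code}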
> -We were about to upgrade to Spark 2 but this new restriction in SQL DDL that
> can be sent to Hive will block us. I believe this is fixable by adding an
> exception in
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
> to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and
> destination type are both struct types, and the destination type only adds
> new fields.-
>
> In this [PR|https://github.com/apache/spark/pull/21012], I was told that the
> Spark 3 datasource v2 would support this.
> However, it is clear that it does not. There is an [explicit
> check|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L1441]
> and
> [test|https://github.com/apache/spark/blob/e3f46ed57dc063566cdb9425b4d5e02c65332df1/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala#L583]
> that prevents this from happening.
> This can be done via {{ALTER TABLE ADD COLUMN nested1.new_field1}}, but
> this is not supported for any datasource v1 sources.
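> For illustration, the v2 form is a one-liner (catalog/table name and field
> type are made up here):
> {code:scala}
> // Illustration only: against a DataSource V2 catalog (e.g. an Iceberg table),
> // a nested struct field can be appended directly; the same statement is not
> // supported for v1 sources. Catalog/table name and field type are made up.
> val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
> spark.sql("ALTER TABLE prod.db.sample ADD COLUMN nested1.new_field1 string")
> {code}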
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)