This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 8f3b215e1e5 Fixing schema evolution docs (#9729)
8f3b215e1e5 is described below
commit 8f3b215e1e5c384b1d04f554569e0d2da1be45b5
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Fri Sep 15 21:31:12 2023 -0400
Fixing schema evolution docs (#9729)
---
website/docs/schema_evolution.md | 359 ++++++++++++++++++++-------------------
1 file changed, 181 insertions(+), 178 deletions(-)
diff --git a/website/docs/schema_evolution.md b/website/docs/schema_evolution.md
index 51dd67a4205..e454cb249f5 100755
--- a/website/docs/schema_evolution.md
+++ b/website/docs/schema_evolution.md
@@ -6,186 +6,10 @@ toc: true
last_modified_at: 2022-04-27T15:59:57-04:00
---
-Schema evolution allows users to easily change the current schema of a Hudi
table to adapt to the data that is changing over time.
-As of 0.11.0 release, Spark SQL (Spark 3.1.x, 3.2.1 and above) DDL support for
Schema evolution has been added and is experimental.
-
-### Scenarios
-
-1. Columns (including nested columns) can be added, deleted, modified, and
moved.
-2. Partition columns cannot be evolved.
-3. You cannot add, delete, or perform operations on nested columns of the
Array type.
-
-## SparkSQL Schema Evolution and Syntax Description
-
-Before using schema evolution, pls set `spark.sql.extensions`. For Spark 3.2.1
and above,
-`spark.sql.catalog.spark_catalog` also need to be set.
-```shell
-# Spark SQL for spark 3.1.x
-spark-sql --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.11.1 \
---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---conf
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
-
-# Spark SQL for spark 3.2.1 and above
-spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 \
---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---conf
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
---conf
'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
-
-```
-After start spark-app, pls exec `set hoodie.schema.on.read.enable=true` to
enable schema evolution.
-
-:::note
-Currently, Schema evolution cannot disabled once being enabled.
-:::
-
-:::tip
-When use hive metastore, may encounter a problem:
`org.apache.hadoop.hive.ql.metadata.HiveException`: Unable to alter table. The
following columns have types incompatible with the existing columns in their
respective positions.
-
-Make sure disable `hive.metastore.disallow.incompatible.col.type.changes` in
hive side.
-:::
-
-### Adding Columns
-**Syntax**
-```sql
--- add columns
-ALTER TABLE tableName ADD COLUMNS(col_spec[, col_spec ...])
-```
-**Parameter Description**
-
-| Parameter | Description
|
-|:-----------------|:---------------------------------------------------------------------------------------------------------------------|
-| tableName | Table name
|
-| col_spec | Column specifications, consisting of five fields,
*col_name*, *col_type*, *nullable*, *comment*, and *col_position*. |
-
-**col_name** : name of the new column. It is mandatory.To add a sub-column to
a nested column, specify the full name of the sub-column in this field.
-
-For example:
-
-1. To add sub-column col1 to a nested struct type column column users
struct<name: string, age: int>, set this field to users.col1.
-
-2. To add sub-column col1 to a nested map type column memeber map<string,
struct<n: string, a: int>>, set this field to member.value.col1.
-
-**col_type** : type of the new column.
-
-**nullable** : whether the new column can be null. The value can be left
empty. Now this field is not used in Hudi.
-
-**comment** : comment of the new column. The value can be left empty.
-
-**col_position** : position where the new column is added. The value can be
*FIRST* or *AFTER* origin_col.
-
-1. If it is set to *FIRST*, the new column will be added to the first column
of the table.
-
-2. If it is set to *AFTER* origin_col, the new column will be added after
original column origin_col.
-
-3. The value can be left empty. *FIRST* can be used only when new sub-columns
are added to nested columns. Do not use *FIRST* in top-level columns. There are
no restrictions about the usage of *AFTER*.
-
-**Examples**
-
-```sql
-ALTER TABLE h0 ADD COLUMNS(ext0 string);
-ALTER TABLE h0 ADD COLUMNS(new_col int not null comment 'add new column' AFTER
col1);
-ALTER TABLE complex_table ADD COLUMNS(col_struct.col_name string comment 'add
new column to a struct col' AFTER col_from_col_struct);
-```
-
-### Altering Columns
-**Syntax**
-```sql
--- alter table ... alter column
-ALTER TABLE tableName ALTER [COLUMN] col_old_name TYPE column_type [COMMENT]
col_comment[FIRST|AFTER] column_name
-```
-
-**Parameter Description**
-
-| Parameter | Description
|
-|:-----------------|:------------------------------------------------------------------------------------------------------------------------------------------------|
-| tableName | Table name.
|
-| col_old_name | Name of the column to be altered.
|
-| column_type | Type of the target column.
|
-| col_comment | col_comment.
|
-| column_name | New position to place the target column. For example,
*AFTER* **column_name** indicates that the target column is placed after
**column_name**. |
-
-
-**Examples**
-
-```sql
---- Changing the column type
-ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
-
---- Altering other attributes
-ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
-ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
-ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
-ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
-```
-
-**column type change**
-
-| Source\Target | long | float | double | string | decimal | date | int |
-|--------------------|-------|-------|--------|--------|---------|------|-----|
-| int | Y | Y | Y | Y | Y | N | Y |
-| long | Y | N | Y | Y | Y | N | N |
-| float | N | Y | Y | Y | Y | N | N |
-| double | N | N | Y | Y | Y | N | N |
-| decimal | N | N | N | Y | Y | N | N |
-| string | N | N | N | Y | Y | Y | N |
-| date | N | N | N | Y | N | Y | N |
-
-### Deleting Columns
-**Syntax**
-```sql
--- alter table ... drop columns
-ALTER TABLE tableName DROP COLUMN|COLUMNS cols
-```
-
-**Examples**
-
-```sql
-ALTER TABLE table1 DROP COLUMN a.b.c
-ALTER TABLE table1 DROP COLUMNS a.b.c, x, y
-```
-
-### Changing Column Name
-**Syntax**
-```sql
--- alter table ... rename column
-ALTER TABLE tableName RENAME COLUMN old_columnName TO new_columnName
-```
-
-**Examples**
-
-```sql
-ALTER TABLE table1 RENAME COLUMN a.b.c TO x
-```
-
-### Modifying Table Properties
-**Syntax**
-```sql
--- alter table ... set|unset
-ALTER TABLE tableName SET|UNSET tblproperties
-```
-
-**Examples**
-
-```sql
-ALTER TABLE table SET TBLPROPERTIES ('table_property' = 'property_value')
-ALTER TABLE table UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
-```
-
-### Changing a Table Name
-**Syntax**
-```sql
--- alter table ... rename
-ALTER TABLE tableName RENAME TO newTableName
-```
-
-**Examples**
-
-```sql
-ALTER TABLE table1 RENAME TO table2
-```
+Schema evolution is an important aspect of data management. Hudi supports some schema changes out of the box,
+while others need additional configuration.
## Out-of-the-box Schema Evolution
-Schema evolution is a very important aspect of data management.
Hudi supports common schema evolution scenarios, such as adding a nullable
field or promoting a datatype of a field, out-of-the-box.
Furthermore, the evolved schema is queryable across engines, such as Presto,
Hive and Spark SQL.
The following table presents a summary of the types of schema changes
compatible with different Hudi table types.
@@ -208,6 +32,8 @@ The following table presents a summary of the types of schema changes compatible
Let us walk through an example to demonstrate the schema evolution support in
Hudi.
In the below example, we are going to add a new string field and change the
datatype of a field from int to long.
+### Sample runbook
+
```java
Welcome to
____ __
@@ -370,3 +196,180 @@ scala> spark.sql("select rowId, partitionId, preComb, name, versionId, intToLong
+-----+-----------+-------+-------+---------+---------+----------+
```
+
+## Comprehensive Schema evolution (SparkSQL)
+Based on community needs, Hudi also supports comprehensive schema evolution. As of the 0.11.0 release, Spark SQL
+(Spark 3.1.x, 3.2.1 and above) DDL support for comprehensive schema evolution has been added and is experimental.
+
+### Scenarios
+
+1. Columns (including nested columns) can be added, deleted, modified, and
moved.
+2. Partition columns cannot be evolved.
+3. You cannot add, delete, or perform operations on nested columns of the
Array type.
+
+Before using schema evolution, set `spark.sql.extensions`. For Spark 3.2.1 and above,
+`spark.sql.catalog.spark_catalog` also needs to be set.
+```shell
+# Spark SQL for spark 3.1.x
+spark-sql --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.11.1 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
+
+# Spark SQL for spark 3.2.1 and above
+spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
+--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
+
+```
+After starting the Spark application, run `set hoodie.schema.on.read.enable=true` to enable schema evolution.
+
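The version-dependent settings above can be captured in a small helper. The following is only a sketch: the function name `spark_sql_command` is hypothetical, the bundle coordinates and class names are copied from the commands above, and it assumes that "3.2.1 and above" covers every later Spark version.

```python
from typing import List

# Values copied from the spark-sql commands shown above.
HUDI_VERSION = "0.11.1"
EXTENSION = "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
CATALOG = "org.apache.spark.sql.hudi.catalog.HoodieCatalog"


def spark_sql_command(spark_version: str) -> List[str]:
    """Return the spark-sql argument list for a Spark version (hypothetical helper)."""
    major, minor, patch = (int(part) for part in spark_version.split("."))
    if (major, minor) == (3, 1):
        bundle = f"org.apache.hudi:hudi-spark3.1.2-bundle_2.12:{HUDI_VERSION}"
        needs_catalog = False
    elif (major, minor, patch) >= (3, 2, 1):
        # Assumption: every version from 3.2.1 onward uses the same bundle.
        bundle = f"org.apache.hudi:hudi-spark3-bundle_2.12:{HUDI_VERSION}"
        needs_catalog = True
    else:
        raise ValueError(f"Spark {spark_version} is not covered by this section")

    args = [
        "spark-sql",
        "--packages", bundle,
        "--conf", "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        "--conf", f"spark.sql.extensions={EXTENSION}",
    ]
    if needs_catalog:
        # Only Spark 3.2.1 and above also need the Hudi catalog.
        args += ["--conf", f"spark.sql.catalog.spark_catalog={CATALOG}"]
    return args


if __name__ == "__main__":
    print(" ".join(spark_sql_command("3.2.1")))
```

The list form can be passed directly to `subprocess.run`; schema evolution itself still has to be enabled inside the session with `set hoodie.schema.on.read.enable=true`.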
+:::note
+Currently, schema evolution cannot be disabled once it has been enabled.
+:::
+
+:::tip
+When using the Hive metastore, you may encounter the following error: `org.apache.hadoop.hive.ql.metadata.HiveException`: Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions.
+
+To resolve it, disable `hive.metastore.disallow.incompatible.col.type.changes` on the Hive side.
+:::
+
+### Adding Columns
+**Syntax**
+```sql
+-- add columns
+ALTER TABLE tableName ADD COLUMNS(col_spec[, col_spec ...])
+```
+**Parameter Description**
+
+| Parameter | Description |
+|:----------|:------------|
+| tableName | Table name. |
+| col_spec  | Column specifications, consisting of five fields: *col_name*, *col_type*, *nullable*, *comment*, and *col_position*. |
+
+**col_name** : name of the new column. It is mandatory. To add a sub-column to a nested column, specify the full name of the sub-column in this field.
+
+For example:
+
+1. To add sub-column col1 to a nested struct type column users struct<name: string, age: int>, set this field to users.col1.
+
+2. To add sub-column col1 to a nested map type column member map<string, struct<n: string, a: int>>, set this field to member.value.col1.
+
+**col_type** : type of the new column.
+
+**nullable** : whether the new column can be null. The value can be left empty. Currently, this field is not used in Hudi.
+
+**comment** : comment of the new column. The value can be left empty.
+
+**col_position** : position where the new column is added. The value can be *FIRST* or *AFTER* origin_col.
+
+1. If it is set to *FIRST*, the new column is added as the first column.
+
+2. If it is set to *AFTER* origin_col, the new column is added after the original column origin_col.
+
+3. The value can be left empty. *FIRST* can be used only when new sub-columns are added to nested columns. Do not use *FIRST* for top-level columns. There are no restrictions on the usage of *AFTER*.
+
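As an illustration of the `col_spec` rules above, the sketch below assembles an `ADD COLUMNS` statement and rejects *FIRST* on a top-level column. The helper `add_column_sql` is hypothetical and not part of Hudi; it assumes that a dot in `col_name` marks a sub-column and that the comment contains no single quotes.

```python
from typing import Optional


def add_column_sql(table: str, col_name: str, col_type: str,
                   comment: Optional[str] = None,
                   position: Optional[str] = None) -> str:
    """Build an ALTER TABLE ... ADD COLUMNS statement (hypothetical helper)."""
    # Assumption: a full name such as users.col1 identifies a nested sub-column.
    is_nested = "." in col_name
    spec = f"{col_name} {col_type}"

    if comment is not None:
        if "'" in comment:
            raise ValueError("comments containing single quotes are not handled here")
        spec += f" comment '{comment}'"

    if position is not None:
        parts = position.split()
        if len(parts) == 1 and parts[0].upper() == "FIRST":
            # FIRST is only allowed when adding a sub-column to a nested column.
            if not is_nested:
                raise ValueError("FIRST cannot be used for top-level columns")
            spec += " FIRST"
        elif len(parts) == 2 and parts[0].upper() == "AFTER":
            spec += f" AFTER {parts[1]}"
        else:
            raise ValueError("position must be 'FIRST' or 'AFTER <origin_col>'")

    return f"ALTER TABLE {table} ADD COLUMNS({spec})"


if __name__ == "__main__":
    print(add_column_sql("complex_table", "col_struct.col_name", "string",
                         comment="add new column to a struct col",
                         position="AFTER col_from_col_struct"))
```

The generated string would then be passed to `spark.sql(...)` in a session where schema evolution is enabled.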
+**Examples**
+
+```sql
+ALTER TABLE h0 ADD COLUMNS(ext0 string);
+ALTER TABLE h0 ADD COLUMNS(new_col int not null comment 'add new column' AFTER col1);
+ALTER TABLE complex_table ADD COLUMNS(col_struct.col_name string comment 'add new column to a struct col' AFTER col_from_col_struct);
+```
+
+### Altering Columns
+**Syntax**
+```sql
+-- alter table ... alter column
+ALTER TABLE tableName ALTER [COLUMN] col_old_name TYPE column_type [COMMENT] col_comment [FIRST|AFTER] column_name
+```
+
+**Parameter Description**
+
+| Parameter    | Description |
+|:-------------|:------------|
+| tableName    | Table name. |
+| col_old_name | Name of the column to be altered. |
+| column_type  | Type of the target column. |
+| col_comment  | Comment of the column. |
+| column_name  | New position to place the target column. For example, *AFTER* **column_name** indicates that the target column is placed after **column_name**. |
+
+
+**Examples**
+
+```sql
+--- Changing the column type
+ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
+
+--- Altering other attributes
+ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
+ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
+ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
+ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
+```
+
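+**Column type change**
+
+In the table below, each row is a source type and each column is a target type. Y means the change is supported and N means it is not.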
+
+| Source\Target | long | float | double | string | decimal | date | int |
+|--------------------|-------|-------|--------|--------|---------|------|-----|
+| int | Y | Y | Y | Y | Y | N | Y |
+| long | Y | N | Y | Y | Y | N | N |
+| float | N | Y | Y | Y | Y | N | N |
+| double | N | N | Y | Y | Y | N | N |
+| decimal | N | N | N | Y | Y | N | N |
+| string | N | N | N | Y | Y | Y | N |
+| date | N | N | N | Y | N | Y | N |
+
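The matrix above can be encoded as a lookup so that a type change is checked before the `ALTER COLUMN ... TYPE` statement is issued. This is a minimal sketch: `can_change_type` is a hypothetical name, and the Y/N values are transcribed from the table, not read from Hudi itself.

```python
# Target types in the same order as the table header.
TARGETS = ["long", "float", "double", "string", "decimal", "date", "int"]

# One string per source type; each character is the Y/N cell for the
# target at the same index in TARGETS.
MATRIX = {
    "int":     "YYYYYNY",
    "long":    "YNYYYNN",
    "float":   "NYYYYNN",
    "double":  "NNYYYNN",
    "decimal": "NNNYYNN",
    "string":  "NNNYYYN",
    "date":    "NNNYNYN",
}


def can_change_type(source: str, target: str) -> bool:
    """Return True if the table marks source -> target as supported."""
    if source not in MATRIX or target not in TARGETS:
        raise KeyError(f"type not listed in the table: {source} -> {target}")
    return MATRIX[source][TARGETS.index(target)] == "Y"


if __name__ == "__main__":
    for src, dst in [("int", "long"), ("long", "int"), ("string", "date")]:
        print(src, "->", dst, can_change_type(src, dst))
```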
+### Deleting Columns
+**Syntax**
+```sql
+-- alter table ... drop columns
+ALTER TABLE tableName DROP COLUMN|COLUMNS cols
+```
+
+**Examples**
+
+```sql
+ALTER TABLE table1 DROP COLUMN a.b.c
+ALTER TABLE table1 DROP COLUMNS a.b.c, x, y
+```
+
+### Changing Column Name
+**Syntax**
+```sql
+-- alter table ... rename column
+ALTER TABLE tableName RENAME COLUMN old_columnName TO new_columnName
+```
+
+**Examples**
+
+```sql
+ALTER TABLE table1 RENAME COLUMN a.b.c TO x
+```
+
+### Modifying Table Properties
+**Syntax**
+```sql
+-- alter table ... set|unset
+ALTER TABLE tableName SET|UNSET tblproperties
+```
+
+**Examples**
+
+```sql
+ALTER TABLE table SET TBLPROPERTIES ('table_property' = 'property_value')
+ALTER TABLE table UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
+```
+
+### Changing a Table Name
+**Syntax**
+```sql
+-- alter table ... rename
+ALTER TABLE tableName RENAME TO newTableName
+```
+
+**Examples**
+
+```sql
+ALTER TABLE table1 RENAME TO table2
+```