Re: [PR] docs: Add release 1.2.0 documentation [hudi]

via GitHub Fri, 29 May 2026 00:59:30 -0700


yihua commented on code in PR #18880:
URL: https://github.com/apache/hudi/pull/18880#discussion_r3322964920



##########
website/versioned_docs/version-1.2.0/sql_ddl.md:
##########
@@ -0,0 +1,1218 @@
+---
+title: SQL DDL
+summary: "In this page, we discuss using SQL DDL commands with Hudi"
+toc: true
+last_modified_at: 2026-05-29T00:00:00-00:00
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+This page describes support for creating and altering tables using SQL across 
various engines. 
+
+## Spark SQL
+
+### Create table 
+
+You can create tables using standard CREATE TABLE syntax, which supports 
partitioning and passing table properties.
+
+```sql
+CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
+  [(col_name data_type [COMMENT col_comment], ...)]
+  [COMMENT table_comment]
+  [PARTITIONED BY (col_name, ...)]
+  [ROW FORMAT row_format]
+  [STORED AS file_format]
+  [LOCATION path]
+  [TBLPROPERTIES (property_name=property_value, ...)]
+  [AS select_statement];
+```
+
+:::note NOTE:
+For users running this tutorial locally and have a Spark-Hive(HMS) integration 
in their environment: If you use 
+`default` database or if you don't provide `[LOCATION path]` with the DDL 
statement, Spark will return 
+`java.io.IOException: Mkdirs failed to create 
file:/user/hive/warehouse/hudi_table/.hoodie` error. 
+To get around this, you can follow either of the two options mentioned below:
+1. Create a database i.e. `CREATE DATABASE hudidb;` and use it i.e. `USE 
hudidb;` before running the DDL statement.
+2. Or provide a path using `LOCATION` keyword to persist the data with the DDL 
statement.
+:::
+
+### Create non-partitioned table
+
+Creating a non-partitioned table is as simple as creating a regular table.
+
+```sql
+-- create a Hudi table
+CREATE TABLE IF NOT EXISTS hudi_table (
+  id INT,
+  name STRING,
+  price DOUBLE
+) USING hudi;
+```
+
+### Create partitioned table
+A partitioned table can be created by adding a `partitioned by` clause. 
Partitioning helps to organize the data into multiple folders 
+based on the partition columns. It can also help speed up queries and index 
lookups by limiting the amount of metadata, index and data scanned.
+
+```sql
+CREATE TABLE IF NOT EXISTS hudi_table_partitioned (
+  id BIGINT,
+  name STRING,
+  dt STRING,
+  hh STRING
+) USING hudi
+TBLPROPERTIES (
+  type = 'cow'
+)
+PARTITIONED BY (dt);
+```
+
+:::note
+You can also create a table partitioned by multiple fields by supplying 
comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify 
the columns in the `PARTITIONED BY` clause 
+in the same order as they appear in the `CREATE TABLE` schema. For example, 
for the above table, the partition fields 
+should be specified as `PARTITIONED BY (dt, hh)`.
+:::
+
+### Create table with record keys and ordering fields
+
+As discussed [here](quick-start-guide.md#keys), tables track each record in 
the table using a record key. Hudi auto-generated a highly compressed 
+key for each new record in the examples so far. If you want to use an existing 
field as the key, you can set the `primaryKey` option. 
+Typically, this is also accompanied by configuring ordering fields (via 
`orderingFields` option) to deal with out-of-order data and potential 
+duplicate records with the same key in the incoming writes.
+
+:::note
+You can choose multiple fields as primary keys for a given table on a need 
basis. For eg, "primaryKey = 'id, name'", and
+this materializes a composite key of the two fields, which can be useful for 
exploring the table.
+:::
+
+Here is an example of creating a table using both options. Typically, a field 
that denotes the time of the event or
+fact, e.g., order creation time, event generation time etc., is used as the 
ordering field (via `orderingFields`). Hudi resolves multiple versions
+of the same record by ordering based on this field when queries are run on the 
table.
+
+```sql
+CREATE TABLE IF NOT EXISTS hudi_table_keyed (
+  id INT,
+  name STRING,
+  price DOUBLE,
+  ts BIGINT
+) USING hudi
+TBLPROPERTIES (
+  type = 'cow',
+  primaryKey = 'id',
+  orderingFields = 'ts'
+);
+```
+
+### Create table with merge modes {#create-table-with-record-merge-mode}
+
+Hudi supports different [record merge modes](record_merger.md) to handle merge 
of incoming records with existing
+records. To create a table with specific record merge mode, you can set 
`recordMergeMode` option.
+
+```sql
+CREATE TABLE IF NOT EXISTS hudi_table_merge_mode (
+  id INT,
+  name STRING,
+  ts LONG,
+  price DOUBLE
+) USING hudi
+TBLPROPERTIES (
+  type = 'mor',
+  primaryKey = 'id',
+  orderingFields = 'ts',
+  recordMergeMode = 'EVENT_TIME_ORDERING'
+)
+LOCATION 'file:///tmp/hudi_table_merge_mode/';
+```
+
+With `EVENT_TIME_ORDERING`, the record with the larger event time (specified 
via `orderingFields`) overwrites the record with the
+smaller event time on the same key, regardless of transaction's commit time. 
Users can set `CUSTOM` mode to provide their own
+merge logic. With `CUSTOM` merge mode, you can provide a custom class that 
implements the merge logic. The interfaces 
+to implement is explained in detail [here](record_merger.md#custom).
+
+```sql
+CREATE TABLE IF NOT EXISTS hudi_table_merge_mode_custom (
+  id INT,
+  name STRING,
+  ts LONG,
+  price DOUBLE
+) USING hudi
+TBLPROPERTIES (
+  type = 'mor',
+  primaryKey = 'id',
+  orderingFields = 'ts',
+  recordMergeMode = 'CUSTOM',
+  'hoodie.record.merge.strategy.id' = '<unique-uuid>'
+)
+LOCATION 'file:///tmp/hudi_table_merge_mode_custom/';
+```
+
+### Create table with unstructured and semi-structured column types 
{#create-table-with-unstructured-and-semi-structured-column-types}
+
+Hudi supports three column types for unstructured and semi-structured data, 
plus a Lance base file
+format.
+
+#### `VECTOR(dim[, elementType])` {#vector}
+
+A fixed-dimension embedding column. `elementType` is one of `FLOAT` (default), 
`DOUBLE`, or `INT8`
+(alias `BYTE`):
+
+| Element type | Storage |
+|:-------------|:--------|
+| `FLOAT` (default) | `ArrayType(FloatType)` |
+| `DOUBLE` | `ArrayType(DoubleType)` |
+| `INT8` / `BYTE` | `ArrayType(ByteType)` |
+
+```sql
+CREATE TABLE products (
+    product_id   STRING,
+    name         STRING,
+    embedding    VECTOR(768)          -- defaults to FLOAT
+) USING hudi
+TBLPROPERTIES (
+    primaryKey = 'product_id',
+    type = 'cow',
+    hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger'
+);

Review Comment:
   This is a new feature thus the new docs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] docs: Add release 1.2.0 documentation [hudi]

Reply via email to