bhasudha commented on code in PR #9622: URL: https://github.com/apache/hudi/pull/9622#discussion_r1324472856
########## website/docs/quick-start-guide.md: ########## @@ -246,67 +246,86 @@ Spark SQL needs an explicit create table command. **Table Concepts** -- Table types +- **Table types** Both Hudi's table types, Copy-On-Write (COW) and Merge-On-Read (MOR), can be created using Spark SQL. While creating the table, table type can be specified using **type** option: **type = 'cow'** or **type = 'mor'**. -- Partitioned & Non-Partitioned tables +- **Partitioned & Non-Partitioned tables** Users can create a partitioned table or a non-partitioned table in Spark SQL. To create a partitioned table, one needs to use **partitioned by** statement to specify the partition columns to create a partitioned table. When there is no **partitioned by** statement with create table command, table is considered to be a non-partitioned table. -- Managed & External tables +- **Primary keyed table** - In general, Spark SQL supports two kinds of tables, namely managed and external. If one specifies a location using ** - location** statement or use `create external table` to create table explicitly, it is an external table, else its - considered a managed table. You can read more about external vs managed - tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/). + Optionally users can choose to create a Primary keyed table. When primary key is set for a given table, + Hudi ensures uniqueness during updates and deletes. Each record is uniquely identified by the primary key configuration. + If primary key is not set, Hudi treats it as key less table and every record ingested is treated as a new record even + if contents match. -*Read more in the [table management](/docs/table_management) guide.* +:::note +1. Since Hudi 0.14.0, users can create key less table or primary keyed table as per necessity. If 'primaryKey' +option is ignored while creating the table, hudi will treat the table as a key less table. 
If user prefer to elect +primary keys for a given hudi table, they can do so by using 'primaryKey' option while creating the table in spark-sql. +4. `primaryKey`, `preCombineField`, and `type` are case-sensitive. +5. `preCombineField` is required for MOR tables. Generally 'event time' or some other similar column will be used for + ordering purpose. Hudi will be able to handle out of order data using the preCombine field value. +6. While setting `primaryKey`, `preCombineField`, `type` or other Hudi configs, `tblproperties` is preferred over `options`. +7. A new Hudi table created by Spark SQL will by default set `hoodie.datasource.write.hive_style_partitioning=true`. +::: :::note -1. Since Hudi 0.10.0, `primaryKey` is required. It aligns with Hudi DataSource writer’s and resolves behavioural - discrepancies reported in previous versions. Non-primary-key tables are no longer supported. Any Hudi table created - pre-0.10.0 without a `primaryKey` needs to be re-created with a `primaryKey` field with 0.10.0. -2. `primaryKey`, `preCombineField`, and `type` are case-sensitive. -3. `preCombineField` is required for MOR tables. -4. When set `primaryKey`, `preCombineField`, `type` or other Hudi configs, `tblproperties` is preferred over `options`. -5. A new Hudi table created by Spark SQL will by default set `hoodie.datasource.write.hive_style_partitioning=true`. +For the purpose of quick start guide, we will go with one table type (cow), partitioned table and external tables. For more +options, please refer to [SQL DDL](/docs/sql_ddl) and DML reference guide. ::: -**Create a Non-Partitioned Table** +**Create Table Properties** -```sql --- create a cow table, with primaryKey 'uuid' and without preCombineField provided -create table hudi_cow_nonpcf_tbl ( - uuid int, - name string, - price double -) using hudi -tblproperties ( - primaryKey = 'uuid' -); +Users can set table properties while creating a hudi table. Critical options are listed here. 
+| Parameter Name | Default | Introduction | +|------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| primaryKey | uuid | The primary key field names of the table, multiple fields separated by commas. Same as `hoodie.datasource.write.recordkey.field`. This can be ignored for a key less table. | +| preCombineField | | The pre-combine field of the table. Same as `hoodie.datasource.write.precombine.field` | +| type | cow | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. Same as `hoodie.datasource.write.table.type` | --- create a mor non-partitioned table with preCombineField provided -create table hudi_mor_tbl ( - id int, +To set any custom hudi config(like index type, max parquet size, etc), see the "Set hudi config section" . + + +Here is an example of creating an COW key less partitioned table. + +// bring in all CRUD to the top. Review Comment: remove this line ? ########## website/docs/quick-start-guide.md: ########## @@ -246,67 +246,86 @@ Spark SQL needs an explicit create table command. **Table Concepts** -- Table types +- **Table types** Both Hudi's table types, Copy-On-Write (COW) and Merge-On-Read (MOR), can be created using Spark SQL. While creating the table, table type can be specified using **type** option: **type = 'cow'** or **type = 'mor'**. -- Partitioned & Non-Partitioned tables +- **Partitioned & Non-Partitioned tables** Users can create a partitioned table or a non-partitioned table in Spark SQL. To create a partitioned table, one needs to use **partitioned by** statement to specify the partition columns to create a partitioned table. When there is no **partitioned by** statement with create table command, table is considered to be a non-partitioned table. 
-- Managed & External tables +- **Primary keyed table** - In general, Spark SQL supports two kinds of tables, namely managed and external. If one specifies a location using ** - location** statement or use `create external table` to create table explicitly, it is an external table, else its - considered a managed table. You can read more about external vs managed - tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/). + Optionally users can choose to create a Primary keyed table. When primary key is set for a given table, + Hudi ensures uniqueness during updates and deletes. Each record is uniquely identified by the primary key configuration. + If primary key is not set, Hudi treats it as key less table and every record ingested is treated as a new record even + if contents match. -*Read more in the [table management](/docs/table_management) guide.* +:::note +1. Since Hudi 0.14.0, users can create key less table or primary keyed table as per necessity. If 'primaryKey' +option is ignored while creating the table, hudi will treat the table as a key less table. If user prefer to elect +primary keys for a given hudi table, they can do so by using 'primaryKey' option while creating the table in spark-sql. +4. `primaryKey`, `preCombineField`, and `type` are case-sensitive. +5. `preCombineField` is required for MOR tables. Generally 'event time' or some other similar column will be used for + ordering purpose. Hudi will be able to handle out of order data using the preCombine field value. +6. While setting `primaryKey`, `preCombineField`, `type` or other Hudi configs, `tblproperties` is preferred over `options`. +7. A new Hudi table created by Spark SQL will by default set `hoodie.datasource.write.hive_style_partitioning=true`. +::: :::note -1. Since Hudi 0.10.0, `primaryKey` is required. It aligns with Hudi DataSource writer’s and resolves behavioural - discrepancies reported in previous versions. 
Non-primary-key tables are no longer supported. Any Hudi table created - pre-0.10.0 without a `primaryKey` needs to be re-created with a `primaryKey` field with 0.10.0. -2. `primaryKey`, `preCombineField`, and `type` are case-sensitive. -3. `preCombineField` is required for MOR tables. -4. When set `primaryKey`, `preCombineField`, `type` or other Hudi configs, `tblproperties` is preferred over `options`. -5. A new Hudi table created by Spark SQL will by default set `hoodie.datasource.write.hive_style_partitioning=true`. +For the purpose of quick start guide, we will go with one table type (cow), partitioned table and external tables. For more +options, please refer to [SQL DDL](/docs/sql_ddl) and DML reference guide. ::: -**Create a Non-Partitioned Table** +**Create Table Properties** -```sql --- create a cow table, with primaryKey 'uuid' and without preCombineField provided -create table hudi_cow_nonpcf_tbl ( - uuid int, - name string, - price double -) using hudi -tblproperties ( - primaryKey = 'uuid' -); +Users can set table properties while creating a hudi table. Critical options are listed here. +| Parameter Name | Default | Introduction | +|------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| primaryKey | uuid | The primary key field names of the table, multiple fields separated by commas. Same as `hoodie.datasource.write.recordkey.field`. This can be ignored for a key less table. | +| preCombineField | | The pre-combine field of the table. Same as `hoodie.datasource.write.precombine.field` | +| type | cow | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. 
Same as `hoodie.datasource.write.table.type` | --- create a mor non-partitioned table with preCombineField provided -create table hudi_mor_tbl ( - id int, +To set any custom hudi config(like index type, max parquet size, etc), see the "Set hudi config section" . + + +Here is an example of creating an COW key less partitioned table. + +// bring in all CRUD to the top. +// Review Comment: remove this line ?
