[GitHub] [hudi] bhasudha commented on a diff in pull request #9622: [HUDI-6851] Fixing Spark quick start guide

via GitHub Wed, 13 Sep 2023 05:57:15 -0700


bhasudha commented on code in PR #9622:
URL: https://github.com/apache/hudi/pull/9622#discussion_r1324472856



##########
website/docs/quick-start-guide.md:
##########
@@ -246,67 +246,86 @@ Spark SQL needs an explicit create table command.
 
 **Table Concepts**
 
-- Table types
+- **Table types**
 
   Both Hudi's table types, Copy-On-Write (COW) and Merge-On-Read (MOR), can be created using Spark SQL.
   While creating the table, table type can be specified using **type** option: **type = 'cow'** or **type = 'mor'**.
 
-- Partitioned & Non-Partitioned tables
+- **Partitioned & Non-Partitioned tables**
 
   Users can create a partitioned table or a non-partitioned table in Spark SQL. To create a partitioned table, one needs
   to use **partitioned by** statement to specify the partition columns to create a partitioned table. When there is
   no **partitioned by** statement with create table command, table is considered to be a non-partitioned table.
 
-- Managed & External tables
+- **Primary keyed table**
 
-  In general, Spark SQL supports two kinds of tables, namely managed and external. If one specifies a location using **
-  location** statement or use `create external table` to create table explicitly, it is an external table, else its
-  considered a managed table. You can read more about external vs managed
-  tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/).
+  Optionally users can choose to create a Primary keyed table. When primary key is set for a given table,
+  Hudi ensures uniqueness during updates and deletes. Each record is uniquely identified by the primary key configuration.
+  If primary key is not set, Hudi treats it as key less table and every record ingested is treated as a new record even
+  if contents match.
 
-*Read more in the [table management](/docs/table_management) guide.*
+:::note
+1. Since Hudi 0.14.0, users can create key less table or primary keyed table as per necessity. If 'primaryKey'
+option is ignored while creating the table, hudi will treat the table as a key less table. If user prefer to elect
+primary keys for a given hudi table, they can do so by using 'primaryKey' option while creating the table in spark-sql.
+4. `primaryKey`, `preCombineField`, and `type` are case-sensitive.
+5. `preCombineField` is required for MOR tables. Generally 'event time' or some other similar column will be used for
+   ordering purpose. Hudi will be able to handle out of order data using the preCombine field value.
+6. While setting `primaryKey`, `preCombineField`, `type` or other Hudi configs, `tblproperties` is preferred over `options`.
+7. A new Hudi table created by Spark SQL will by default set `hoodie.datasource.write.hive_style_partitioning=true`.
+:::
 
 :::note
-1. Since Hudi 0.10.0, `primaryKey` is required. It aligns with Hudi DataSource writer’s and resolves behavioural
-   discrepancies reported in previous versions. Non-primary-key tables are no longer supported. Any Hudi table created
-   pre-0.10.0 without a `primaryKey` needs to be re-created with a `primaryKey` field with 0.10.0.
-2. `primaryKey`, `preCombineField`, and `type` are case-sensitive.
-3. `preCombineField` is required for MOR tables.
-4. When set `primaryKey`, `preCombineField`, `type` or other Hudi configs, `tblproperties` is preferred over `options`.
-5. A new Hudi table created by Spark SQL will by default set `hoodie.datasource.write.hive_style_partitioning=true`.
+For the purpose of quick start guide, we will go with one table type (cow), partitioned table and external tables. For more
+options, please refer to [SQL DDL](/docs/sql_ddl) and DML reference guide.
 :::
 
-**Create a Non-Partitioned Table**
+**Create Table Properties**
 
-```sql
--- create a cow table, with primaryKey 'uuid' and without preCombineField provided
-create table hudi_cow_nonpcf_tbl (
-  uuid int,
-  name string,
-  price double
-) using hudi
-tblproperties (
-  primaryKey = 'uuid'
-);
+Users can set table properties while creating a hudi table. Critical options are listed here.
 
+| Parameter Name | Default | Introduction |
+|------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| primaryKey | uuid | The primary key field names of the table, multiple fields separated by commas. Same as `hoodie.datasource.write.recordkey.field`. This can be ignored for a key less table. |
+| preCombineField |  | The pre-combine field of the table. Same as `hoodie.datasource.write.precombine.field` |
+| type       | cow | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. Same as `hoodie.datasource.write.table.type` |
 
--- create a mor non-partitioned table with preCombineField provided
-create table hudi_mor_tbl (
-  id int,
+To set any custom hudi config(like index type, max parquet size, etc), see the "Set hudi config section" .
+
+
+Here is an example of creating an COW key less partitioned table.
+
+// bring in all CRUD to the top. 

Review Comment:
   remove this line ?



##########
website/docs/quick-start-guide.md:
##########
@@ -246,67 +246,86 @@ Spark SQL needs an explicit create table command.
 
 **Table Concepts**
 
-- Table types
+- **Table types**
 
   Both Hudi's table types, Copy-On-Write (COW) and Merge-On-Read (MOR), can be created using Spark SQL.
   While creating the table, table type can be specified using **type** option: **type = 'cow'** or **type = 'mor'**.
 
-- Partitioned & Non-Partitioned tables
+- **Partitioned & Non-Partitioned tables**
 
   Users can create a partitioned table or a non-partitioned table in Spark SQL. To create a partitioned table, one needs
   to use **partitioned by** statement to specify the partition columns to create a partitioned table. When there is
   no **partitioned by** statement with create table command, table is considered to be a non-partitioned table.
 
-- Managed & External tables
+- **Primary keyed table**
 
-  In general, Spark SQL supports two kinds of tables, namely managed and external. If one specifies a location using **
-  location** statement or use `create external table` to create table explicitly, it is an external table, else its
-  considered a managed table. You can read more about external vs managed
-  tables [here](https://sparkbyexamples.com/apache-hive/difference-between-hive-internal-tables-and-external-tables/).
+  Optionally users can choose to create a Primary keyed table. When primary key is set for a given table,
+  Hudi ensures uniqueness during updates and deletes. Each record is uniquely identified by the primary key configuration.
+  If primary key is not set, Hudi treats it as key less table and every record ingested is treated as a new record even
+  if contents match.
 
-*Read more in the [table management](/docs/table_management) guide.*
+:::note
+1. Since Hudi 0.14.0, users can create key less table or primary keyed table as per necessity. If 'primaryKey'
+option is ignored while creating the table, hudi will treat the table as a key less table. If user prefer to elect
+primary keys for a given hudi table, they can do so by using 'primaryKey' option while creating the table in spark-sql.
+4. `primaryKey`, `preCombineField`, and `type` are case-sensitive.
+5. `preCombineField` is required for MOR tables. Generally 'event time' or some other similar column will be used for
+   ordering purpose. Hudi will be able to handle out of order data using the preCombine field value.
+6. While setting `primaryKey`, `preCombineField`, `type` or other Hudi configs, `tblproperties` is preferred over `options`.
+7. A new Hudi table created by Spark SQL will by default set `hoodie.datasource.write.hive_style_partitioning=true`.
+:::
 
 :::note
-1. Since Hudi 0.10.0, `primaryKey` is required. It aligns with Hudi DataSource writer’s and resolves behavioural
-   discrepancies reported in previous versions. Non-primary-key tables are no longer supported. Any Hudi table created
-   pre-0.10.0 without a `primaryKey` needs to be re-created with a `primaryKey` field with 0.10.0.
-2. `primaryKey`, `preCombineField`, and `type` are case-sensitive.
-3. `preCombineField` is required for MOR tables.
-4. When set `primaryKey`, `preCombineField`, `type` or other Hudi configs, `tblproperties` is preferred over `options`.
-5. A new Hudi table created by Spark SQL will by default set `hoodie.datasource.write.hive_style_partitioning=true`.
+For the purpose of quick start guide, we will go with one table type (cow), partitioned table and external tables. For more
+options, please refer to [SQL DDL](/docs/sql_ddl) and DML reference guide.
 :::
 
-**Create a Non-Partitioned Table**
+**Create Table Properties**
 
-```sql
--- create a cow table, with primaryKey 'uuid' and without preCombineField provided
-create table hudi_cow_nonpcf_tbl (
-  uuid int,
-  name string,
-  price double
-) using hudi
-tblproperties (
-  primaryKey = 'uuid'
-);
+Users can set table properties while creating a hudi table. Critical options are listed here.
 
+| Parameter Name | Default | Introduction |
+|------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| primaryKey | uuid | The primary key field names of the table, multiple fields separated by commas. Same as `hoodie.datasource.write.recordkey.field`. This can be ignored for a key less table. |
+| preCombineField |  | The pre-combine field of the table. Same as `hoodie.datasource.write.precombine.field` |
+| type       | cow | The table type to create. type = 'cow' means a COPY-ON-WRITE table, while type = 'mor' means a MERGE-ON-READ table. Same as `hoodie.datasource.write.table.type` |
 
--- create a mor non-partitioned table with preCombineField provided
-create table hudi_mor_tbl (
-  id int,
+To set any custom hudi config(like index type, max parquet size, etc), see the "Set hudi config section" .
+
+
+Here is an example of creating an COW key less partitioned table.
+
+// bring in all CRUD to the top. 
+// 

Review Comment:
   remove this line ?
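
For reference, the keyless partitioned COW example that the quoted section introduces might look roughly like the sketch below. It is not taken from the PR: the table name `hudi_table`, the columns, and the partition column `city` are assumptions chosen purely for illustration. Leaving out `primaryKey` is what would make the table keyless (Hudi 0.14.0 and later, per the note in the diff), and `type = 'cow'` only restates the default listed in the properties table.

```sql
-- Hypothetical keyless, partitioned COW table; all names are illustrative.
create table hudi_table (
  ts bigint,
  uuid string,
  rider string,
  fare double,
  city string
) using hudi
tblproperties (
  type = 'cow'  -- no primaryKey is set, so the table is treated as keyless
)
partitioned by (city);
```

Adding `primaryKey = 'uuid'` to `tblproperties` would turn the same sketch into a primary keyed table.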



