[GitHub] [hudi] vinothchandar commented on a diff in pull request #9712: [HUDI-6851] Fixing Spark quick start guide

via GitHub Sat, 16 Sep 2023 10:00:12 -0700


vinothchandar commented on code in PR #9712:
URL: https://github.com/apache/hudi/pull/9712#discussion_r1327983115



##########
website/docs/quick-start-guide.md:
##########
@@ -7,10 +7,8 @@ last_modified_at: 2023-08-23T21:14:52+09:00
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
 
-This guide provides a quick peek at Hudi's capabilities using spark-shell. Using Spark datasources, we will walk through
-code snippets that allows you to insert and update a Hudi table of default table type:
-[Copy on Write](/docs/table_types#copy-on-write-table). After each write operation we will also show how to read the
-data both snapshot and incrementally.
+This guide provides a quick peek at Hudi's capabilities using spark. Using Spark datasources, pyspark and Spark SQL,

Review Comment:
   "datasources" ?
   



##########
website/docs/quick-start-guide.md:
##########
@@ -73,6 +65,13 @@ spark-shell \
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
 ```
 ```shell
+# Spark 3.0

Review Comment:
   Why is this Spark 3.0, when the default build is Spark 3.3?



##########
website/docs/quick-start-guide.md:
##########
@@ -216,9 +237,33 @@ can generate sample inserts and updates based on the the sample trip schema [her
 
 ## Create Table
 
+Before we go further, few terminologies to familiarize: 
+
+- **Table types**

Review Comment:
   We don't need anything like this here. Let's just assume a table type and go
   from there. Let's pick MoR vs CoW based on which can showcase breadth most
   easily; same for partitioned vs non-partitioned.



##########
website/docs/quick-start-guide.md:
##########
@@ -216,9 +237,33 @@ can generate sample inserts and updates based on the the sample trip schema [her
 
 ## Create Table
 
+Before we go further, let us go over few terminologies: 
+
+- **Table types**
+
+  Hudi supports two different table types, namely Copy-On-Write (COW) and Merge-On-Read (MOR). Users can choose either
+of these table types depending on their workload and SLA requirements. You can read more about different
+  table types [here](/docs/next/table_types/).
+
+- **Partitioned & Non-Partitioned tables**
+
+  Users can create a partitioned table or a non-partitioned table with Apache Hudi. Partitioning can help with
+  reducing query run times. For quick start purpose, we will go with partitioned table.
+
+- **Primary key and Hudi table**
+
+  Optionally users can choose to create a Primary keyed table. When primary key is set for a given table,
+  Hudi ensures uniqueness during updates and deletes. Each record is uniquely identified by the primary key configuration.
+  If primary key is not set, Hudi treats it as key less table and every record ingested is treated as a new record even
+  if contents match. Such keyless tables are supported from Hudi 0.14.0.

Review Comment:
   and this line talks about 0.14.0 vs 0.13.0?



##########
website/docs/sql_ddl.md:
##########
@@ -0,0 +1,408 @@
+---
+title: SQL DDL
+summary: "In this page, we introduce how to create tables with Hudi."

Review Comment:
   This page is specific to Spark, right? It may be confusing to readers if we
   don't qualify all of that.



##########
website/docs/quick-start-guide.md:
##########
@@ -73,6 +65,13 @@ spark-shell \
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
 ```
 ```shell
+# Spark 3.0
+spark-shell \
+  --packages org.apache.hudi:hudi-spark3.0-bundle_2.12:0.13.0 \

Review Comment:
   Why does this have 0.13.0?
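
To make the two version questions on this snippet concrete: the bundle coordinate in the diff encodes both the Spark line (`3.0`) and the Hudi release (`0.13.0`). The sketch below is illustrative only and not part of the PR. The helper names are hypothetical, the naming pattern is inferred from the single coordinate shown in the diff, and whether a given Spark/Hudi combination is actually published is an assumption to check against the release notes. The `3.3` and `0.14.0` values are the ones mentioned elsewhere in this review.

```python
# Hypothetical helper: builds a Hudi Spark bundle coordinate by following the
# naming pattern visible in the diff (hudi-spark3.0-bundle_2.12:0.13.0).
def hudi_spark_bundle(spark_version: str, hudi_version: str,
                      scala_version: str = "2.12") -> str:
    # Only the major.minor part of the Spark version appears in the artifact name.
    major_minor = ".".join(spark_version.split(".")[:2])
    return (f"org.apache.hudi:hudi-spark{major_minor}"
            f"-bundle_{scala_version}:{hudi_version}")


def spark_shell_command(spark_version: str, hudi_version: str) -> str:
    # The --conf value is copied from the snippet under review.
    return " \\\n".join([
        "spark-shell",
        f"  --packages {hudi_spark_bundle(spark_version, hudi_version)}",
        "  --conf 'spark.sql.extensions="
        "org.apache.spark.sql.hudi.HoodieSparkSessionExtension'",
    ])


if __name__ == "__main__":
    # Assumed combination: the default Spark build and the release named in the docs.
    print(spark_shell_command("3.3", "0.14.0"))
```

Generating the command from the two versions keeps the Spark line and the Hudi release in one place, so the quick start cannot drift between them.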



##########
website/docs/sql_ddl.md:
##########
@@ -0,0 +1,408 @@
+---
+title: SQL DDL
+summary: "In this page, we introduce how to create tables with Hudi."
+toc: true
+last_modified_at: 
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+The following are SparkSQL DDL actions available:
+
+# Spark Create Table
+:::note
+Only SparkSQL needs an explicit Create Table command. No Create Table command is required in Spark when using Scala or

Review Comment:
   What does "Scala or Python" mean? Can we be specific about the Spark APIs,
   e.g. "DataSource APIs (Batch or Streaming) via Scala or Python"?



##########
website/docs/quick-start-guide.md:
##########
@@ -216,9 +237,33 @@ can generate sample inserts and updates based on the the sample trip schema [her
 
 ## Create Table
 
+Before we go further, let us go over few terminologies: 
+
+- **Table types**
+
+  Hudi supports two different table types, namely Copy-On-Write (COW) and Merge-On-Read (MOR). Users can choose either
+of these table types depending on their workload and SLA requirements. You can read more about different
+  table types [here](/docs/next/table_types/).
+
+- **Partitioned & Non-Partitioned tables**
+
+  Users can create a partitioned table or a non-partitioned table with Apache Hudi. Partitioning can help with
+  reducing query run times. For quick start purpose, we will go with partitioned table.
+
+- **Primary key and Hudi table**
+
+  Optionally users can choose to create a Primary keyed table. When primary key is set for a given table,

Review Comment:
   Let's not create more artificial distinctions for the tables here. Hudi
   tables always have a key; it's a matter of whether it's user-defined or
   system-generated.
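
One way to picture the point about keys, as a rough sketch and not part of the PR: the table is the same either way, and the only difference is whether the writer names a record key field. The helper name `hudi_write_options` is hypothetical. The option keys are the Hudi datasource write configs as best recalled here and should be checked against the Hudi configuration reference. The system-generated-key behaviour when no key is supplied is taken from the diff text, which says it is supported from Hudi 0.14.0.

```python
from typing import Dict, Optional


def hudi_write_options(table_name: str,
                       record_key: Optional[str] = None,
                       partition_field: Optional[str] = None,
                       table_type: str = "COPY_ON_WRITE") -> Dict[str, str]:
    """Hypothetical helper that assembles options for a Hudi datasource write."""
    opts = {
        "hoodie.table.name": table_name,
        # COPY_ON_WRITE or MERGE_ON_READ
        "hoodie.datasource.write.table.type": table_type,
    }
    if record_key is not None:
        # User-defined key: records are identified by this field.
        opts["hoodie.datasource.write.recordkey.field"] = record_key
    # If no record key is given, nothing is set here; per the diff under
    # review, Hudi 0.14.0+ is then expected to generate keys itself.
    if partition_field is not None:
        opts["hoodie.datasource.write.partitionpath.field"] = partition_field
    return opts


# Same table shape, two ways of obtaining a key (field names are made up).
user_keyed = hudi_write_options("trips", record_key="uuid", partition_field="city")
system_keyed = hudi_write_options("trips", partition_field="city")
```

With PySpark, such a dictionary would typically be passed as `df.write.format("hudi").options(**user_keyed)`, which is why the sketch keeps it as plain string pairs.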



