nsivabalan commented on code in PR #6675:
URL: https://github.com/apache/hudi/pull/6675#discussion_r972363764


##########
website/docs/quick-start-guide.md:
##########
@@ -627,22 +665,24 @@ spark.read.
 <TabItem value="python">
 
 ```python
-#pyspark
-spark.read. \
-  format("hudi"). \
-  option("as.of.instant", "20210728141108"). \
-  load(basePath)
+# pyspark
+spark.read.
+format("hudi").
+option("as.of.instant", "20210728141108").
+load(basePath)
 
-spark.read. \
-  format("hudi"). \
-  option("as.of.instant", "2021-07-28 14: 11: 08"). \
-  load(basePath)
+spark.read.
+format("hudi").
+option("as.of.instant", "2021-07-28 14: 11: 08").
+load(basePath)
 
-// It is equal to "as.of.instant = 2021-07-28 00:00:00"
-spark.read. \
-  format("hudi"). \
-  option("as.of.instant", "2021-07-28"). \
-  load(basePath)
+// It is equal

Review Comment:
   Can we have the comments on a single line? (L679 to L681)



##########
website/docs/quick-start-guide.md:
##########
@@ -803,30 +859,37 @@ when not matched then
 
 ```python
 # pyspark
+snapshotBeforeUpdate = spark.sql(snapshotQuery)
 updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
 df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
-df.write.format("hudi"). \
-  options(**hudi_options). \
-  mode("append"). \
-  save(basePath)
+df.write.format("hudi").
+options(**hudi_options).
+mode("append").
+save(basePath)
+assert spark.sql(snapshotQuery).intersect(df).count() == df.count()

Review Comment:
   Let's assign spark.sql(snapshotQuery) to a variable and cache it; otherwise we might keep re-triggering the actions.



##########
website/docs/quick-start-guide.md:
##########
@@ -866,27 +929,30 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hu
 ```python
 # pyspark
 # reload data
-spark. \
-  read. \
-  format("hudi"). \
-  load(basePath). \
-  createOrReplaceTempView("hudi_trips_snapshot")
+spark.
+read.
+format("hudi").
+load(basePath).
+createOrReplaceTempView("hudi_trips_snapshot")
 
-commits = list(map(lambda row: row[0], spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_trips_snapshot order by commitTime").limit(50).collect()))
-beginTime = commits[len(commits) - 2] # commit time we are interested in
commits = list(map(lambda row: row[0], spark.sql(
+    "select distinct(_hoodie_commit_time) as commitTime from  hudi_trips_snapshot order by commitTime").limit(
+    50).collect()))
+beginTime = commits[len(commits) - 2]  # commit time we are interested in
 
 # incrementally query data
 incremental_read_options = {
-  'hoodie.datasource.query.type': 'incremental',
-  'hoodie.datasource.read.begin.instanttime': beginTime,
+    'hoodie.datasource.query.type': 'incremental',
+    'hoodie.datasource.read.begin.instanttime': beginTime,
 }
 
-tripsIncrementalDF = spark.read.format("hudi"). \
-  options(**incremental_read_options). \
-  load(basePath)
+tripsIncrementalDF = spark.read.format("hudi").
+options(**incremental_read_options).
+load(basePath)
 tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
 
-spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare > 20.0").show()
+spark.sql(
+    "select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare > 20.0").show()

Review Comment:
   Why are there no assertions for the incremental query?



##########
website/docs/quick-start-guide.md:
##########
@@ -934,22 +1001,23 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hud
 
 ```python
 # pyspark
-beginTime = "000" # Represents all commits > this time.
+beginTime = "000"  # Represents all commits > this time.
 endTime = commits[len(commits) - 2]
 
 # query point in time data
 point_in_time_read_options = {
-  'hoodie.datasource.query.type': 'incremental',
-  'hoodie.datasource.read.end.instanttime': endTime,
-  'hoodie.datasource.read.begin.instanttime': beginTime
+    'hoodie.datasource.query.type': 'incremental',
+    'hoodie.datasource.read.end.instanttime': endTime,
+    'hoodie.datasource.read.begin.instanttime': beginTime
 }
 
-tripsPointInTimeDF = spark.read.format("hudi"). \
-  options(**point_in_time_read_options). \
-  load(basePath)
+tripsPointInTimeDF = spark.read.format("hudi").
+options(**point_in_time_read_options).
+load(basePath)

Review Comment:
   Let's add validations for the point-in-time query as well.



##########
website/versioned_docs/version-0.12.0/quick-start-guide.md:
##########
@@ -14,7 +14,8 @@ data both snapshot and incrementally.
 
 ## Setup
 
-Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [here](https://spark.apache.org/downloads) for setting up Spark.
+Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow

Review Comment:
   I am assuming this file has the same changes as the previous one; please address all of the comments above in this file as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
