[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1594: [HUDI-862] Migrate HUDI site blogs from Confluence to Jeykll.

GitBox Tue, 05 May 2020 08:04:36 -0700


vinothchandar commented on a change in pull request #1594:
URL: https://github.com/apache/incubator-hudi/pull/1594#discussion_r420177909




##########
File path: docs/_posts/2020-01-15-delete-support-in-hudi.md
##########
@@ -0,0 +1,156 @@
+---
+title: "Delete support in Hudi"
+excerpt: "Deletes are supported at a record level in Hudi with 0.5.1 release. This blog is a “how to” blog on how to delete records in hudi."
+author: shivnarayan
+---
+
+Deletes are supported at a record level in Hudi with 0.5.1 release. This blog is a "how to" blog on how to delete records in hudi. Deletes can be done with 3 flavors: Hudi RDD APIs, with Spark data source and with DeltaStreamer.
+
+### Delete using RDD Level APIs
+
+If you have embedded  _HoodieWriteClient_ , then deletion is as simple as passing in a  _JavaRDD<HoodieKey>_ to the delete api.
+
+    // Fetch list of HoodieKeys from elsewhere that needs to be deleted
+    // convert to JavaRDD if required. JavaRDD<HoodieKey> toBeDeletedKeys
+    List<WriteStatus> statuses = writeClient.delete(toBeDeletedKeys, commitTime);
+
+### Deletion with Datasource
+
+Now we will walk through an example of how to perform deletes on a sample dataset using the Datasource API. Quick Start has the same example as below. Feel free to check it out.
+
+**Step 1** : Launch spark shell
+
+    bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle:0.5.1-incubating \
+        --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+
+**Step 2** : Import as required and set up table name, etc for sample dataset
+
+    import org.apache.hudi.QuickstartUtils._
+    import scala.collection.JavaConversions._
+    import org.apache.spark.sql.SaveMode._
+    import org.apache.hudi.DataSourceReadOptions._
+    import org.apache.hudi.DataSourceWriteOptions._
+    import org.apache.hudi.config.HoodieWriteConfig._
+     
+    val tableName = "hudi_cow_table"
+    val basePath = "file:///tmp/hudi_cow_table"
+    val dataGen = new DataGenerator
+
+**Step 3** : Insert data. Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi dataset as below.
+
+    val inserts = convertToStringList(dataGen.generateInserts(10))
+    val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+    df.write.format("org.apache.hudi").
+        options(getQuickstartWriteConfigs).
+        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+        option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+        option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+        option(TABLE_NAME, tableName).
+        mode(Overwrite).
+        save(basePath);
+
+**Step 4** : Query data. Load the data files into a DataFrame.
+
+    val roViewDF = spark.
+        read.
+        format("org.apache.hudi").
+        load(basePath + "/*/*/*/*")
+    roViewDF.createOrReplaceTempView("hudi_ro_table")
+    spark.sql("select count(*) from hudi_ro_table").show() // should return 10 (number of records inserted above)
+    val riderValue = spark.sql("select distinct rider from hudi_ro_table").show()
+    // copy the value displayed to be used in next step
+
+**Step 5** : Fetch records that needs to be deleted, with the above rider value. This example is just to illustrate how to delete. In real world, use a select query using spark sql to fetch records that needs to be deleted and from the result we could invoke deletes as given below. Example rider value used is "rider-213".
+
+    val df = spark.sql(``"select uuid, partitionPath from hudi_ro_table where rider = 'rider-213'"``)
+
+// Replace the above query with any other query that will fetch records to be deleted.
+
+**Step 6** : Issue deletes
+
+    val deletes = dataGen.generateDeletes(df.collectAsList())

Review comment:
       could we get code markup for the code blocks? 
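       For instance, Steps 5 and 6 could be fenced roughly as below. This is only a sketch: the `scala` language tag is an assumption about how the site highlights code, and the snippet relies on the spark-shell session (`spark`, `dataGen`, `hudi_ro_table`) set up in Steps 1-4 of the post.

```scala
// Step 5: select the record key and partition path of the rows to delete
// (query copied from the post; 'rider-213' is the example rider value)
val df = spark.sql("select uuid, partitionPath from hudi_ro_table where rider = 'rider-213'")

// Step 6: build the delete records from the selected rows
val deletes = dataGen.generateDeletes(df.collectAsList())
```

       Fencing would also drop the stray double backticks currently wrapped around the query string in Step 5.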

##########
File path: docs/_posts/2016-12-30-strata-talk-2017.md
##########
@@ -1,8 +1,7 @@
 ---
 title:  "Connect with us at Strata San Jose March 2017"
+author: admin

Review comment:
       nice touch :) 

##########
File path: docs/_posts/2020-01-15-delete-support-in-hudi.md
##########
@@ -0,0 +1,156 @@
+---
+title: "Delete support in Hudi"
+excerpt: "Deletes are supported at a record level in Hudi with 0.5.1 release. This blog is a “how to” blog on how to delete records in hudi."
+author: shivnarayan
+---
+
+Deletes are supported at a record level in Hudi with 0.5.1 release. This blog is a "how to" blog on how to delete records in hudi. Deletes can be done with 3 flavors: Hudi RDD APIs, with Spark data source and with DeltaStreamer.
+
+### Delete using RDD Level APIs
+
+If you have embedded  _HoodieWriteClient_ , then deletion is as simple as passing in a  _JavaRDD<HoodieKey>_ to the delete api.
+
+    // Fetch list of HoodieKeys from elsewhere that needs to be deleted
+    // convert to JavaRDD if required. JavaRDD<HoodieKey> toBeDeletedKeys
+    List<WriteStatus> statuses = writeClient.delete(toBeDeletedKeys, commitTime);
+
+### Deletion with Datasource
+
+Now we will walk through an example of how to perform deletes on a sample dataset using the Datasource API. Quick Start has the same example as below. Feel free to check it out.
+
+**Step 1** : Launch spark shell
+
+    bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle:0.5.1-incubating \
+        --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+
+**Step 2** : Import as required and set up table name, etc for sample dataset
+
+    import org.apache.hudi.QuickstartUtils._
+    import scala.collection.JavaConversions._
+    import org.apache.spark.sql.SaveMode._
+    import org.apache.hudi.DataSourceReadOptions._
+    import org.apache.hudi.DataSourceWriteOptions._
+    import org.apache.hudi.config.HoodieWriteConfig._
+     
+    val tableName = "hudi_cow_table"
+    val basePath = "file:///tmp/hudi_cow_table"
+    val dataGen = new DataGenerator
+
+**Step 3** : Insert data. Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi dataset as below.
+
+    val inserts = convertToStringList(dataGen.generateInserts(10))
+    val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+    df.write.format("org.apache.hudi").
+        options(getQuickstartWriteConfigs).
+        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+        option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+        option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+        option(TABLE_NAME, tableName).
+        mode(Overwrite).
+        save(basePath);
+
+**Step 4** : Query data. Load the data files into a DataFrame.
+
+    val roViewDF = spark.
+        read.
+        format("org.apache.hudi").
+        load(basePath + "/*/*/*/*")
+    roViewDF.createOrReplaceTempView("hudi_ro_table")
+    spark.sql("select count(*) from hudi_ro_table").show() // should return 10 (number of records inserted above)
+    val riderValue = spark.sql("select distinct rider from hudi_ro_table").show()
+    // copy the value displayed to be used in next step
+
+**Step 5** : Fetch records that needs to be deleted, with the above rider value. This example is just to illustrate how to delete. In real world, use a select query using spark sql to fetch records that needs to be deleted and from the result we could invoke deletes as given below. Example rider value used is "rider-213".
+
+    val df = spark.sql(``"select uuid, partitionPath from hudi_ro_table where rider = 'rider-213'"``)
+
+// Replace the above query with any other query that will fetch records to be deleted.
+
+**Step 6** : Issue deletes
+
+    val deletes = dataGen.generateDeletes(df.collectAsList())

Review comment:
       
https://github.com/apache/incubator-hudi/blame/asf-site/docs/_docs/1_1_quick_start_guide.md#L19
  similar to the other pages.. ? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
