nsivabalan commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1253522263


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##########
@@ -514,6 +520,29 @@ object DataSourceWriteOptions {
 
   val RECONCILE_SCHEMA: ConfigProperty[java.lang.Boolean] = HoodieCommonConfig.RECONCILE_SCHEMA
 
+  val SQL_WRITE_OPERATION: ConfigProperty[String] = ConfigProperty
+    .key("hoodie.sql.write.operation")
+    .defaultValue("insert")
+    .withDocumentation("Sql write operation to use with the INSERT_INTO spark sql command. This comes with 3 possible values: bulk_insert, " +
+      "insert and upsert. bulk_insert is generally meant for initial loads and is known to be more performant than insert, but bulk_insert may not " +
+      "do small file management. If you prefer Hudi to automatically manage small files, then you can go with \"insert\". There is no precombine " +
+      "(if there are duplicates within the same batch being ingested, the same dups will be ingested) with bulk_insert and insert, and there is no index " +
+      "lookup either. If you use INSERT_INTO for a mutable dataset, then you may have to set this config value to \"upsert\". With upsert, you " +
+      "get precombine, and updates to existing records on storage are also honored. If not, you may see duplicates. ")
+
+  val NONE_INSERT_DUP_POLICY = "none"
+  val DROP_INSERT_DUP_POLICY = "drop"
+  val FAIL_INSERT_DUP_POLICY = "fail"
+
+  val INSERT_DUP_POLICY: ConfigProperty[String] = ConfigProperty
+    .key("hoodie.datasource.insert.dup.policy")

Review Comment:
   dedup kind of overlaps w/ combine.before.insert. 
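
   For context, a minimal sketch of how the two proposed configs might be exercised from a Spark SQL session (the table name and values are hypothetical, and the exact runtime behavior depends on how this PR wires the configs up):

   ```sql
   -- Hypothetical session; the config keys below come from this PR's diff.
   -- Pick the physical write operation used by INSERT INTO
   -- (possible values per the proposed docs: insert | bulk_insert | upsert).
   SET hoodie.sql.write.operation=upsert;

   -- Pick how duplicates in the incoming batch are handled
   -- (constants in the diff: none | drop | fail).
   SET hoodie.datasource.insert.dup.policy=drop;

   INSERT INTO hudi_tbl VALUES (1, 'a', 10.0);
   ```

   The review concern above applies to the second config: a "drop"-style dup policy overlaps in intent with the existing `hoodie.combine.before.insert` behavior.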


