jonvex commented on code in PR #8875:
URL: https://github.com/apache/hudi/pull/8875#discussion_r1218286181
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##########
@@ -89,6 +89,62 @@ trait ProvidesHoodieConfig extends Logging {
defaultOpts = defaultOpts, overridingOpts = overridingOpts)
}
+ /**
+ * Determine the insert operation: use bulk insert when the configs allow it,
+ * otherwise fall back to deduceOperation.
+ */
+ private def getOperation(isPartitionedTable: Boolean,
+ isOverwritePartition: Boolean,
+ isOverwriteTable: Boolean,
+ insertModeSet: Boolean,
+ dropDuplicate: Option[String],
+ enableBulkInsert: Option[String],
+ isInsertInto: Boolean,
+ isNonStrictMode: Boolean,
+ hasPrecombineColumn: Boolean): String = {
+ val notSetToNonStrict = !insertModeSet || isNonStrictMode
+ // If these options are unset, default to values compatible with bulk insert
+ (isInsertInto, notSetToNonStrict, enableBulkInsert.getOrElse("true"),
+   dropDuplicate.getOrElse("false"), isOverwritePartition, isPartitionedTable) match {
+   case (true, true, "true", "false", false, _) => BULK_INSERT_OPERATION_OPT_VAL
Review Comment:
Consider the case where the user sets only "hoodie.sql.bulk.insert.enable". With
your suggestion, we would not end up using bulk insert, because the default of
"hoodie.sql.insert.mode" is "upsert". Given that the config's documentation reads
"When set to true, the sql insert statement will use bulk insert.", I think such
a user intends to use bulk insert. The way I made it work is that we assume the
user wants bulk insert until a config is set that is incompatible with bulk
insert; in that situation, we fall back to the original logic.
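A minimal sketch of the precedence I'm describing (names and structure are illustrative, not the actual Hudi implementation): unset configs default to bulk-insert-friendly values, and only an explicitly set incompatible config forces the fallback path.

```scala
// Hypothetical sketch, not the real ProvidesHoodieConfig code:
// assume bulk insert unless an explicitly set config rules it out.
object InsertOperationSketch {
  val BulkInsert = "bulk_insert"
  val FallbackToDeduce = "deduce"

  def chooseOperation(insertModeSet: Boolean,
                      isNonStrictMode: Boolean,
                      enableBulkInsert: Option[String],
                      dropDuplicate: Option[String]): String = {
    // An unset "hoodie.sql.insert.mode" should not veto bulk insert;
    // only an explicitly set mode other than non-strict does.
    val modeAllowsBulk = !insertModeSet || isNonStrictMode
    // Unset configs default to the bulk-insert-compatible value.
    val bulkEnabled = enableBulkInsert.getOrElse("true") == "true"
    val dedup = dropDuplicate.getOrElse("false") == "true"
    if (modeAllowsBulk && bulkEnabled && !dedup) BulkInsert
    else FallbackToDeduce
  }
}
```

With nothing set, this picks bulk insert; explicitly setting an incompatible mode (e.g. a strict insert mode) flips it to the fallback.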
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]