[GitHub] [hudi] nsivabalan commented on a diff in pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases

via GitHub Thu, 03 Aug 2023 07:38:26 -0700


nsivabalan commented on code in PR #8697:
URL: https://github.com/apache/hudi/pull/8697#discussion_r1283304161



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -429,6 +416,40 @@ object HoodieSparkSqlWriter {
     }
   }
 
+  def deduceOperation(hoodieConfig: HoodieConfig, paramsWithoutDefaults : Map[String, String]): WriteOperationType = {
+    var operation = WriteOperationType.fromValue(hoodieConfig.getString(OPERATION))
+    // TODO clean up
+    // It does not make sense to allow upsert() operation if INSERT_DROP_DUPS is true
+    // Auto-correct the operation to "insert" if OPERATION is set to "upsert" wrongly
+    // or not set (in which case it will be set as "upsert" by parametersWithWriteDefaults()) .
+    if (hoodieConfig.getBoolean(INSERT_DROP_DUPS) &&
+      operation == WriteOperationType.UPSERT) {
+
+      log.warn(s"$UPSERT_OPERATION_OPT_VAL is not applicable " +
+        s"when $INSERT_DROP_DUPS is set to be true, " +
+        s"overriding the $OPERATION to be $INSERT_OPERATION_OPT_VAL")
+
+      operation = WriteOperationType.INSERT
+      operation
+    } else {
+      // if no record key, no preCombine, we should treat it as append only workload
+      // and make bulk_insert as operation type.
+      if (!paramsWithoutDefaults.containsKey(DataSourceWriteOptions.RECORDKEY_FIELD.key())
+        && !paramsWithoutDefaults.containsKey(DataSourceWriteOptions.PRECOMBINE_FIELD.key())
+        && !paramsWithoutDefaults.containsKey(OPERATION.key())) {
+        log.warn(s"Choosing BULK_INSERT as the operation type since auto record key generation is applicable")
+        operation = WriteOperationType.BULK_INSERT
+      }
+      // if no record key is set, will switch the default operation to INSERT (auto record key gen)
+      else if (!hoodieConfig.contains(DataSourceWriteOptions.RECORDKEY_FIELD.key())
+        && !paramsWithoutDefaults.containsKey(OPERATION.key())) {
+        log.warn(s"Choosing INSERT as the operation type since auto record key generation is applicable")
+        operation = WriteOperationType.INSERT

Review Comment:
   hey @xushiyan : I do hear your point. Here we are mainly simplifying things 
for a user who is moving from writing to a parquet table to writing to a hudi 
table. That's why we are looking only at the mandatory fields like record key 
and precombine. Can we go ahead and land this for 0.14.0? 
   We can continue our discussion on how to simplify further. Very likely, if 
someone is setting file sizing, they know hudi to some extent and are not just 
trying to replace df.write.parquet with df.write.hudi. We can jot down a few 
more use-cases and come up with a holistic plan. 
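
   To make the rule being discussed concrete, here is a minimal standalone 
sketch of the decision order in the diff above. It is not the PR's code: it 
uses a plain Scala `Map` and a local `Op` type instead of `HoodieConfig` and 
`WriteOperationType`, and the names `DeduceOperationSketch`, `Op`, `parse` and 
`deduce` are hypothetical. The option keys are assumed to match the Spark 
datasource write options, and the sketch checks only the user-supplied params, 
whereas the diff checks `hoodieConfig` for the record key in the second branch.

```scala
object DeduceOperationSketch {
  // Local stand-in for WriteOperationType (hypothetical, for illustration only).
  sealed trait Op
  case object Upsert extends Op
  case object Insert extends Op
  case object BulkInsert extends Op

  // Option keys, assumed to match the Spark datasource write options.
  val RecordKey = "hoodie.datasource.write.recordkey.field"
  val PreCombine = "hoodie.datasource.write.precombine.field"
  val Operation = "hoodie.datasource.write.operation"
  val InsertDropDups = "hoodie.datasource.write.insert.drop.duplicates"

  // Map an operation string to Op; anything unrecognized falls back to Upsert here.
  def parse(value: String): Op = value match {
    case "insert" => Insert
    case "bulk_insert" => BulkInsert
    case _ => Upsert
  }

  // userParams holds only what the user set explicitly (no defaults applied).
  def deduce(userParams: Map[String, String]): Op = {
    // "upsert" is the default when the user did not set an operation.
    val configured = parse(userParams.getOrElse(Operation, "upsert"))
    val dropDups = userParams.get(InsertDropDups).contains("true")

    if (dropDups && configured == Upsert) {
      // upsert is not applicable when duplicates are dropped on insert
      Insert
    } else if (!userParams.contains(Operation)) {
      if (!userParams.contains(RecordKey) && !userParams.contains(PreCombine)) {
        // no record key, no preCombine, no operation: append-only workload
        BulkInsert
      } else if (!userParams.contains(RecordKey)) {
        // no record key: rely on auto record key generation
        Insert
      } else {
        configured
      }
    } else {
      // an explicitly set operation is left alone
      configured
    }
  }
}
```

   With this sketch, `deduce(Map.empty)` picks `BulkInsert`, setting only the 
precombine field picks `Insert`, and setting a record key keeps the default 
`Upsert`. Any explicitly set operation, or any extra heuristic such as the file 
sizing configs mentioned above, is outside what the sketch changes.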
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
