jiangok2006 edited a comment on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-742662159


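   I am seeing a slow upsert on a tiny non-partitioned COW table: the initial insert of two rows takes about 5 seconds, but upserting a single row afterwards takes about 131 seconds: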
   
   ```
   import java.time.{Duration, ZonedDateTime}
   import org.apache.spark.sql.SaveMode

   val options = HudiOptionFactory.create(tableName = tableName,
         tableType = HudiTableType.COW,
         recordKeyField = "ulid",
         partitonPathField = "",
         preCombineField = "sid",
         insertDropDups = false,
         keyGeneratorType = HudiKeyGeneratorType.Nonpartitioned,
         partitionValueExtractorType = HudiPartitionValueExtractorType.NonPartitioned
       )
       val helper = HudiHelper(options, log)
   
       /**
        * insert perf
        */
       val listInfo = spark.createDataFrame(Seq(
         (Some(1), Some("11"), Some("111")),
         (Some(2), Some("11"), Some("444"))
       )).toDF("zid", "sid", "ulid")
       var t1 = ZonedDateTime.now
       helper.insert(listInfo, path, saveMode = SaveMode.Overwrite)
       var t2 = ZonedDateTime.now
       // Duration.between(t1, t2).getSeconds is about 5 seconds here
   
       /**
        * upsert perf
        */
       val listInfo2 = spark.createDataFrame(Seq(
         (Some(1), Some("22"), Some("111"))
       )).toDF("zid", "sid", "ulid")
   
       t1 = ZonedDateTime.now
       helper.upsert(
         df = listInfo2,
         path = path
       )
       t2 = ZonedDateTime.now
       // Duration.between(t1, t2).getSeconds is about 131 seconds here
   ```
   An upsert this slow negates the benefit of using upsert to partially update a large dataset.
   Thanks for any help.
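
   The numbers above come from `Duration.between(t1, t2).getSeconds`, which truncates to whole seconds. A minimal sketch of a helper that could report milliseconds instead is shown here; `Timing` and `timed` are hypothetical names introduced for illustration and are not part of Hudi or of the `HudiHelper` wrapper used above.

   ```scala
   import java.util.concurrent.TimeUnit

   object Timing {
     /** Runs `block` once and returns its result together with the
       * elapsed wall-clock time in milliseconds.
       * Hypothetical helper for illustration only. */
     def timed[A](block: => A): (A, Long) = {
       val start = System.nanoTime()   // monotonic clock, unaffected by wall-clock adjustments
       val result = block              // evaluate the by-name argument exactly once
       val elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
       (result, elapsedMs)
     }
   }
   ```

   Assuming the same `helper`, `listInfo2` and `path` as in the snippet above, the upsert could then be measured with `val (_, upsertMs) = Timing.timed { helper.upsert(df = listInfo2, path = path) }`, which would make the insert and upsert timings directly comparable at millisecond resolution.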


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

