[GitHub] [hudi] garyli1019 commented on a change in pull request #2296: [HUDI-1425] Performance loss with the additional hoodieRecords.isEmpt…

GitBox Mon, 14 Dec 2020 18:56:02 -0800


garyli1019 commented on a change in pull request #2296:
URL: https://github.com/apache/hudi/pull/2296#discussion_r543004430




##########
File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##########
@@ -320,4 +320,21 @@ class TestCOWDataSource extends HoodieClientTestBase {
 
     assertTrue(HoodieDataSourceHelpers.hasNewCommits(fs, basePath, "000"))
   }
+
+  @Test def testWithEmptyInput(): Unit = {
+    val inputDF1 = spark.read.json(spark.sparkContext.parallelize(Seq.empty[String], 1))
+    inputDF1.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+    assertTrue(HoodieDataSourceHelpers.hasNewCommits(fs, basePath, "000"))

Review comment:
       I don't think we should make an empty commit. Empty commits might
pollute the timeline, cleaning, compaction, etc., and trigger unexpected
behavior. WDYT?

##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -182,11 +182,6 @@ private[hudi] object HoodieSparkSqlWriter {
             } else {
               hoodieAllIncomingRecords
             }
-
-          if (hoodieRecords.isEmpty()) {

Review comment:
       Have you run any experiments on this, e.g. putting another Spark action
before this one?
   `rdd.isEmpty()` is the most efficient empty check I could find. Maybe we
can change the description of this action so users don't misunderstand it?
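
   To make the suggestion concrete, here is a minimal sketch of labeling the empty check so it shows up under its own description in the Spark UI. It is not Hudi code: `EmptyCheckSketch`, `isBatchEmpty`, and the description string are hypothetical names, and it assumes a local Spark session is acceptable for illustration. `SparkContext.setJobDescription` and `RDD.isEmpty()` are standard Spark calls.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper for illustration only; not part of Hudi.
object EmptyCheckSketch {

  // Runs the empty check under an explicit job description, so the time
  // spent computing the upstream lineage is not attributed to an unlabeled job.
  def isBatchEmpty[T](sc: SparkContext, records: RDD[T]): Boolean = {
    sc.setJobDescription("Checking whether the incoming batch is empty")
    try records.isEmpty()
    // Clear the description so later jobs on this thread are not mislabeled.
    // Note: this also drops any description that was set before the call.
    finally sc.setJobDescription(null)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-check-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val emptyBatch = sc.parallelize(Seq.empty[String], 1)
      val nonEmptyBatch = sc.parallelize(Seq("""{"id": 1}"""), 1)
      println(s"empty batch is empty: ${isBatchEmpty(sc, emptyBatch)}")
      println(s"non-empty batch is empty: ${isBatchEmpty(sc, nonEmptyBatch)}")
    } finally {
      spark.stop()
    }
  }
}
```

   The check itself is unchanged; only the label differs, which is the point of the suggestion. Whether the writer should then skip the commit for an empty batch is the separate question raised in the first comment.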




