garyli1019 commented on a change in pull request #2296:
URL: https://github.com/apache/hudi/pull/2296#discussion_r543004430
##########
File path:
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##########
@@ -320,4 +320,21 @@ class TestCOWDataSource extends HoodieClientTestBase {
assertTrue(HoodieDataSourceHelpers.hasNewCommits(fs, basePath, "000"))
}
+
+ @Test def testWithEmptyInput(): Unit = {
+ val inputDF1 = spark.read.json(spark.sparkContext.parallelize(Seq.empty[String], 1))
+ inputDF1.write.format("org.apache.hudi")
+ .options(commonOpts)
+ .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+ .mode(SaveMode.Overwrite)
+ .save(basePath)
+ assertTrue(HoodieDataSourceHelpers.hasNewCommits(fs, basePath, "000"))
Review comment:
I don't think we should create an empty commit. Empty commits could
pollute the timeline and affect cleaning, compaction, etc., which may trigger
unexpected behavior. WDYT?
##########
File path:
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -182,11 +182,6 @@ private[hudi] object HoodieSparkSqlWriter {
} else {
hoodieAllIncomingRecords
}
-
- if (hoodieRecords.isEmpty()) {
Review comment:
Have you run any experiments on this, for example by putting another Spark
action before this check?
`rdd.isEmpty()` is the most efficient way I could find to do the empty
check. Maybe we can change the description of this action so users won't
misunderstand it?
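To make the suggestion about the action description concrete, here is a rough sketch, not a proposal for the exact code. It assumes the job is labeled through `SparkContext.setJobDescription`, which controls the text shown for the job in the Spark UI. The object `EmptyInputCheck`, the helper `isEmptyWithDescription`, and the description string are hypothetical names used only for illustration; none of them exist in Hudi.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper for illustration only; not part of Hudi.
object EmptyInputCheck {

  // Runs rdd.isEmpty() under an explicit job description so the Spark UI
  // shows what the job is for, then restores the previous description.
  def isEmptyWithDescription[T](spark: SparkSession, records: RDD[T], description: String): Boolean = {
    val sc = spark.sparkContext
    // "spark.job.description" is the local property behind setJobDescription.
    val previous = sc.getLocalProperty("spark.job.description")
    sc.setJobDescription(description)
    try {
      records.isEmpty()
    } finally {
      // Passing null clears the property if no description was set before.
      sc.setJobDescription(previous)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-input-check")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val empty = sc.parallelize(Seq.empty[String], 1)
      val nonEmpty = sc.parallelize(Seq("record"), 1)
      val description = "Checking whether incoming records are empty"
      assert(isEmptyWithDescription(spark, empty, description))
      assert(!isEmptyWithDescription(spark, nonEmpty, description))
    } finally {
      spark.stop()
    }
  }
}
```

In the writer, `hoodieRecords` would take the place of `records`. This only changes how the job is labeled in the UI; whatever upstream work the check triggers still runs, so it would not replace the experiment asked about above.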