[
https://issues.apache.org/jira/browse/HUDI-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376975#comment-17376975
]
ASF GitHub Bot commented on HUDI-1425:
--------------------------------------
vinothchandar commented on a change in pull request #2296:
URL: https://github.com/apache/hudi/pull/2296#discussion_r665832216
##########
File path:
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java
##########
@@ -932,15 +932,10 @@ public void testFilterDupes() throws Exception {
      ds2.sync();
      mClient = new HoodieTableMetaClient(jsc.hadoopConfiguration(), tableBasePath, true);
      HoodieInstant newLastFinished = mClient.getCommitsTimeline().filterCompletedInstants().lastInstant().get();
-     assertTrue(HoodieTimeline.compareTimestamps(newLastFinished.getTimestamp(), HoodieTimeline.GREATER_THAN, lastFinished.getTimestamp()
+     // there is not new commit generate for empty commits
Review comment:
we should actually have this generate an empty commit and test it. If we
don't, the checkpoints won't move.
Consider this scenario: DeltaStreamer reads from Kafka using a custom
transformer. If the transformer filters out all records from Kafka, we will
have empty input for the write, but the Kafka offsets still have to move ahead.
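The scenario above can be sketched in plain Java (hypothetical names, not Hudi's actual API): even when the transformer drops every record, a commit must still be written so the checkpoint stored in commit metadata advances past the consumed offsets.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class CheckpointSketch {
    // Last source offset persisted in commit metadata (stands in for
    // DeltaStreamer's checkpoint; a Kafka offset in this scenario).
    static long checkpoint = 0;

    // One sync round: read a batch, apply the transformer, commit.
    static void runRound(List<String> batch, Predicate<String> transformer, long endOffset) {
        List<String> transformed = batch.stream().filter(transformer).collect(Collectors.toList());
        // Commit even when `transformed` is empty: the checkpoint is stored
        // with the commit, so skipping it would re-read the same offsets.
        checkpoint = endOffset;
    }

    public static void main(String[] args) {
        // Transformer filters out all records; offsets still move ahead.
        runRound(List.of("a", "b", "c"), s -> false, 3);
        System.out.println(checkpoint);
    }
}
```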
##########
File path:
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##########
@@ -348,4 +348,23 @@ class TestCOWDataSource extends HoodieClientTestBase {
assertTrue(HoodieDataSourceHelpers.hasNewCommits(fs, basePath, "000"))
}
+
+  @Test def testWithEmptyInput(): Unit = {
+    val inputDF1 = spark.read.json(spark.sparkContext.parallelize(Seq.empty[String], 1))
+    inputDF1.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+    // Empty commit does not has a new commit
Review comment:
empty input, you mean?
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##########
@@ -173,6 +173,10 @@ public boolean commitStats(String instantTime, List<HoodieWriteStat> stats, Opti
   public boolean commitStats(String instantTime, List<HoodieWriteStat> stats, Option<Map<String, String>> extraMetadata,
                              String commitActionType, Map<String, List<String>> partitionToReplaceFileIds) {
+    // Skip the empty commit
+    if (stats.isEmpty()) {
Review comment:
let's control this using a new config?
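A minimal sketch of what such a config-gated skip could look like (the flag and class here are hypothetical, not an existing Hudi option; in Hudi the flag would come from the write config):

```java
import java.util.Collections;
import java.util.List;

public class EmptyCommitPolicy {
    // Hypothetical flag deciding whether empty write results still commit.
    private final boolean allowEmptyCommit;

    public EmptyCommitPolicy(boolean allowEmptyCommit) {
        this.allowEmptyCommit = allowEmptyCommit;
    }

    // Decide whether commitStats should proceed for the given write stats.
    public boolean shouldCommit(List<?> stats) {
        // Non-empty stats always commit; empty stats commit only when the
        // flag allows it (e.g. so streaming checkpoints keep advancing).
        return !stats.isEmpty() || allowEmptyCommit;
    }

    public static void main(String[] args) {
        System.out.println(new EmptyCommitPolicy(true).shouldCommit(Collections.emptyList()));
        System.out.println(new EmptyCommitPolicy(false).shouldCommit(Collections.emptyList()));
    }
}
```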
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
> Performance loss with the additional hoodieRecords.isEmpty() in
> HoodieSparkSqlWriter#write
> ------------------------------------------------------------------------------------------
>
> Key: HUDI-1425
> URL: https://issues.apache.org/jira/browse/HUDI-1425
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Affects Versions: 0.9.0
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Attachments: 截屏2020-11-30 下午9.47.55.png
>
>
> Currently in HoodieSparkSqlWriter#write, there is an _isEmpty()_ check on
> _hoodieRecords_. This can be a heavy operation when _hoodieRecords_ is
> built from a complex chain of RDD transformations.
> !截屏2020-11-30 下午9.47.55.png|width=1255,height=161!
> IMO this check does nothing to improve performance, but rather hurts it.
>
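On why that isEmpty() check is costly: RDD.isEmpty() has to run a Spark job over at least the first partition, which forces the upstream transformation chain to execute. A self-contained Java-stream analogy (standing in for the RDD; no Spark involved):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class IsEmptyCost {
    public static void main(String[] args) {
        AtomicInteger evaluated = new AtomicInteger();
        // Lazy pipeline standing in for hoodieRecords; the map stage
        // stands in for the expensive upstream RDD transformations.
        Stream<Integer> records = Stream.of(1, 2, 3)
                .map(i -> { evaluated.incrementAndGet(); return i * 10; });
        // The emptiness check must pull at least one element through the
        // whole pipeline; it is not a free metadata lookup.
        boolean empty = records.findFirst().isEmpty();
        System.out.println(empty);
        System.out.println(evaluated.get());
    }
}
```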
--
This message was sent by Atlassian Jira
(v8.3.4#803005)