liuhe0702 edited a comment on pull request #4012: URL: https://github.com/apache/hudi/pull/4012#issuecomment-980482343
> Sorry, still not following. When you say different jobs are started - you are referring to the duplicate jobs you see? tbh given the input `writeResult.getWriteStatuses.rdd` is cached already, not sure if its a needle mover one way or other.

@vinothchandar All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. In the scenario we're talking about, the data is written to data files only after `isEmpty` or `count` is executed, and data in all partitions needs to be written to data files during the save. Therefore, the total number of tasks to be executed for the `isEmpty` method is the same as for the `count` method.

Assume that 1000 tasks need to be executed and 1000 computing resources are available. The `isEmpty` method will divide the 1000 tasks into six batches of 1, 5, 25, 125, 625 and 219 tasks (as shown in the preceding figure, three tasks are divided into 2 batches containing 1 and 2 tasks). In the `isEmpty` method the batches are executed serially: the next batch of tasks starts only after the previous batch has finished. In the `count` method, all 1000 tasks are executed as a single batch. When computing resources are sufficient, the time consumed by `isEmpty` is therefore about six times that consumed by `count`.
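To make the batching behavior concrete, here is a minimal, self-contained sketch (not the Hudi code path; `IsEmptyVsCount`, the partition counts, the `Thread.sleep` cost, and the all-empty-partitions setup are illustrative assumptions). It shows the worst case the numbers above describe: on a cached-but-not-yet-materialized RDD, `isEmpty` (which is `take(1)` under the hood) scans partitions in serially growing batches, while `count` schedules every partition as one job.

```scala
import org.apache.spark.sql.SparkSession

object IsEmptyVsCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("isEmpty-vs-count")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for writeResult.getWriteStatuses.rdd: 1000 partitions of
    // expensive, lazy work. filter(_ => false) leaves every partition empty,
    // which forces isEmpty to keep scanning until all partitions have run.
    val statuses = sc.parallelize(1 to 1000, 1000)
      .map { i => Thread.sleep(10); i } // simulate the per-task write cost
      .filter(_ => false)
      .cache()

    // isEmpty runs take(1): Spark launches a job on 1 partition first, then
    // retries with batches grown by spark.rdd.limit.scaleUpFactor. Each batch
    // is a separate job that waits for the previous one, even if executors
    // are sitting idle.
    var t = System.nanoTime()
    println(s"isEmpty=${statuses.isEmpty()} took ${(System.nanoTime() - t) / 1e6} ms")

    // Drop the cached blocks so count also measures a full recomputation.
    statuses.unpersist(blocking = true)

    // count launches a single job over all 1000 partitions at once; with
    // enough cores the stage finishes in roughly one wave of tasks.
    t = System.nanoTime()
    println(s"count=${statuses.count()} took ${(System.nanoTime() - t) / 1e6} ms")

    spark.stop()
  }
}
```

Under these assumptions, the `isEmpty` timing is dominated by the serial batches, while `count` finishes in roughly one wave, which is the gap being argued here.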
