liuhe0702 edited a comment on pull request #4012: URL: https://github.com/apache/hudi/pull/4012#issuecomment-980482343
> Sorry, still not following. When you say different jobs are started - you are referring to the duplicate jobs you see? tbh given the input `writeResult.getWriteStatuses.rdd` is cached already, not sure if its a needle mover one way or other.

@vinothchandar All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. In the scenario we're talking about, the data is written to data files only after `isEmpty` or `count` is executed, and data in all partitions needs to be written to data files during the save. Therefore, the total number of tasks to be executed for the `isEmpty` method is the same as for the `count` method.

Assume that 1000 tasks need to be executed and 1000 computing resources are available. The `isEmpty` method will divide the 1000 tasks into six batches of 1, 5, 25, 125, 625 and 219 tasks (as shown in the preceding figure, three tasks are divided into 2 batches containing 1 and 2 tasks). In the `isEmpty` method the batches are executed serially: the next batch of tasks starts only after the previous batch has finished. In the `count` method, all 1000 tasks are executed as a single batch. When computing resources are sufficient, the time consumed by `isEmpty` is therefore about six times that consumed by `count`.
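To make the batching behavior concrete, here is a minimal, self-contained sketch (not the Hudi code path; `IsEmptyVsCount`, the partition counts, the `Thread.sleep` cost, and the all-empty-partitions setup are illustrative assumptions). It shows the worst case the numbers above describe: on a cached-but-not-yet-materialized RDD, `isEmpty` (which is `take(1)` under the hood) scans partitions in serially growing batches, while `count` schedules every partition as one job.

```scala
import org.apache.spark.sql.SparkSession

object IsEmptyVsCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("isEmpty-vs-count")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for writeResult.getWriteStatuses.rdd: 1000 partitions of
    // expensive, lazy work. filter(_ => false) leaves every partition empty,
    // which forces isEmpty to keep scanning until all partitions have run.
    val statuses = sc.parallelize(1 to 1000, 1000)
      .map { i => Thread.sleep(10); i } // simulate the per-task write cost
      .filter(_ => false)
      .cache()

    // isEmpty runs take(1): Spark launches a job on 1 partition first, then
    // retries with batches grown by spark.rdd.limit.scaleUpFactor. Each batch
    // is a separate job that waits for the previous one, even if executors
    // are sitting idle.
    var t = System.nanoTime()
    println(s"isEmpty=${statuses.isEmpty()} took ${(System.nanoTime() - t) / 1e6} ms")

    // Drop the cached blocks so count also measures a full recomputation.
    statuses.unpersist(blocking = true)

    // count launches a single job over all 1000 partitions at once; with
    // enough cores the stage finishes in roughly one wave of tasks.
    t = System.nanoTime()
    println(s"count=${statuses.count()} took ${(System.nanoTime() - t) / 1e6} ms")

    spark.stop()
  }
}
```

Under these assumptions, the `isEmpty` timing is dominated by the serial batches, while `count` finishes in roughly one wave, which is the gap being argued here.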
