jenu9417 commented on issue #1528:
URL: https://github.com/apache/incubator-hudi/issues/1528#issuecomment-616968792


   @vinothchandar 
   Thanks for the detailed reply.
   As you pointed out, premature termination of the job seems to be the problem. Since this was a POC dry run, I was using a timer to close the job after x seconds, which appears to terminate it before the write phase finishes.
   
   But now the question is why the write is taking more than 40 seconds, even for as few as 10 records with an average record size of under 1 KB.
   
   ```
   91975 [dispatcher-event-loop-0] INFO  
org.apache.spark.scheduler.TaskSetManager  - Starting task 625.0 in stage 19.0 
(TID 11130, localhost, executor driver, partition 625, PROCESS_LOCAL, 7193 
bytes)
   91975 [Executor task launch worker for task 11130] INFO  
org.apache.spark.executor.Executor  - Running task 625.0 in stage 19.0 (TID 
11130)
   91975 [task-result-getter-0] INFO  org.apache.spark.scheduler.TaskSetManager 
 - Finished task 624.0 in stage 19.0 (TID 11129) in 16 ms on localhost 
(executor driver) (624/1500)
   ```
   From the logs, the set of lines above kept repeating. The stage number was increasing, and the same 1500 tasks were run again and again. I presume these 1500 are the partitions of the RDD? If so, is it possible/advisable to reduce the number of partitions in the RDD?
   
   And what would be the general suggestions for speeding up the write here?
   
   Happy to provide any other supporting data if needed.

