jenu9417 commented on issue #1528: URL: https://github.com/apache/incubator-hudi/issues/1528#issuecomment-616968792
@vinothchandar Thanks for the detailed reply. As you pointed out, premature termination of the job seems to be the problem. Since this was a POC dry run, I was using a timer to close the job after x seconds, which killed the job before the write phase finished.

The remaining question is why the write takes more than 40 seconds for something as small as 10 records, with an average record size under 1 KB.

```
91975 [dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 625.0 in stage 19.0 (TID 11130, localhost, executor driver, partition 625, PROCESS_LOCAL, 7193 bytes)
91975 [Executor task launch worker for task 11130] INFO org.apache.spark.executor.Executor - Running task 625.0 in stage 19.0 (TID 11130)
91975 [task-result-getter-0] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 624.0 in stage 19.0 (TID 11129) in 16 ms on localhost (executor driver) (624/1500)
```

In the logs, the set of lines above kept repeating: the stage number kept increasing, and the same 1500 tasks ran again and again. I presume these 1500 are the partitions of the RDD? If so, is it possible/advisable to reduce the number of partitions in the RDD? And what would be your general suggestions to speed up the write here? Happy to provide any other supporting data if needed.
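For what it's worth, a sketch of how the partition count could be reduced via Hudi's write configs. The option keys below (`hoodie.upsert.shuffle.parallelism` etc.) are real Hudi write configs whose defaults control the number of shuffle tasks; the table name and path are illustrative placeholders, and the exact default parallelism may differ by Hudi version:

```python
# Hypothetical sketch: lowering Hudi's shuffle parallelism for a small write.
# The 1500 tasks seen in the logs likely come from Hudi's default shuffle
# parallelism; for a 10-record write, a small value avoids spawning
# thousands of near-empty tasks.
hudi_options = {
    "hoodie.table.name": "my_table",               # placeholder table name
    "hoodie.datasource.write.operation": "upsert",
    # Shuffle parallelism for the different write operations; kept tiny
    # because the input is only a handful of records.
    "hoodie.upsert.shuffle.parallelism": "2",
    "hoodie.insert.shuffle.parallelism": "2",
    "hoodie.bulkinsert.shuffle.parallelism": "2",
}

# Usage (assuming `df` is a small Spark DataFrame and the Hudi bundle is
# on the classpath; path is a placeholder):
# df.write.format("hudi").options(**hudi_options) \
#     .mode("append").save("/tmp/my_table")
```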
