lamber-ken edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-620246831


   ## The latest Spark log (master branch)
   
   ### Simplified original Spark log
   ```
   // Upsert part
   Warning: Ignoring non-spark config property: 
"spark.sql.hive.convertMetastoreParquet=false"
   Warning: Ignoring non-spark config property: 
"spark.serializer=org.apache.spark.serializer.KryoSerializer"
   20/04/26 17:35:58 WARN S3CryptoModuleAE: Unable to detect encryption 
information for object 'jar' in bucket '<storageBucket>'. Returning object 
without decryption.
   
   // Start rollback
   20/04/26 17:36:34 INFO HoodieWriteClient: Begin rollback of instant 
20200426170940
   20/04/26 17:36:34 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from <storageLocation>
   20/04/26 17:36:34 INFO FSUtils: Hadoop Configuration: fs.defaultFS: 
[hdfs://ip-10-0-55-99.ec2.internal:8020]
   
   // Output repeatedly
   20/04/26 17:36:40 INFO FileStatusExt: Metadata Entry doesn't exist
   ...
   20/04/26 18:19:07 INFO FileStatusExt: Metadata Entry doesn't exist
   
   // Finish
   20/04/26 18:19:08 INFO CleanActionExecutor: Total Partitions to clean : 
6052, with policy KEEP_LATEST_COMMITS
   20/04/26 18:19:08 INFO CleanActionExecutor: Using cleanerParallelism: 6000
   20/04/26 18:19:08 INFO SparkContext: Starting job: collect at 
CleanActionExecutor.java:87
   20/04/26 18:19:08 INFO DAGScheduler: Got job 10 (collect at 
CleanActionExecutor.java:87) with 6000 output partitions
   20/04/26 18:19:08 INFO DAGScheduler: Final stage: ResultStage 30 (collect at 
CleanActionExecutor.java:87)
   20/04/26 18:19:08 INFO DAGScheduler: Parents of final stage: List()
   20/04/26 18:19:08 INFO DAGScheduler: Missing parents: List()
   20/04/26 18:19:08 INFO DAGScheduler: Submitting ResultStage 30 
(MapPartitionsRDD[63] at map at CleanActionExecutor.java:86), which has no 
missing parents
   20/04/26 18:19:08 INFO MemoryStore: Block broadcast_20 stored as values in 
memory (estimated size 247.7 KB, free 1008.3 MB)
   20/04/26 18:19:08 INFO MemoryStore: Block broadcast_20_piece0 stored as 
bytes in memory (estimated size 87.1 KB, free 1008.2 MB)
   20/04/26 18:19:08 INFO BlockManagerInfo: Added broadcast_20_piece0 in memory 
on ip-10-0-55-99.ec2.internal:44415 (size: 87.1 KB, free: 1008.8 MB)
   20/04/26 18:19:08 INFO SparkContext: Created broadcast 20 from broadcast at 
DAGScheduler.scala:1201
   20/04/26 18:19:08 INFO DAGScheduler: Submitting 6000 missing tasks from 
ResultStage 30 (MapPartitionsRDD[63] at map at CleanActionExecutor.java:86) 
(first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
12, 13, 14))
   20/04/26 18:19:08 INFO YarnScheduler: Adding task set 30.0 with 6000 tasks
   20/04/26 18:19:23 INFO YarnScheduler: Removed TaskSet 32.0, whose tasks have 
all completed, from pool 
   20/04/26 18:19:23 INFO DAGScheduler: ResultStage 32 (collect at 
CleanActionExecutor.java:155) finished in 3.444 s
   20/04/26 18:19:23 INFO DAGScheduler: Job 11 finished: collect at 
CleanActionExecutor.java:155, took 8.313284 s
   20/04/26 18:19:23 INFO HoodieActiveTimeline: Loaded instants 
[[20200425201231__clean__COMPLETED], [20200425201231__commit__COMPLETED], 
[==>20200426173625__clean__INFLIGHT], [20200426173625__commit__COMPLETED]]
   20/04/26 18:19:24 INFO HoodieActiveTimeline: Checking for file exists 
?<storageLocation>.hoodie/20200426173625.clean.inflight
   20/04/26 18:19:24 INFO CSEMultipartUploadOutputStream: close closed:false 
<storageLocation>.hoodie/20200426173625.clean
   20/04/26 18:19:24 INFO DefaultMultipartUploadDispatcher: Completed multipart 
upload of 1 parts 273450 bytes
   20/04/26 18:19:24 INFO CSEMultipartUploadOutputStream: Finished uploading 
<storageLocation>.hoodie/20200426173625.clean. Elapsed seconds: 0.
   20/04/26 18:19:24 INFO HoodieActiveTimeline: Create new file for toInstant 
?<storageLocation>.hoodie/20200426173625.clean
   20/04/26 18:19:24 INFO CleanActionExecutor: Marked clean started on 
20200426173625 as complete
   20/04/26 18:19:24 INFO AbstractHoodieWriteClient: Committed 20200426173625
   20/04/26 18:19:24 INFO HoodieSparkSqlWriter$: Commit 20200426173625 
successful!
   20/04/26 18:19:28 INFO SparkContext: Invoking stop() from shutdown hook
   20/04/26 18:19:28 INFO SparkUI: Stopped Spark web UI at ip
   20/04/26 18:19:28 INFO YarnClientSchedulerBackend: Interrupting monitor 
thread
   20/04/26 18:19:28 INFO YarnClientSchedulerBackend: Shutting down all 
executors
   20/04/26 18:19:28 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each 
executor to shut down
   20/04/26 18:19:28 INFO SchedulerExtensionServices: Stopping 
SchedulerExtensionServices
   (serviceOption=None,
    services=List(),
    started=false)
   20/04/26 18:19:28 INFO YarnClientSchedulerBackend: Stopped
   20/04/26 18:19:28 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
   20/04/26 18:19:28 INFO MemoryStore: MemoryStore cleared
   20/04/26 18:19:28 INFO BlockManager: BlockManager stopped
   20/04/26 18:19:28 INFO BlockManagerMaster: BlockManagerMaster stopped
   20/04/26 18:19:28 INFO 
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
   20/04/26 18:19:28 INFO SparkContext: Successfully stopped SparkContext
   20/04/26 18:19:28 INFO ShutdownHookManager: Shutdown hook called
   20/04/26 18:19:28 INFO ShutdownHookManager: Deleting directory 
/mnt/tmp/spark-e12c26c7-150a-4837-8fb2-72db585dd2f3
   20/04/26 18:19:28 INFO ShutdownHookManager: Deleting directory 
/mnt/tmp/spark-1a4b587a-a177-4cab-8f93-6849d64386d9
   Command exiting with ret '0'
   ```
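   
   For reference, the cleaner settings visible at the end of the log (`KEEP_LATEST_COMMITS`, `cleanerParallelism: 6000`) map to standard Hudi write options. Below is a minimal sketch of an upsert with those options set explicitly, assuming a Spark `DataFrame` named `df`; the table name, record key, and partition field are placeholders, not values taken from this report:
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // Placeholders: table name, key and partition fields are illustrative only.
   df.write
     .format("hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.partitionpath.field", "dt")
     .option("hoodie.datasource.write.operation", "upsert")
     // Cleaner settings matching the log above
     .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
     .option("hoodie.cleaner.parallelism", "6000")
     .mode(SaveMode.Append)
     .save("<storageLocation>")
   ```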
   
   ### Analysis
   
   
![image](https://user-images.githubusercontent.com/20113411/80422808-6b8be100-8911-11ea-9dd1-bfa6b0bafd9c.png)
   
   ### Analysis about step 2
   
   The `Rollback` step took about 42 minutes (17:36 to 18:19), repeatedly logging `FileStatusExt: Metadata Entry doesn't exist`.
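   
   One way to check whether those 42 minutes are dominated by per-file metadata lookups (rather than the rollback deletes themselves) is to time a raw recursive listing of the table path outside of Hudi. Here is a minimal sketch using the Hadoop `FileSystem` API, runnable from `spark-shell`; `<storageLocation>` is the placeholder from the log, and this is only a diagnostic assumption, not a confirmed root cause:
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   
   // Time a recursive listing of the table path to see how long raw
   // file-status calls take on this storage layer.
   val tablePath = new Path("<storageLocation>")
   val fs = tablePath.getFileSystem(new Configuration())
   
   val start = System.nanoTime()
   val iter = fs.listFiles(tablePath, true) // recursive
   var count = 0L
   while (iter.hasNext) { iter.next(); count += 1 }
   val elapsedSec = (System.nanoTime() - start) / 1e9
   
   println(s"Listed $count files in $elapsedSec s")
   ```
   
   If the raw listing alone takes a comparable amount of time for the ~6000 partitions, the bottleneck is the storage-layer file-status calls rather than Hudi's rollback logic itself.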

