[ https://issues.apache.org/jira/browse/SPARK-33753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
dzcxzl updated SPARK-33753:
---------------------------
    Attachment: current_job_finish_time.png

> Reduce the memory footprint and GC of the cache (hadoopJobMetadata)
> -------------------------------------------------------------------
>
>                 Key: SPARK-33753
>                 URL: https://issues.apache.org/jira/browse/SPARK-33753
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>            Reporter: dzcxzl
>            Priority: Minor
>         Attachments: current_job_finish_time.png
>
>
> HadoopRDD uses a soft-reference map to cache JobConf objects (rdd_id -> jobconf).
> When the driver reads a large number of Hive partitions,
> HadoopRDD.getPartitions creates many JobConfs and adds them all to the cache.
> Each executor also creates a JobConf, adds it to the cache, and shares it
> among executors.
> The JobConfs accumulating in the driver's cache increase memory pressure.
> When the driver memory configuration is low, full GC runs frequently,
> yet these JobConfs are hardly ever reused.
> For example, with spark.driver.memory=2560m, about 14,000 partitions read,
> and each JobConf around 96 KB.
> The following is a before/after comparison of the fix: total full GC time
> dropped from 62s to 0.8s, and the number of full GCs from 31 to 5. The
> driver also allocated less memory (Old Gen 1.667G -> 968M).
>
> Current:
> !image-2020-12-11-16-17-28-991.png!
> jstat -gcutil PID 2s
> !image-2020-12-11-16-08-53-656.png!
> !image-2020-12-11-16-10-07-363.png!
>
> After changing softValues to weakValues:
> !image-2020-12-11-16-11-26-673.png!
> !image-2020-12-11-16-11-35-988.png!
> !image-2020-12-11-16-12-22-035.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
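The soft-vs-weak distinction driving this report can be sketched in plain JDK references (a minimal illustration, not Spark's actual hadoopJobMetadata implementation; the class and names below are hypothetical). A SoftReference is cleared only when the JVM is under memory pressure, so rarely-reused entries can linger in the old generation and inflate full-GC work, whereas a WeakReference is eligible for clearing on the next GC once no strong reference remains:

```java
import java.lang.ref.Reference;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy cache that holds its values behind soft or weak references.
 * With soft values, entries survive until the heap is under pressure;
 * with weak values, entries can be reclaimed by any GC cycle once the
 * caller drops its own strong reference to the value.
 */
class RefValueCache<K, V> {
    private final Map<K, Reference<V>> map = new ConcurrentHashMap<>();
    private final boolean weakValues;

    RefValueCache(boolean weakValues) {
        this.weakValues = weakValues;
    }

    void put(K key, V value) {
        map.put(key, weakValues
                ? new WeakReference<>(value)
                : new SoftReference<>(value));
    }

    V get(K key) {
        Reference<V> ref = map.get(key);
        // null if the key was never cached or the value was collected
        return ref == null ? null : ref.get();
    }
}
```

For a workload like the one described (thousands of JobConfs created once during getPartitions and rarely read again), weak values let the collector reclaim them promptly instead of holding them until a full GC is forced.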