[ https://issues.apache.org/jira/browse/SPARK-29181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933926#comment-16933926 ]
Dongjoon Hyun commented on SPARK-29181:
---------------------------------------
Hi, [~viirya]. Is this a duplicate of SPARK-29182? If so, could you close this
one, since you made a PR there?
> Cache preferred locations of checkpointed RDD
> ---------------------------------------------
>
> Key: SPARK-29181
> URL: https://issues.apache.org/jira/browse/SPARK-29181
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Liang-Chi Hsieh
> Priority: Major
>
> One Spark job in our cluster fits many ALS models in parallel. The fitting
> goes well, but in the next step, when we union all the factors, the union
> operation is very slow.
> Looking at the driver stack dump, it appears that the driver spends a lot of
> time computing preferred locations. Because we checkpoint the training data
> before fitting ALS, the time is spent in
> ReliableCheckpointRDD.getPreferredLocations. This method calls the DFS
> interface to query file status and block locations. Because we have a large
> number of partitions derived from the checkpointed RDD, the union spends a
> lot of time querying the same information repeatedly.
> This proposes adding a Spark config to control the caching behavior of
> ReliableCheckpointRDD.getPreferredLocations. If it is enabled,
> getPreferredLocations computes the preferred locations only once and caches
> them for later use.
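
To illustrate the proposal, here is a minimal sketch of the caching idea in
plain Python. It is not Spark code and not the patch itself: the class
CheckpointLocations, the cache_enabled flag (standing in for the proposed
config), and fake_dfs_lookup (standing in for the DFS file status and block
location query) are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List


class CheckpointLocations:
    """Hypothetical stand-in for the lookup behind getPreferredLocations."""

    def __init__(self, lookup: Callable[[int], List[str]], cache_enabled: bool):
        self._lookup = lookup                # expensive call, e.g. a DFS query
        self._cache_enabled = cache_enabled  # plays the role of the proposed config
        self._cache: Dict[int, List[str]] = {}

    def preferred_locations(self, partition: int) -> List[str]:
        if not self._cache_enabled:
            # Caching disabled: every call goes back to the DFS.
            return self._lookup(partition)
        if partition not in self._cache:
            # First request for this partition: query once and remember it.
            self._cache[partition] = self._lookup(partition)
        return self._cache[partition]


calls: List[int] = []  # records every simulated DFS query


def fake_dfs_lookup(partition: int) -> List[str]:
    """Simulated DFS query; a real one would be a remote call."""
    calls.append(partition)
    return [f"host-{partition % 3}"]


cached = CheckpointLocations(fake_dfs_lookup, cache_enabled=True)
for _ in range(1000):
    cached.preferred_locations(7)
print(len(calls))  # 1: the simulated DFS was queried only once
```

With cache_enabled=False the same loop would issue 1000 simulated queries,
which mirrors the repeated lookups described above. A real implementation
would also have to decide when cached locations become stale, which this
sketch ignores.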