[jira] [Commented] (SPARK-43408) Spark caching in the context of a single job

Hyukjin Kwon (Jira) Sun, 14 May 2023 18:58:44 -0700


    [ https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722612#comment-17722612 ]


Hyukjin Kwon commented on SPARK-43408:
--------------------------------------

If you want to cache your data across multiple jobs, you should use checkpoint.
If you intend to cache your data within one job, you should use cache.

Caching on disk is similar to checkpointing, so I would expect similar 
performance.
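
To illustrate the within-one-job case, here is a toy sketch in plain Python. It is not Spark code: the Node class and the build function are hypothetical names invented for this example. It assumes a plan in which one expensive intermediate result feeds two branches of the same action, with no shuffle boundary in between, so there is no shuffle output to reuse. Under that assumption the uncached lineage evaluates the shared step once per branch, while marking it as cached evaluates it once.

```python
# Toy model of lazy lineage. This is NOT Spark's API; every name is hypothetical.


class Node:
    """A lazily evaluated dataset: nothing runs until collect() is called."""

    def __init__(self, fn, parents=()):
        self.fn = fn
        self.parents = parents
        self.evaluations = 0   # how many times this node's own work actually ran
        self.cached = False
        self._stored = None

    def cache(self):
        # Only marks the node for reuse; nothing is stored until it is computed.
        self.cached = True
        return self

    def _compute(self):
        # Reuse the stored result if this node was cached and already computed.
        if self.cached and self._stored is not None:
            return self._stored
        inputs = [parent._compute() for parent in self.parents]
        self.evaluations += 1
        result = self.fn(*inputs)
        if self.cached:
            self._stored = result
        return result

    def collect(self):
        # The single "action" that triggers evaluation of the whole lineage.
        return self._compute()


def build(use_cache):
    """Build one job whose two branches both read the same expensive step."""
    source = Node(lambda: list(range(5)))
    expensive = Node(lambda rows: [r * r for r in rows], (source,))
    if use_cache:
        expensive.cache()
    evens = Node(lambda rows: [r for r in rows if r % 2 == 0], (expensive,))
    odds = Node(lambda rows: [r for r in rows if r % 2 == 1], (expensive,))
    both = Node(lambda a, b: a + b, (evens, odds))
    return expensive, both


expensive_plain, job_plain = build(use_cache=False)
result_plain = job_plain.collect()

expensive_cached, job_cached = build(use_cache=True)
result_cached = job_cached.collect()

# Evaluations of the shared step: without cache, then with cache.
print(expensive_plain.evaluations, expensive_cached.evaluations)
```

Whether a real Spark query behaves this way depends on its physical plan, so inspecting the plan with explain() is the safer guide than this sketch.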

> Spark caching in the context of a single job
> --------------------------------------------
>
>                 Key: SPARK-43408
>                 URL: https://issues.apache.org/jira/browse/SPARK-43408
>             Project: Spark
>          Issue Type: Question
>          Components: Shuffle
>    Affects Versions: 3.3.1
>            Reporter: Faiz Halde
>            Priority: Trivial
>
> Does caching benefit a Spark job with only a single action in it? Spark, IIRC, 
> already optimizes shuffles by persisting them to disk.
> I am unable to find a counter-example where caching would benefit a job with 
> a single action. In every case I can think of, the shuffle checkpoint acts as 
> a good enough caching mechanism in itself.
> FWIW, I am talking specifically in the context of the DataFrame API. The 
> StorageLevel allowed in my case is DISK_ONLY, i.e. I am not looking to speed 
> things up by caching data in memory.
> To rephrase: is DISK_ONLY caching better than, or the same as, shuffle 
> checkpointing in the context of a single action?



