[ https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759821#comment-17759821 ]

Yauheni Audzeichyk commented on SPARK-44900:
--------------------------------------------

[~yxzhang] it looks like this is just a disk usage tracking issue, as disk space 
is not actually used as much.

However, it affects the effectiveness of the cached data: Spark spills it to 
disk because it believes the data no longer fits in memory, so eventually it 
becomes 100% stored on disk.

> Cached DataFrame keeps growing
> ------------------------------
>
>                 Key: SPARK-44900
>                 URL: https://issues.apache.org/jira/browse/SPARK-44900
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Varun Nalla
>            Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application where data lookups are performed by 
> joining against another DataFrame which is cached, and the caching strategy 
> is MEMORY_AND_DISK.
> However, the size of the cached DataFrame keeps growing with every micro 
> batch the streaming application processes, and this is visible under the 
> Storage tab.
> A similar Stack Overflow thread was already raised:
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
