[ https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759821#comment-17759821 ]
Yauheni Audzeichyk commented on SPARK-44900:
--------------------------------------------

[~yxzhang] This looks like just a disk usage tracking issue, since the disk space is not actually being used that much. However, it affects the effectiveness of the cached data: Spark spills it to disk because it believes the data no longer fits in memory, so eventually it becomes 100% stored on disk.

> Cached DataFrame keeps growing
> ------------------------------
>
>                 Key: SPARK-44900
>                 URL: https://issues.apache.org/jira/browse/SPARK-44900
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Varun Nalla
>            Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application where data lookups happen by
> joining another DF which is cached, and the caching strategy is
> MEMORY_AND_DISK.
> However, the size of the cached DataFrame keeps growing with every micro
> batch the streaming application processes, and that is visible under the
> Storage tab.
> A similar Stack Overflow thread was already raised:
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing

--
This message was sent by Atlassian Jira
(v8.20.10#820010)