[
https://issues.apache.org/jira/browse/SPARK-55416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Casimir Giesler updated SPARK-55416:
------------------------------------
Attachment: SPARK-55416 repro.py
> Streaming Python Data Source memory leak when offset is not updated
> -------------------------------------------------------------------
>
> Key: SPARK-55416
> URL: https://issues.apache.org/jira/browse/SPARK-55416
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Structured Streaming
> Affects Versions: 4.1.0, 4.0.0, 4.0.1, 4.0.2, 4.1.1
> Reporter: Casimir Giesler
> Priority: Major
> Attachments: SPARK-55416 repro.py
>
>
> This only manifests as a bug / memory leak if users implement offsets
> incorrectly, never increasing them.
> The commit logic in
> [datasource_internal.py|https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource_internal.py#L106-L114]
> fails to clean up cache if the offset never increases.
> In this case, {{end}} always matches the first cache entry (index 0),
> so the {{end_idx > 0}} condition never holds and the cache is never
> trimmed, while {{latestOffset}} keeps appending a new entry on every
> microbatch. The cache therefore grows without bound and eventually
> causes an OOM.
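The failure mode can be sketched in a minimal standalone model. All names below are hypothetical: this paraphrases the cache bookkeeping in datasource_internal.py rather than using the real Spark classes, but it reproduces the same `end_idx > 0` eviction condition.

```python
import json
from dataclasses import dataclass

# Hypothetical simplified model of the prefetch cache kept by the
# simple stream reader wrapper in datasource_internal.py.

@dataclass
class PrefetchedCacheEntry:
    start: dict
    end: dict

class SimpleStreamReaderCache:
    def __init__(self):
        self.cache = []

    def latest_offset(self, start, end):
        # latestOffset prefetches a batch and appends it to the cache.
        self.cache.append(PrefetchedCacheEntry(start, end))
        return end

    def commit(self, end):
        # Find the cache entry whose end offset matches the committed offset.
        end_idx = -1
        for idx, entry in enumerate(self.cache):
            if json.dumps(entry.end) == json.dumps(end):
                end_idx = idx
                break
        # Bug: when the offset never advances, the match is always at
        # index 0, so `end_idx > 0` is False and nothing is evicted.
        if end_idx > 0:
            self.cache = self.cache[end_idx:]

# A source whose offset never increases leaks one entry per microbatch:
reader = SimpleStreamReaderCache()
offset = {"position": 0}
for _ in range(5):
    reader.latest_offset(offset, offset)
    reader.commit(offset)
print(len(reader.cache))  # prints 5: commit never trimmed the cache
```

With a correctly implemented source the committed offset eventually matches a later entry, `end_idx` becomes positive, and older entries are dropped; with a frozen offset the eviction branch is simply unreachable.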
--
This message was sent by Atlassian Jira
(v8.20.10#820010)