[ 
https://issues.apache.org/jira/browse/SPARK-55416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-55416:
-----------------------------
    Summary: Streaming Python Data Source memory leak when end offset is not 
updated (was: Streaming Python Data Source memory leak when is offset not 
updated)

> Streaming Python Data Source memory leak when end offset is not updated
> -----------------------------------------------------------------------
>
>                 Key: SPARK-55416
>                 URL: https://issues.apache.org/jira/browse/SPARK-55416
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Structured Streaming
>    Affects Versions: 4.1.0, 4.0.0, 4.0.1, 4.0.2, 4.1.1
>            Reporter: Casimir Giesler
>            Priority: Major
>         Attachments: SPARK-55416 repro.py
>
>
> This becomes a memory leak only when a user's data source implements 
> offsets incorrectly, so that the end offset never increases.
> The commit logic in 
> [datasource_internal.py|https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource_internal.py#L106-L114]
>  fails to clean up the cache if the offset never increases.
> In this case, the committed end offset always matches the first cache 
> entry, so the {{end_idx > 0}} condition fails and nothing is evicted, while 
> the {{latestOffset}} function keeps appending to the cache.
> The cache therefore grows without bound and eventually causes an OOM.
>  
> Edit: I attached a repro. In the repro it is obvious that the offset is 
> never updated, but in practice a coding error can just as easily make the 
> {{read()}} function return data without advancing the offset.
> Maybe we can consider cleaning up the cache more aggressively, or throwing 
> an error if the same offset is committed twice?
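
The mechanism described above can be reproduced without Spark using a small
toy model. This is only a sketch under stated assumptions, not the code in
datasource_internal.py: the class ToyPrefetchCache, the reader functions, and
the strict flag are all hypothetical names. The flag illustrates the "error
if the same offset is committed twice" idea from the report; whether such a
check is appropriate for a source that is legitimately idle is not something
the report settles.

```python
import json


class ToyPrefetchCache:
    """Hypothetical, simplified model of the prefetch cache in the report."""

    def __init__(self, read, strict=False):
        self.read = read              # assumed shape: read(start) -> (rows, end)
        self.current = {"offset": 0}  # assumed initial offset
        self.cache = []               # entries are (start, end, rows)
        self.strict = strict          # hypothetical guard, not a Spark option
        self.last_committed = None

    def latest_offset(self):
        # Every call prefetches a batch and appends it to the cache.
        rows, end = self.read(self.current)
        self.cache.append((self.current, end, rows))
        self.current = end
        return end

    def commit(self, end):
        key = json.dumps(end)
        if self.strict and key == self.last_committed:
            raise ValueError(f"offset {key} was committed twice")
        self.last_committed = key

        # Find the first cache entry whose end offset equals the commit.
        end_idx = -1
        for idx, (_, entry_end, _) in enumerate(self.cache):
            if json.dumps(entry_end) == key:
                end_idx = idx
                break
        # Eviction only happens when the match is not the first entry.
        if end_idx > 0:
            self.cache = self.cache[end_idx:]


def stuck_read(start):
    # Buggy reader: returns data but the end offset equals the start offset.
    return [("row",)], start


def advancing_read(start):
    # Correct reader: the end offset moves forward on every read.
    return [("row",)], {"offset": start["offset"] + 1}


def cache_size_after(read, batches, strict=False):
    model = ToyPrefetchCache(read, strict=strict)
    for _ in range(batches):
        model.commit(model.latest_offset())
    return len(model.cache)
```

In this model, a reader that advances its offset keeps the cache at a single
entry, while stuck_read adds one entry per batch because the match is always
at index 0. With strict=True the second commit of the same offset raises
instead of leaking silently.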



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
