[
https://issues.apache.org/jira/browse/SPARK-40927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mihir Kelkar updated SPARK-40927:
---------------------------------
Description:
In PySpark Structured Streaming with Kafka as both source and sink, the driver
as well as the executors seem to get OOM-killed after running for a long time
(8-12 hrs). I have not been able to pinpoint a specific cause. Prometheus
metrics show that:
# JVM off-heap memory of both the driver and the executors keeps increasing
over time (12-24 hrs observation time) [I have NOT enabled off-heap usage].
# JVM heap memory of the executors also keeps growing in slow steps.
# JVM RSS of the executors and the driver keeps increasing, but Python RSS
does not increase.
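One way to quantify "keeps increasing" from exported metric samples is a least-squares slope with a linear extrapolation to the container limit. This is only a rough sketch: the function names are hypothetical, the samples are assumed to be (timestamp in seconds, bytes) pairs already pulled out of Prometheus, and real growth may well not be linear.

```python
def growth_rate(samples):
    """Least-squares slope of (timestamp_seconds, bytes) samples, in bytes/second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den


def hours_until_limit(samples, limit_bytes):
    """Extrapolate linearly from the last sample to limit_bytes.

    Returns None when the fitted slope is not positive (no growth detected).
    """
    slope = growth_rate(samples)
    if slope <= 0:
        return None
    _, last_value = samples[-1]
    return max(limit_bytes - last_value, 0) / slope / 3600
```

Running this separately on the off-heap, heap and RSS series of the driver and each executor would show which series grows fastest and roughly when the limit would be reached.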
- For debugging, only a basic row count is being done inside
sdf.foreachBatch(). (The original business logic does some dropDuplicates,
aggregations and windowing inside foreachBatch.)
- Watermarking on a custom timestamp column is being done.
Heap dump analysis shows a large number of duplicate strings (which look like
generated code), plus a large number of byte[], char[] and UTF8String objects.
Does this point to a potential memory leak in Tungsten optimizer related code?
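For reference, the debugging setup described above (Kafka source, a watermark on a timestamp column, a row count inside foreachBatch) might look roughly like the sketch below. It is not the reporter's actual job: the function names, the `event_time` column, the 10-minute watermark and all arguments are placeholders, and the Kafka record `timestamp` column stands in for the custom timestamp column.

```python
# Hypothetical reproduction sketch; broker, topic, column and path values are
# placeholders, not taken from the original job.

def count_rows(batch_df, batch_id):
    """foreachBatch callback: only counts the rows of each micro-batch."""
    n = batch_df.count()
    print(f"batch {batch_id}: {n} rows")
    return n


def start_debug_query(spark, bootstrap_servers, topic, checkpoint_dir):
    """Start a Kafka-sourced streaming query that counts rows per micro-batch."""
    # Imported lazily so that count_rows can be exercised without Spark installed.
    from pyspark.sql.functions import col

    sdf = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
        # Stand-in for the custom timestamp column mentioned in the report.
        .withColumn("event_time", col("timestamp"))
        .withWatermark("event_time", "10 minutes")
    )
    return (
        sdf.writeStream
        .foreachBatch(count_rows)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Calling `start_debug_query(spark, ...)` needs an existing SparkSession with the spark-sql-kafka-0-10 connector on the classpath; the returned query can then be left running while the memory metrics are observed.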
was:
In PySpark Structured Streaming with Kafka as both source and sink, the driver
as well as the executors seem to get OOM-killed after running for a long time
(a few days). I have not been able to pinpoint a specific cause. Prometheus
metrics show that:
# JVM off-heap memory of both the driver and the executors keeps increasing
over time (12-24 hrs observation time) [I have NOT enabled off-heap usage].
# JVM heap memory of the executors also keeps growing in slow steps.
# JVM RSS of the executors and the driver keeps increasing, but Python RSS
does not increase.
- For debugging, only a basic row count is being done inside
sdf.foreachBatch(). (The original business logic does some dropDuplicates,
aggregations and windowing inside foreachBatch.)
- Watermarking on a custom timestamp column is being done.
Heap dump analysis shows a large number of duplicate strings (which look like
generated code), plus a large number of byte[], char[] and UTF8String objects.
Does this point to a potential memory leak in Tungsten optimizer related code?
> Memory issue with Structured streaming
> --------------------------------------
>
> Key: SPARK-40927
> URL: https://issues.apache.org/jira/browse/SPARK-40927
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.3.0, 3.2.2
> Reporter: Mihir Kelkar
> Priority: Major