Mihir Kelkar created SPARK-40927:
------------------------------------

             Summary: Memory issue with Structured streaming
                 Key: SPARK-40927
                 URL: https://issues.apache.org/jira/browse/SPARK-40927
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.2.2, 3.3.0
            Reporter: Mihir Kelkar


In PySpark Structured Streaming with Kafka as both source and sink, the driver 
as well as the executors get OOM-killed after running for a long time (a few 
days). I have not been able to pinpoint a specific cause. Prometheus metrics 
show that:
 # JVM off-heap memory of both the driver and the executors keeps increasing 
over time (12-24 hr observation window), even though I have NOT enabled 
off-heap usage.
 # JVM heap memory of the executors also keeps creeping up in small steps.
 # JVM RSS of the executors and the driver keeps increasing, but Python RSS 
does not.

- To debug, only a basic row count is being done inside 
sdf.writeStream.foreachBatch(). (The original business logic does some 
dropDuplicates, aggregations, and windowing within foreachBatch.)

- Watermarking on a custom timestamp column is being done.
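
A minimal sketch of the debug setup described above, for anyone trying to reproduce it. This is not the actual job: the function names, the `event_time` column, the 10-minute watermark delay, and all parameters are hypothetical placeholders. The Kafka record timestamp stands in for the custom timestamp column, and the Kafka sink write is left out because the debug version only counts rows. It assumes the spark-sql-kafka connector is on the classpath.

```python
def count_rows(batch_df, batch_id):
    """Debug-only foreachBatch handler: count the rows of one micro-batch."""
    n = batch_df.count()
    print(f"batch {batch_id}: {n} rows")
    return n  # Spark ignores the return value; it is returned for testing only.


def start_query(spark, bootstrap_servers, source_topic, checkpoint_dir):
    """Start a streaming query that reads from Kafka and counts each batch.

    All arguments are placeholders to be supplied by the caller.
    """
    # Imported here so that count_rows can be exercised without a Spark install.
    from pyspark.sql.functions import col

    sdf = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", source_topic)
        .load()
        # Assumption: the Kafka record timestamp replaces the custom column.
        .withColumn("event_time", col("timestamp"))
        # Assumption: the real watermark delay is not given in this report.
        .withWatermark("event_time", "10 minutes")
    )

    return (
        sdf.writeStream
        .foreachBatch(count_rows)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Calling `start_query(spark, ...)` with a live SparkSession and a reachable broker returns a StreamingQuery; `count_rows` can be checked on its own with any object that exposes `count()`.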

 

Heap dump analysis shows a large number of duplicate strings (which look like 
generated code), as well as a large number of byte[], char[] and UTF8String 
objects. Does this point to a potential memory leak in Tungsten optimizer 
related code?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

