[ https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068290#comment-17068290 ]
Xiaoju Wu edited comment on SPARK-30443 at 3/27/20, 5:50 AM:
-------------------------------------------------------------
I also see this kind of warning in the logs. SPARK-21492 may be related to this warning.
Does your code base contain that fix?
I'm also afraid there could be other memory consumers that do not release memory
themselves, but instead rely on the task releasing all memory associated with the
taskId at the end of the task.
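
To make that second point concrete, here is a minimal Python model of the pattern. It is an illustrative sketch only, not Spark's actual implementation (the real bookkeeping lives in TaskMemoryManager and the task-end cleanup in Executor on the JVM side; the class and method names below are invented): consumers acquire execution memory tagged with a taskId, and whatever they fail to free is reclaimed in one sweep when the task completes, which is when the executor prints the "Managed memory leak detected" warning.
{code:python}
# Illustrative model only (assumption: a simplified stand-in for Spark's
# per-task memory bookkeeping; all names here are invented for this sketch).
import logging

logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")
log = logging.getLogger("Executor")


class TaskMemoryModel:
    """Tracks bytes acquired by memory consumers for a single task."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.held = {}  # consumer name -> bytes currently held

    def acquire(self, consumer, num_bytes):
        self.held[consumer] = self.held.get(consumer, 0) + num_bytes

    def release(self, consumer, num_bytes):
        self.held[consumer] -= num_bytes

    def clean_up_all_allocated_memory(self):
        """Task-end sweep: free everything still held and report the total."""
        leaked = sum(self.held.values())
        self.held.clear()
        return leaked


tmm = TaskMemoryModel(task_id=118)
tmm.acquire("well-behaved consumer", 4 * 1024 * 1024)
tmm.release("well-behaved consumer", 4 * 1024 * 1024)  # frees its own memory
tmm.acquire("lazy consumer", 2 * 1024 * 1024)  # relies on task-end cleanup

leaked = tmm.clean_up_all_allocated_memory()
if leaked > 0:
    # Same shape as the warning quoted in this issue.
    log.warning("Managed memory leak detected; size = %d bytes, TID = %d",
                leaked, tmm.task_id)
{code}
If that is what is happening here, the memory is still returned when the task finishes; the warning only reports that a consumer did not free it itself.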
> "Managed memory leak detected" even with no calls to take() or limit()
> ----------------------------------------------------------------------
>
> Key: SPARK-30443
> URL: https://issues.apache.org/jira/browse/SPARK-30443
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2, 2.4.4, 3.0.0
> Reporter: Luke Richter
> Priority: Major
> Attachments: a.csv.zip, b.csv.zip, c.csv.zip
>
>
> Our Spark code is causing a "Managed memory leak detected" warning to appear,
> even though we are not calling take() or limit().
> According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168),
> managed memory leaks should only be caused by not reading an iterator to
> completion, e.g. via take() or limit().
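> For context, that failure mode can be illustrated with plain Python generators
> (an analogy only, not Spark code): a resource acquired inside a generator is
> released only when the iterator is exhausted or closed, so a take()-style
> partial read leaves it held.
> {code:python}
> # Plain-Python analogy (assumption: illustrative only, not Spark internals).
> # The "memory" is acquired inside the generator and released in its finally
> # block, which runs only when the iterator is exhausted or closed.
> def rows():
>     held_bytes = 2 * 1024 * 1024  # pretend a 2 MiB buffer was acquired
>     try:
>         for i in range(1000):
>             yield i
>     finally:
>         held_bytes = 0  # release; never runs while the iterator is abandoned
>
> it = rows()
> first_five = [next(it) for _ in range(5)]  # take(5): iterator left unfinished
> # Until it.close() (or garbage collection) runs the finally block, the
> # 2 MiB stays "held": the analogue of a managed memory leak.
> it.close()  # reading to completion, or closing, releases it
> {code}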
> Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed
> memory leak detected; size = 2097152 bytes, TID = 118"
> The size of the reported leak is always 2097152 bytes, i.e. exactly 2 MiB.
> I have created a minimal test program that reproduces the warning:
> {code:python}
> import pyspark.sql
> import pyspark.sql.functions as fx
>
> def main():
>     builder = pyspark.sql.SparkSession.builder
>     builder = builder.appName("spark-jira")
>     spark = builder.getOrCreate()
>
>     reader = spark.read
>     reader = reader.format("csv")
>     reader = reader.option("inferSchema", "true")
>     reader = reader.option("header", "true")
>
>     table_c = reader.load("c.csv")
>     table_a = reader.load("a.csv")
>     table_b = reader.load("b.csv")
>
>     primary_filter = fx.col("some_code").isNull()
>     new_primary_data = table_a.filter(primary_filter)
>     new_ids = new_primary_data.select("some_id")
>
>     new_data = table_b.join(new_ids, "some_id")
>     new_data = new_data.select("some_id")
>
>     result = table_c.join(new_data, "some_id", "left")
>     result.repartition(1).write.json("results.json", mode="overwrite")
>     spark.stop()
>
> if __name__ == "__main__":
>     main()
> {code}
> Our code isn't anything out of the ordinary, just some filters, selects and
> joins.
> The input data is made up of 3 CSV files, roughly 2.6 GB in total
> uncompressed. When I attempted to reduce the number of rows in the CSV input
> files, the warning no longer appeared. After compressing the files I was able
> to attach them below.