Luke Richter created SPARK-30443:
------------------------------------

             Summary: "Managed memory leak detected" even with no calls to 
take() or limit()
                 Key: SPARK-30443
                 URL: https://issues.apache.org/jira/browse/SPARK-30443
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4, 2.3.2
            Reporter: Luke Richter


Our Spark code triggers a "Managed memory leak detected" warning even though 
we never call take() or limit().


According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168), 
managed memory leaks should only be caused by not reading an iterator to 
completion, e.g. by calling take() or limit().
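
For contrast, the snippet below shows the kind of pattern SPARK-14168 
describes as the expected cause. This is a hypothetical, self-contained 
sketch (the "leak-demo" app name, the self-join, and the row count are made 
up for illustration), not our code; our job never abandons an iterator this 
way.

{code:python}
# Hypothetical sketch of the known cause from SPARK-14168, NOT our code:
# take() stops consuming each partition's iterator early, so memory that the
# task allocated for the join may still be registered when the task finishes,
# producing the same "Managed memory leak detected" warning.
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.appName("leak-demo").getOrCreate()
df = spark.range(10 * 1000 * 1000)
rows = df.join(df, "id").take(5)  # iterator abandoned after 5 rows
spark.stop()
{code}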

Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed 
memory leak detected; size = 2097152 bytes, TID = 118"
The reported leak size is always 2097152 bytes (2 MiB).

I have created a minimal test program that reproduces the warning: 
{code:python}
import pyspark.sql
import pyspark.sql.functions as fx


def main():
    builder = pyspark.sql.SparkSession.builder
    builder = builder.appName("spark-jira")
    spark = builder.getOrCreate()

    # All three inputs are CSV files with a header row; let Spark infer the
    # schema.
    reader = spark.read
    reader = reader.format("csv")
    reader = reader.option("inferSchema", "true")
    reader = reader.option("header", "true")

    table_c = reader.load("c.csv")
    table_a = reader.load("a.csv")
    table_b = reader.load("b.csv")

    # Keep only the rows of a.csv whose some_code is null.
    primary_filter = fx.col("some_code").isNull()

    new_primary_data = table_a.filter(primary_filter)

    new_ids = new_primary_data.select("some_id")

    # Restrict b.csv to those ids, then left-join the result onto c.csv.
    new_data = table_b.join(new_ids, "some_id")

    new_data = new_data.select("some_id")
    result = table_c.join(new_data, "some_id", "left")

    result.repartition(1).write.json("results.json", mode="overwrite")

    spark.stop()


if __name__ == "__main__":
    main()
{code}

Our code isn't anything out of the ordinary: just some filters, selects, and 
joins.

The input data is made up of 3 CSV files, roughly 2.6 GB in total 
uncompressed. When I reduced the number of rows in the input files, the 
warning no longer appeared, so the data volume seems to matter. What is the 
best way to get test data files that reproduce the warning to you?
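
In the meantime, below is a sketch of a generator for synthetic stand-ins for 
a.csv, b.csv and c.csv. Only the column names (some_id, some_code) are taken 
from the repro above; the row count, the value ranges, and the assumption 
that random data of similar volume reproduces the warning are all guesses.

{code:python}
import csv
import random

# Assumed shapes: a.csv has (some_id, some_code) with some empty codes;
# b.csv and c.csv only need a some_id column for the joins in the repro.
# N is a guess -- scale it up until the files approach the ~2.6 GB total
# at which we see the warning.
N = 50 * 1000 * 1000


def write_csv(path, header, rows):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def main():
    # Empty some_code fields load as null under Spark's default CSV options,
    # which is what the isNull() filter in the repro selects on.
    write_csv("a.csv", ["some_id", "some_code"],
              ((i, random.choice(["X", ""])) for i in range(N)))
    write_csv("b.csv", ["some_id"],
              ((random.randrange(N),) for _ in range(N)))
    write_csv("c.csv", ["some_id"],
              ((random.randrange(N),) for _ in range(N)))


if __name__ == "__main__":
    main()
{code}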


