Luke Richter created SPARK-30443:
------------------------------------

             Summary: "Managed memory leak detected" even with no calls to 
take() or limit()
                 Key: SPARK-30443
                 URL: https://issues.apache.org/jira/browse/SPARK-30443
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4, 2.3.2
            Reporter: Luke Richter


Our Spark code triggers a "Managed memory leak detected" warning even though 
we never call take() or limit().


According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168), 
managed memory leaks should only be caused by not reading an iterator to 
completion, e.g. by calling take() or limit().
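
For contrast, the snippet below shows the kind of pattern SPARK-14168 
describes as the expected cause. This is a hypothetical, self-contained 
sketch (the "leak-demo" app name, the self-join, and the row count are made 
up for illustration), not our code; our job never abandons an iterator this 
way.

{code:python}
# Hypothetical sketch of the known cause from SPARK-14168, NOT our code:
# take() stops consuming each partition's iterator early, so memory that the
# task allocated for the join may still be registered when the task finishes,
# producing the same "Managed memory leak detected" warning.
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.appName("leak-demo").getOrCreate()
df = spark.range(10 * 1000 * 1000)
rows = df.join(df, "id").take(5)  # iterator abandoned after 5 rows
spark.stop()
{code}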

Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed 
memory leak detected; size = 2097152 bytes, TID = 118"
The reported leak size is always 2097152 bytes (2 MiB).

I have created a minimal test program that reproduces the warning: 
{code:python}
import pyspark.sql
import pyspark.sql.functions as fx


def main():
    builder = pyspark.sql.SparkSession.builder
    builder = builder.appName("spark-jira")
    spark = builder.getOrCreate()

    # All three inputs are CSV files with a header row; let Spark infer the
    # schema.
    reader = spark.read
    reader = reader.format("csv")
    reader = reader.option("inferSchema", "true")
    reader = reader.option("header", "true")

    table_c = reader.load("c.csv")
    table_a = reader.load("a.csv")
    table_b = reader.load("b.csv")

    # Keep only the rows of a.csv whose some_code is null.
    primary_filter = fx.col("some_code").isNull()

    new_primary_data = table_a.filter(primary_filter)

    new_ids = new_primary_data.select("some_id")

    # Restrict b.csv to those ids, then left-join the result onto c.csv.
    new_data = table_b.join(new_ids, "some_id")

    new_data = new_data.select("some_id")
    result = table_c.join(new_data, "some_id", "left")

    result.repartition(1).write.json("results.json", mode="overwrite")

    spark.stop()


if __name__ == "__main__":
    main()
{code}

Our code isn't anything out of the ordinary: just some filters, selects, and 
joins.

The input data is made up of 3 CSV files, roughly 2.6 GB in total 
uncompressed. When I reduced the number of rows in the input files, the 
warning no longer appeared, so the data volume seems to matter. What is the 
best way to get test data files that reproduce the warning to you?
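
In the meantime, below is a sketch of a generator for synthetic stand-ins for 
a.csv, b.csv and c.csv. Only the column names (some_id, some_code) are taken 
from the repro above; the row count, the value ranges, and the assumption 
that random data of similar volume reproduces the warning are all guesses.

{code:python}
import csv
import random

# Assumed shapes: a.csv has (some_id, some_code) with some empty codes;
# b.csv and c.csv only need a some_id column for the joins in the repro.
# N is a guess -- scale it up until the files approach the ~2.6 GB total
# at which we see the warning.
N = 50 * 1000 * 1000


def write_csv(path, header, rows):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def main():
    # Empty some_code fields load as null under Spark's default CSV options,
    # which is what the isNull() filter in the repro selects on.
    write_csv("a.csv", ["some_id", "some_code"],
              ((i, random.choice(["X", ""])) for i in range(N)))
    write_csv("b.csv", ["some_id"],
              ((random.randrange(N),) for _ in range(N)))
    write_csv("c.csv", ["some_id"],
              ((random.randrange(N),) for _ in range(N)))


if __name__ == "__main__":
    main()
{code}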


