Luke Richter created SPARK-30443:
------------------------------------
Summary: "Managed memory leak detected" even with no calls to
take() or limit()
Key: SPARK-30443
URL: https://issues.apache.org/jira/browse/SPARK-30443
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.4, 2.3.2
Reporter: Luke Richter
Our Spark code triggers a "Managed memory leak detected" warning even though
we never call take() or limit().
According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168),
managed memory leaks should only occur when an iterator is not read to
completion, as happens with take() or limit().
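As a toy illustration of the mechanism SPARK-14168 describes, here is a plain-Python sketch. It is an analogy only, not Spark's implementation: Ledger, rows_with_buffer and leaked_after are made-up names, and the 2MB buffer size is simply borrowed from our warning.

```python
import itertools


class Ledger:
    """Toy stand-in for a task's memory accounting (hypothetical, not a Spark API)."""

    def __init__(self):
        self.allocated = 0


def rows_with_buffer(ledger, rows, buffer_bytes=2 * 1024 * 1024):
    """Yield rows while holding a buffer that is released only after the last row."""
    ledger.allocated += buffer_bytes
    for row in rows:
        yield row
    # Reached only if the consumer drains the iterator completely.
    ledger.allocated -= buffer_bytes


def leaked_after(consume):
    """Run one consumer over a fresh iterator and report bytes still allocated."""
    ledger = Ledger()
    consume(rows_with_buffer(ledger, range(10)))
    return ledger.allocated


# Reading to completion releases the buffer.
full_read_leak = leaked_after(list)

# Stopping early, like take(3), leaves the buffer allocated.
early_stop_leak = leaked_after(lambda it: list(itertools.islice(it, 3)))

print(full_read_leak, early_stop_leak)
```

Our job never stops early in this way, which is why the warning is unexpected.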
Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed
memory leak detected; size = 2097152 bytes, TID = 118"
The reported leak size is always the same: 2097152 bytes (2MB).
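For anyone scanning executor logs for the same warning, a minimal sketch of pulling the size and TID out of the message. The regular expression is written against the warning text quoted above only; LEAK_RE and parse_leak are illustrative names, and other Spark versions may word the message differently.

```python
import re

# The warning quoted above, joined onto a single line.
LINE = (
    "2020-01-06 14:54:59 WARN Executor:66 - Managed memory leak detected; "
    "size = 2097152 bytes, TID = 118"
)

# Pattern assumed from the one warning line we have seen.
LEAK_RE = re.compile(
    r"Managed memory leak detected; size = (\d+) bytes, TID = (\d+)"
)


def parse_leak(line):
    """Return (size_in_bytes, task_id) for a leak warning, or None otherwise."""
    match = LEAK_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


size_bytes, task_id = parse_leak(LINE)
size_mib = size_bytes / (1024 * 1024)
print(f"TID {task_id}: {size_bytes} bytes = {size_mib} MiB")
```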
I have created a minimal test program that reproduces the warning:
{code:python}
import pyspark.sql
import pyspark.sql.functions as fx


def main():
    builder = pyspark.sql.SparkSession.builder
    builder = builder.appName("spark-jira")
    spark = builder.getOrCreate()

    # One CSV reader, reused for all three input files.
    reader = spark.read
    reader = reader.format("csv")
    reader = reader.option("inferSchema", "true")
    reader = reader.option("header", "true")

    table_c = reader.load("c.csv")
    table_a = reader.load("a.csv")
    table_b = reader.load("b.csv")

    # Rows of table_a with no some_code; keep only their some_id.
    primary_filter = fx.col("some_code").isNull()
    new_primary_data = table_a.filter(primary_filter)
    new_ids = new_primary_data.select("some_id")

    # Inner join table_b to those ids, then left join the result onto table_c.
    new_data = table_b.join(new_ids, "some_id")
    new_data = new_data.select("some_id")
    result = table_c.join(new_data, "some_id", "left")

    # Write the result as a single JSON output file.
    result.repartition(1).write.json("results.json", mode="overwrite")

    spark.stop()


if __name__ == "__main__":
    main()
{code}
Our code is nothing out of the ordinary: just a few filters, selects and
joins.
The input data consists of 3 CSV files, roughly 2.6GB in total uncompressed.
I tried reducing the number of rows in the CSV input files, but the warning
then stopped appearing. What is the best way to get the test data files that
reproduce the warning to you?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)