taosaildrone commented on issue #23762: [SPARK-21492][SQL][WIP] Memory leak in SortMergeJoin
URL: https://github.com/apache/spark/pull/23762#issuecomment-463303175

There is a task completion listener, but 1) we could hit an OOME before the task completes, and 2) holding unnecessary memory hurts performance by causing a bunch of unneeded spills. The buffered side's memory is retained past the point where it is needed, reducing available memory until the application fails when it tries to allocate other memory it actually needs. I would classify that as a memory leak. A toy sketch of the two cleanup strategies follows the repro below.

Something as simple as joining a DF with 1000 rows to another tiny DF (with one overlapping id) will cause an OOME, both locally and on a cluster, and cause the job to fail.

An example locally (tested on 2.4.0 and 3.0.0-master):

```
./bin/pyspark --master local[10]
```

```python
from pyspark.sql.functions import rand, col

# Force a sort-merge join: prefer it explicitly and disable broadcast joins.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

r1 = spark.range(1, 1001).select(col("id").alias("timestamp1"))
r1 = r1.withColumn("value", rand())
r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2"))
r2 = r2.withColumn("value2", rand())

# Inner join on the single overlapping id, then force everything into
# one partition.
joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner")
joined = joined.coalesce(1)
joined.explain()
joined.show()
```
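To make the cleanup-timing argument concrete, here is a minimal, framework-free Python sketch of the two strategies: freeing the buffered side only from an end-of-task callback (what the task completion listener gives us today) versus freeing it as soon as the join has consumed it. The names (`BufferedSide`, `task_completion_listeners`, the join functions) are illustrative stand-ins, not Spark APIs.

```python
# Illustrative sketch only: BufferedSide and the listener list are made-up
# names standing in for SortMergeJoin's match buffer and Spark's task
# completion listener; they are not Spark APIs.

class BufferedSide:
    """Holds the buffered (smaller) side of the join in memory."""

    def __init__(self, rows):
        self.rows = list(rows)  # memory held while the buffer is alive

    def free(self):
        self.rows = None  # drop the reference so the memory can be reclaimed


def join_with_deferred_cleanup(stream, buffered, task_completion_listeners):
    """Cleanup registered for task completion: the buffer outlives its last use."""
    task_completion_listeners.append(buffered.free)
    for row in stream:
        if buffered.rows:  # match against the buffered side
            yield row
    # buffered.rows is STILL allocated here; it is only freed when the task
    # finishes and runs its listeners, long after the join needed it.


def join_with_eager_cleanup(stream, buffered):
    """Cleanup as soon as the join is done: memory is returned promptly."""
    try:
        for row in stream:
            if buffered.rows:
                yield row
    finally:
        buffered.free()  # freed the moment the join stops consuming it
```

In the deferred variant, every task pins its buffer until task end, so later allocations in the same task compete with memory that is no longer doing any work; that is the window in which the OOME in the repro above shows up.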
