taosaildrone commented on issue #23762: [SPARK-21492][SQL][WIP] Memory leak in SortMergeJoin
URL: https://github.com/apache/spark/pull/23762#issuecomment-463303175
 
 
There's a task completion listener, but (1) we could hit an OOME before the task completes, or (2) we could hurt performance by holding unnecessary memory and triggering a bunch of unneeded spills.
The buffer holds onto memory that is no longer needed, shrinking the available memory to the point that the application fails when it tries to allocate other memory it actually does need. I would classify that as a memory leak.
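
To make the two lifetimes concrete, here is a minimal Scala sketch. The `BufferedMatches` class, its `rowBuffer` field, and `onExhausted` are illustrative stand-ins, not Spark's actual internals; `TaskContext.addTaskCompletionListener` is the real Spark API (the `[Unit]` type-argument form is the 2.4 signature):

```scala
import org.apache.spark.TaskContext

// Hypothetical holder for the rows buffered for the current join key.
class BufferedMatches(ctx: TaskContext) {
  // Stands in for Spark's actual buffered-row structure.
  private var rowBuffer: Array[Array[Byte]] = Array.empty

  // Current pattern: memory is released only when the *task* finishes,
  // so the buffer stays live long after the last match was produced.
  ctx.addTaskCompletionListener[Unit] { _ => free() }

  // Proposed pattern: release eagerly, as soon as the consuming
  // iterator is exhausted, instead of waiting for the listener.
  def onExhausted(): Unit = free()

  private def free(): Unit = {
    rowBuffer = Array.empty // drop the reference so the memory can be reclaimed
  }
}
```

The point of the eager path is that the buffered rows are dead as soon as the consuming iterator moves past them; keeping them alive until the completion listener fires is what shows up as the "leak".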
   
Something as simple as joining a DF of 1,000 rows with a single-row DF (sharing one overlapping id) will cause an OOME (both locally and on a cluster) and fail the job:
   
An example, run locally with `./bin/pyspark --master local[10]` (tested on 2.4.0 and 3.0.0-master):
```python
from pyspark.sql.functions import rand, col

# Force a sort-merge join: prefer SMJ and disable broadcast joins entirely.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# 1,000-row side: ids 1 through 1000.
r1 = spark.range(1, 1001).select(col("id").alias("timestamp1"))
r1 = r1.withColumn("value", rand())

# Single-row side: only id 1000, which overlaps with r1.
r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2"))
r2 = r2.withColumn("value2", rand())

joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner")
# coalesce(1) funnels everything through a single task.
joined = joined.coalesce(1)
joined.explain()  # the physical plan should show SortMergeJoin
joined.show()
```
