I believe I have uncovered a strange interaction between pySpark, Numpy and Python which produces a memory leak. I wonder if anyone has any ideas of what the issue could be?
I have the following minimal working example (gist of the code: <https://gist.github.com/joshlk/58df790a2b0c06b820d2dc078308d970>); a rough sketch is also included at the end of this message.

When the above code is run, the memory of the executor's Python process steadily increases after each iteration, suggesting that the memory from the previous iteration isn't being released. This can lead to a job failure if the memory exceeds the executor's memory limit.

Any one of the following prevents the memory leak:

* Remove the line `data = list(rdd)`
* Insert the line `rand_data = list(rand_data.tolist())` after `rand_data = np.random.random(int(1e7))`
* Remove the line `int(e)`

Some things to take notice of:

* While the RDD data is not used in the function, the line that reads it is required to reproduce the leak: both reading in the RDD data and converting the large number of array elements to ints have to occur.
* The memory leak is likely due to the large Numpy array `rand_data` not being released.
* You have to apply the `int` operation to each element of `rand_data` to reproduce the leak.

I have experimented with gc and malloc_trim to ease memory usage, to no avail.

Versions used: EMR 5.12.1, Spark 2.2.1, Python 2.7.13, Numpy 1.14.0

Some more details can be found in a related StackOverflow post: <https://stackoverflow.com/questions/53105508/pyspark-numpy-memory-not-being-released-in-executor-map-partition-function-mem>

I would be very grateful for any ideas on what the issue could be.

Many thanks,
Josh
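P.S. For convenience, here is a rough sketch of what the gist boils down to, reconstructed from the lines quoted above; the exact structure, iteration counts, and names (e.g. `leak`) may differ from the gist itself:

```python
from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="numpy-leak-repro")


def leak(partition):
    # Consuming the partition iterator is required to reproduce the leak,
    # even though `data` is never used afterwards.
    data = list(partition)
    # Large Numpy array that appears not to be released between iterations.
    rand_data = np.random.random(int(1e7))
    # Converting each element to a Python int is also required to
    # reproduce the leak; removing this (or converting rand_data to a
    # Python list first) makes the leak go away.
    for e in rand_data:
        int(e)
    return [len(data)]


rdd = sc.parallelize(range(10), 2)

# The executor's Python process memory grows with each pass over the data.
for _ in range(100):
    rdd.mapPartitions(leak).count()
```

As described above, the leak only appears when the map-partitions function both reads the partition data and applies `int` to each element of the large array.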