I believe I have uncovered a strange interaction between pySpark, Numpy and Python which produces a memory leak. I wonder if anyone has any ideas of what the issue could be?
I have the following minimal working example (gist of the code: <https://gist.github.com/joshlk/58df790a2b0c06b820d2dc078308d970>); a rough sketch is also included at the end of this message.

When the above code is run, the memory of the executor's Python process steadily increases after each iteration, suggesting that the memory from the previous iteration isn't being released. This can lead to a job failure if the memory exceeds the executor's memory limit.

Any one of the following prevents the memory leak:

* Remove the line `data = list(rdd)`
* Insert the line `rand_data = list(rand_data.tolist())` after `rand_data = np.random.random(int(1e7))`
* Remove the line `int(e)`

Some things to take notice of:

* While the RDD data is not used in the function, the line that reads it is required to reproduce the leak: both reading in the RDD data and converting the large number of array elements to ints have to occur.
* The memory leak is likely due to the large Numpy array `rand_data` not being released.
* You have to apply the `int` operation to each element of `rand_data` to reproduce the leak.

I have experimented with gc and malloc_trim to ease memory usage, to no avail.

Versions used: EMR 5.12.1, Spark 2.2.1, Python 2.7.13, Numpy 1.14.0

Some more details can be found in a related StackOverflow post: <https://stackoverflow.com/questions/53105508/pyspark-numpy-memory-not-being-released-in-executor-map-partition-function-mem>

I would be very grateful for any ideas on what the issue could be.

Many thanks,
Josh
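P.S. For convenience, here is a rough sketch of what the gist boils down to, reconstructed from the lines quoted above; the exact structure, iteration counts, and names (e.g. `leak`) may differ from the gist itself:

```python
from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="numpy-leak-repro")


def leak(partition):
    # Consuming the partition iterator is required to reproduce the leak,
    # even though `data` is never used afterwards.
    data = list(partition)
    # Large Numpy array that appears not to be released between iterations.
    rand_data = np.random.random(int(1e7))
    # Converting each element to a Python int is also required to
    # reproduce the leak; removing this (or converting rand_data to a
    # Python list first) makes the leak go away.
    for e in rand_data:
        int(e)
    return [len(data)]


rdd = sc.parallelize(range(10), 2)

# The executor's Python process memory grows with each pass over the data.
for _ in range(100):
    rdd.mapPartitions(leak).count()
```

As described above, the leak only appears when the map-partitions function both reads the partition data and applies `int` to each element of the large array.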