Thank you all for your feedback.

@Chris: Yes, one of the Amazon users (Calum Leslie) contributed the dispose pattern, removing the freeing of native handles in finalizers and adding a log statement instead. This was done because calling free in finalizers was segfaulting the application at random points, which was very hard to reproduce and debug. The dispose pattern worked for some cases but made code cumbersome from a readability standpoint, since you have to keep track of every object that is created (imagine slice/reshape: instead of writing expressions, you now create unnecessary variables and call dispose on each of them). As the first graph in the design shows, even with dispose carefully called on most objects there was a constant memory leak, and diagnosing the leaks wasn't straightforward. Note that finalizers run on a separate thread, later than the point at which the object was found unreachable.
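(To illustrate the readability cost: below is a hypothetical Java sketch of the manual dispose pattern. NativeArray and its methods are illustrative stand-ins, not MXNet-Scala's real API; the point is that every intermediate result of slice/reshape must be given a name and explicitly closed.)

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a JVM wrapper around a C++-heap object; NOT MXNet's real API.
class NativeArray implements AutoCloseable {
    // Counts outstanding native allocations (stand-in for C++ heap handles).
    static final AtomicInteger live = new AtomicInteger();

    NativeArray() { live.incrementAndGet(); }

    // Each op allocates a new native handle that must later be freed.
    NativeArray reshape() { return new NativeArray(); }
    NativeArray slice()   { return new NativeArray(); }

    @Override public void close() { live.decrementAndGet(); }

    public static void main(String[] args) {
        // Without dispose you could simply write: a.reshape().slice()
        // With it, every intermediate needs a variable and an explicit close:
        try (NativeArray a = new NativeArray();
             NativeArray r = a.reshape();
             NativeArray s = r.slice()) {
            // use s ...
        }
        System.out.println("live=" + NativeArray.live.get());
    }
}
```

Forgetting even one of those intermediate variables leaks a native handle, which is exactly the pattern behind the leaks in the graph.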
@Timur, thanks for the feedback.

1) No, the goal here is to manage the native memory that is allocated for various operations. In MXNet-Scala most objects live on the C++ heap and the Scala objects are wrappers around them; when the MXNet engine runs operations it expects the objects to be accessible on the C++ heap.

2) Agreed that MNIST is not representative; the goal was to understand and show that the existing code has hard-to-debug memory leaks (even for MNIST). I was aiming to test my prototype code and see whether my changes make a difference. Yizhi suggested I run tests against a RESNET50 model, which I will do as part of my implementation. I think this is a standard benchmark model that is widely used. Also note that most of the MXNet-Scala use cases we have seen are for inference.

3) No, we haven't created a branch for the Java API work. Please look at this design and kindly leave your feedback: https://cwiki.apache.org/confluence/display/MXNET/MXNet+Java+Inference+API

4) Calling System.gc() will be configurable (including not calling GC at all). One piece of feedback I got from a user is that calling System.gc on the user's behalf is intrusive, which I think is also the point you are making.

5) Understood and agreed. I see calling GC as only one part of the solution, and a configurable option at that. For GPUs, training, and other memory-intensive applications, ResourceScope would be a very good option. Another alternative is to create ByteBuffers in Java and map the C++ pointers into the JVM heap by tapping into the native malloc/free; that way the JVM is aware of all the memory that is allocated and can free it appropriately whenever the objects become unreachable. I have to note that this still does not solve the problem of memory accumulating until GC kicks in. This approach is also very involved and might not be tenable.

@Marco, thanks for your comments.

1) The JVM kicks off GC when it feels pressure on the JVM heap, not on CPU RAM.
Objects on GPU are not special; they are still off-heap (outside the JVM heap), so this approach works for them too. Look at the graph in the doc that shows the GAN example running on GPUs.

2) I am not looking to rewrite memory allocation in MXNet; that will still be handled by the C++ backend. The goal here is to free native memory (reduce the shared-pointer count) when JVM objects go out of scope (become unreachable).

@Carin, yes, hopefully this will alleviate the memory-management headache for our users.

Hope that makes sense.

Thanks, Naveen

On Wed, Sep 12, 2018 at 6:06 AM, Carin Meier <carinme...@gmail.com> wrote:

> Naveen,
>
> Thanks for putting together the detailed document and kickstarting this
> effort. It will benefit all the MXNet JVM users and will help solve a
> current pain point for them.
>
> - Carin
>
> On Tue, Sep 11, 2018 at 5:37 PM Naveen Swamy <mnnav...@gmail.com> wrote:
>
> > Hi All,
> >
> > I am working on Off-Heap Memory Management and have written a
> > proposal here based on my prototype and the research I did.
> >
> > Please review the doc and provide your feedback.
> >
> > https://cwiki.apache.org/confluence/display/MXNET/JVM+Memory+Management
> >
> > I had offline discussions with a few people I work with and added their
> > feedback to the doc as well.
> >
> > Thanks, Naveen
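P.S. For anyone unfamiliar with the ResourceScope idea mentioned above, here is a rough Java sketch of the concept only. The names and API are my own illustration, not the proposed implementation: a scope collects every native resource created inside it and frees them all when the scope closes, so intermediates need no individual dispose calls.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of a scope that auto-frees native handles created
// inside it; this is NOT the actual MXNet ResourceScope API.
class ResourceScope implements AutoCloseable {
    private static final Deque<ResourceScope> current = new ArrayDeque<>();
    private final Deque<Runnable> toFree = new ArrayDeque<>();

    ResourceScope() { current.push(this); }

    // Resources register their free routine with the innermost open scope.
    static void register(Runnable free) {
        if (!current.isEmpty()) current.peek().toFree.push(free);
    }

    @Override public void close() {
        current.pop();
        while (!toFree.isEmpty()) toFree.pop().run(); // free in reverse order
    }
}

// Stand-in for a wrapper around a native (C++-heap) handle.
class Handle {
    static int liveHandles = 0;
    Handle() { liveHandles++; ResourceScope.register(() -> liveHandles--); }
    Handle slice() { return new Handle(); }
}

public class ScopeDemo {
    public static void main(String[] args) {
        try (ResourceScope scope = new ResourceScope()) {
            Handle h = new Handle().slice().slice(); // intermediates need no names
        } // all three handles are freed here, when the scope closes
        System.out.println("liveHandles=" + Handle.liveHandles);
    }
}
```

A real implementation would make the scope stack thread-local and hand resources that must outlive the scope off to a parent scope; this sketch only shows why expression-style code (chained slice/reshape) stays readable under a scope, unlike under per-object dispose.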