AndrewZhaoLuo opened a new pull request, #13877: URL: https://github.com/apache/tvm/pull/13877
Right now there is a bad pattern in the VM executable: when loading weights, we load the serialized representation into memory and then deserialize from that in-memory store without progressively freeing memory. If our weights take up ~5GB, the serialized representation in memory takes up ~5GB and the deserialized representation takes ~5GB too. This means peak memory use when executing with the VM is about 2x the size of the model weights. This is bad, especially with some of the larger models out there today.

This PR fixes this by streaming from disk and relying on the standard C file interface to buffer reads for performant results.

Some before-and-after graphs from loading and benchmarking a model with ~5GB of weights:

Before: (image not included)

After: (image not included)

This is a draft since:

- I've only tested loading weights, but we could see similar savings in other similar streams.
- We need to make a decision on the DMLC stream interface. The main issue is that a lot of existing code depends on the DMLC stream interface, but DMLC is only available to us as a header-only library, so in the current state we only have access to in-memory streams. I have gotten around this by implementing a simple file stream class.
  - We need to decide the best way forward. The approach in this PR is simple, though it technically duplicates some code from the DMLC core lib.
  - Alternatives are including DMLC as a dependency, adding the functionality to DMLC and pulling in those changes, or getting rid of the DMLC stream interface entirely.
  - The approach in this PR is the simplest, which is why I am using it for the draft.
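To make the idea concrete, here is a minimal sketch of a file-backed stream that leans on the C stdio buffer, plus a loader that reads one blob at a time directly into its final storage. It is an illustration under stated assumptions, not the code in this PR: `ByteStream` is a hypothetical stand-in for a DMLC-style stream interface, and `SimpleFileStream`, `SaveBlobs`, `LoadBlobs`, and the length-prefixed blob layout are all invented for the example and do not match TVM's actual serialization format.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DMLC-style stream interface (not the real DMLC API).
class ByteStream {
 public:
  virtual ~ByteStream() = default;
  virtual std::size_t Read(void* ptr, std::size_t size) = 0;
  virtual std::size_t Write(const void* ptr, std::size_t size) = 0;
};

// File-backed stream; buffering is left to the C stdio layer.
class SimpleFileStream : public ByteStream {
 public:
  SimpleFileStream(const std::string& path, const char* mode)
      : fp_(std::fopen(path.c_str(), mode)) {
    if (fp_ == nullptr) throw std::runtime_error("cannot open " + path);
  }
  ~SimpleFileStream() override { std::fclose(fp_); }
  SimpleFileStream(const SimpleFileStream&) = delete;
  SimpleFileStream& operator=(const SimpleFileStream&) = delete;

  std::size_t Read(void* ptr, std::size_t size) override {
    return std::fread(ptr, 1, size, fp_);
  }
  std::size_t Write(const void* ptr, std::size_t size) override {
    return std::fwrite(ptr, 1, size, fp_);
  }

 private:
  std::FILE* fp_;
};

// Toy format (an assumption for this sketch): blob count, then for each blob
// an element count followed by the raw float data.
void SaveBlobs(ByteStream* strm, const std::vector<std::vector<float>>& blobs) {
  std::uint64_t count = blobs.size();
  strm->Write(&count, sizeof(count));
  for (const auto& blob : blobs) {
    std::uint64_t n = blob.size();
    strm->Write(&n, sizeof(n));
    strm->Write(blob.data(), n * sizeof(float));
  }
}

// Reads each blob straight from the stream into its destination buffer, so no
// second full copy of the serialized bytes is ever held in memory.
std::vector<std::vector<float>> LoadBlobs(ByteStream* strm) {
  std::uint64_t count = 0;
  if (strm->Read(&count, sizeof(count)) != sizeof(count)) {
    throw std::runtime_error("truncated header");
  }
  std::vector<std::vector<float>> blobs;
  for (std::uint64_t i = 0; i < count; ++i) {
    std::uint64_t n = 0;
    if (strm->Read(&n, sizeof(n)) != sizeof(n)) {
      throw std::runtime_error("truncated blob header");
    }
    std::vector<float> data(n);
    const std::size_t bytes = n * sizeof(float);
    if (strm->Read(data.data(), bytes) != bytes) {
      throw std::runtime_error("truncated blob data");
    }
    blobs.push_back(std::move(data));
  }
  return blobs;
}
```

In this sketch a caller would construct `SimpleFileStream` with mode `"wb"`, call `SaveBlobs`, let the stream go out of scope so the file is flushed and closed, and then reopen it with `"rb"` for `LoadBlobs`. Peak memory during loading is roughly the deserialized size plus the stdio buffer, rather than twice the serialized size.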
