u99127 commented on pull request #7048: URL: https://github.com/apache/tvm/pull/7048#issuecomment-742526188
> > Thanks @ANSHUMAN87. Overall looks pretty good, but I am a little confused at exactly what is happening in this pass. Is it converting matrices from sparse to dense at compile time?
>
> Thanks @tkonolige for the review! The Densify op does not currently need to be present in the TVM runtime, because its purpose is to convert sparse (constant) weights to dense form, so I have optimized it out. However, should it be required in the TVM runtime in the future, we can always implement it with the sparse_to_dense op, using the indices produced by the utility prepare_dense_matrix_from_sparse().

Sorry about the delay in reviewing. I think there are a couple of design points here that we should consider so that we can take a well-informed decision on this direction.

1. The advantage of keeping the sparse weight representation all the way through is that it keeps the weight size low in the final generated binary, which helps with static size/footprint at the expense of some runtime cost (obviously). For use cases that go through BYOC or other flows with weight compression techniques, this should work out reasonably well; however, where we fall back to traditional TVM compilation on CPUs and GPUs there are no weight compression techniques, so this contributes to rodata bloat or flash bloat.

2. Even if it doesn't make sense to keep this in the final representation, to be expanded at runtime either as an op for the Relay VM or a new op in the graph runtime: does this operator appear in other frameworks, and if so, does it make sense to keep this behaviour common across the stack so we can reuse the logic for other frontends as well? That is, the various frontends would lower to a relay.op.densify() whose result represents a dense weight tensor, and further legalization could then create that dense weight tensor.
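For anyone following along, a minimal sketch of what the densify step computes (this is an illustration in plain NumPy, not the actual TVM implementation; the function name `densify_csr` is hypothetical):

```python
import numpy as np

def densify_csr(data, indices, indptr, shape):
    """Expand a CSR-encoded sparse matrix into a dense ndarray.

    This mirrors what a compile-time densify pass does to sparse
    constant weights: the result is an ordinary dense tensor that
    downstream ops can consume directly.
    """
    dense = np.zeros(shape, dtype=data.dtype)
    for row in range(shape[0]):
        # indptr[row]:indptr[row + 1] spans the non-zeros of this row
        for k in range(indptr[row], indptr[row + 1]):
            dense[row, indices[k]] = data[k]
    return dense

# Sparse weights [[0, 2, 0], [3, 0, 4]] in CSR form
data = np.array([2.0, 3.0, 4.0])
indices = np.array([1, 0, 2])
indptr = np.array([0, 1, 3])
dense = densify_csr(data, indices, indptr, (2, 3))
```

Doing this at compile time folds the sparse constant into a dense one; doing it at runtime (e.g. via sparse_to_dense) keeps the binary small but pays the expansion cost on each load, which is the trade-off discussed above.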
One of the principles of compiler design is to bind things as late as possible, to give us a chance of optimizing and of recovering easily without too much reverse engineering, so this comes back to that as well. TBH I would probably prefer #2 as the approach, to reduce duplication in the frontends and to give other compilers, and others who use Relay, a chance to get at the densify op. I don't think we can use it for any of our work immediately. If performance or code size becomes an issue, we'll need to reconsider a runtime or partial-runtime representation of sparse weights, at which point #1 would be a stepping stone towards it.

These are my thoughts on the design.

regards
Ramana

> As to your other comments, I have tried to add comments in the necessary places. Please have a look, and let me know if anything more is required. TIA!

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
