> You mentioned that doing this stuff is a choice; the choice that
> MPI's/ULPs/applications therefore have is:
>
>  - don't use registration caches/memory allocation hooking, have
>    terrible performance
>  - use registration caches/memory allocation hooking, have good
>    performance
I think it's a bit of a stretch to suggest that all or even most userspace RDMA applications have the same need for registration caching as MPI.  In fact, my feeling is that MPI's situation -- having to do RDMA to arbitrary memory allocated by the application, outside MPI's control -- is the exception.  My most recent experience was with Cisco's RAB library, and in that case we simply designed the library so that all RDMA was done to memory allocated by the library -- so no need for a registration cache, and in fact no need for registration in any fast path.  I suspect that the majority of code written to use RDMA natively will be designed with similar properties.  So this proposal is very much an MPI-specific interface.  Which leads to my next point.

I have no doubt that the MPI community has a very good idea of a memory registration interface that would make MPI implementations simpler and more robust.  However, I don't think there's quite as much expertise about the best way to implement such an interface.  My initial reaction is that I don't want to extend the kernel ABI with a set of new MPI-specific verbs if there's a way around it.

We've been told over and over that the registration cache is complex and fragile code -- but moving complex and fragile code into the kernel doesn't magically make it any simpler or more robust; it just means that bugs now crash the whole system instead of just affecting one process.

Now, of course MMU notifiers allow the kernel to know reliably when a process's page tables change, which means that all the complicated malloc hooking etc. is not needed.  So that complexity is avoided in the kernel.  But suppose I give userspace the same MMU notifier capability (e.g. I add a system call like "if any mappings in the virtual address range X ... Y change, then write a 1 to virtual address Z") -- then what do I gain from having the rest of the registration caching in the kernel?  (And avoiding the duplication of caching code between multiple MPI implementations is not an answer -- it's quite feasible to put the caching code into libibverbs if that's the best place for it; see the sketch below.)

 - R.
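
P.S.  To make that last point concrete, here is a rough sketch of what the caching could look like in userspace (in libibverbs or wherever else we decide) if a "watch this address range" call existed.  The mmu_watch() function below is entirely made up -- it just stands in for the hypothetical notification mechanism described above -- while ibv_reg_mr()/ibv_dereg_mr() are the existing verbs.  This is an illustration, not a proposed implementation:

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    struct reg_cache_entry {
            void              *addr;
            size_t             len;
            struct ibv_mr     *mr;
            volatile uint32_t  dirty;   /* kernel writes 1 here when mappings change */
    };

    /* Placeholder for the hypothetical "watch this range" system call;
     * a real version would arm an MMU notifier in the kernel. */
    static int mmu_watch(void *start, size_t len, volatile uint32_t *flag)
    {
            (void) start; (void) len; (void) flag;
            return 0;
    }

    /* Return a valid MR covering [addr, addr + len), reusing the cached
     * registration as long as the watched range hasn't changed. */
    static struct ibv_mr *cache_get_mr(struct ibv_pd *pd,
                                       struct reg_cache_entry *e,
                                       void *addr, size_t len)
    {
            if (e->mr && !e->dirty &&
                (char *) addr >= (char *) e->addr &&
                (char *) addr + len <= (char *) e->addr + e->len)
                    return e->mr;           /* fast path: no kernel involvement */

            if (e->mr)
                    ibv_dereg_mr(e->mr);    /* cached registration is stale */

            e->addr  = addr;
            e->len   = len;
            e->dirty = 0;
            e->mr    = ibv_reg_mr(pd, addr, len,
                                  IBV_ACCESS_LOCAL_WRITE |
                                  IBV_ACCESS_REMOTE_READ |
                                  IBV_ACCESS_REMOTE_WRITE);
            if (e->mr)
                    mmu_watch(addr, len, &e->dirty);

            return e->mr;
    }

Whether the flag is a plain store, a futex wake, or something fancier is a detail; the point is that once the kernel can flip it reliably, the caching policy itself is ordinary userspace code and can live wherever we think is best.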
