On 4/29/09 11:03, "Roland Dreier" <[email protected]> wrote:
>> But whacky situations might occur in a multithreaded application where
>> one thread calls free() while another thread calls malloc(), gets the
>> same virtual address that was just free()d but has not yet been
>> unregistered in the kernel, so a subsequent ibv_post_send() may
>> succeed but be sending the wrong data.
>>
>> Put simply: in a multi-threaded application, there's always the chance
>> that the notify won't get to the user-level process until after the
>> global notifier variable has been checked, right? Or, putting it the
>> other way: is there any kind of notify system that could be used that
>> *can't* create a potential race condition in a multi-threaded user
>> application?
>
> Without thinking too much about the proposal (except that it adds a lot
> of new verb interfaces and a lot of kernel code, and therefore feels
> like a hassle to me), I don't see how this race is solved by moving a
> cache to the kernel.

If you think this sounds like a hassle, think about what it looks like
from the point of view of the MPI implementer (or any other developer
writing libraries that sit between user data and OFED, like GASNet). We
don't write kernel modules, can't do much to change libc, and have to
compete on performance (particularly on benchmarks that send large
messages from the same buffer). We're forced into a library-level pin
cache to get competitive performance, but we don't have the hooks to do
it properly. Instead, we try a whole list of hacks to intercept free()
and munmap() and hope for the best, often missing (I've sketched what
one of those hacks looks like at the end of this mail).

And OpenFabrics is the only "commodity" interface that makes
implementers go through these pains. Myrinet's MX, Cray's Portals, and
Quadrics' Tports all handle these issues at either the driver-library
or kernel-module level.

One statistic I like to point out (as a supporter of proper offload
interconnects and interfaces) is that there are 13,363 lines of code to
support InfiniBand within Open MPI, and that doesn't include the logic
for pin caching, message matching, request management, or multi-NIC
striping. There are 4,560 lines of code to support Cray Portals, and
that includes all of the logic for pin caching, message matching,
request management, and multi-NIC striping. Guess which one I think is
more complex and feels like a hassle?

> If you have free()/malloc() of a buffer running in parallel with send
> operations targeting the same buffer, then that seems like a buggy MPI
> application. Since free()/malloc() might not involve the kernel at all
> (the userspace library might keep its own free list, etc) I don't see
> how a registration cache in the kernel would help anyway.
>
> Now, since free()/malloc() operations must be serialized with respect to
> send/receive operations in userspace anyway, I don't see why a simpler
> (and possibly more flexible/powerful) kernel notifier design can't
> work -- if free() releases virtual memory back to the kernel, then the
> kernel notifier will run before the free() call returns, so things
> should work as planned.

Jeff and I talked for a while today, and we're pretty sure that as long
as the byte set by the kernel notifier is written before the pages are
returned to the unallocated list, there isn't actually a race condition.
It does mean that every time the registration cache is searched we also
have to check the byte (and likely take a cache miss), but that's not
too evil.
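To make that concrete, the lookup path we have in mind would look
roughly like the sketch below. This is only a sketch: the notifier_byte
mapping and the regcache_* helpers are names I made up for illustration,
on the assumption that the kernel sets the byte (mapped read-only into
the process) before it recycles any pages.

    #include <stddef.h>
    #include <stdint.h>

    extern volatile uint8_t *notifier_byte;   /* mapped from the kernel */

    struct regcache_entry;                    /* a cached registration */
    struct regcache_entry *regcache_find(void *addr, size_t len);
    void regcache_invalidate_released(void);  /* drop entries for pages the
                                                 kernel says were released */

    struct regcache_entry *pin_cache_lookup(void *addr, size_t len)
    {
        /* The extra load (and likely cache miss) on every lookup. */
        if (*notifier_byte) {
            /* Pages went back to the kernel since we last looked, so
               fetch the released-page list and drop any stale
               registrations before trusting the cache.  How the kernel
               hands us that list, and how the byte gets cleared
               atomically with respect to it, are the open problems
               discussed below. */
            regcache_invalidate_released();
        }
        return regcache_find(addr, len);
    }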
However, there's still the problem with the notifier concept of how the
kernel reports which pages were given back to it. It has to pass a
(potentially very large) amount of data back to the user, so the
memory-ownership issues between kernel and user space are interesting.
It also has to prepare the list and unset the notifier byte more or less
atomically, which is also problematic, but probably workable. So perhaps
the notifier method would be sufficient after all.
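Since I mentioned the interception hacks above, here is roughly what one
looks like today: a munmap() interposer loaded via LD_PRELOAD, so the
pin cache can invalidate entries before the pages can be reused. Again
just a sketch; regcache_invalidate() is a made-up name, and a real
version also has to catch frees that never reach munmap(), which is
exactly where "hope for the best" comes in.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>
    #include <sys/mman.h>

    void regcache_invalidate(void *addr, size_t len);  /* hypothetical */

    int munmap(void *addr, size_t len)
    {
        static int (*real_munmap)(void *, size_t);

        if (!real_munmap)
            real_munmap = (int (*)(void *, size_t))
                dlsym(RTLD_NEXT, "munmap");

        /* Drop any cached registrations covering [addr, addr + len)
           before the kernel can hand these pages to someone else. */
        regcache_invalidate(addr, len);

        return real_munmap(addr, len);
    }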
Brian

--
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories