Brian wrote,
>There's an application at Sandia and at Los Alamos which both of which cause
>problems for our linker tricks.  This leads to such things as (proven)
>silent data corruption.  Perhaps your users are just getting silent data
>corruption and not doing enough validation and verification to know it?  Or
>maybe Intel's just gotten lucky - the majority of our applications seem to
>have no issue with the registration cache.  But the outliers with proven
>data corruption are the kind of things that keep me up at night.

Have you tried these applications with any MPI other than OpenMPI? That is, does this corruption happen with Intel MPI and other MPIs as well? If it is specific to OpenMPI, then perhaps it is just a bug in OpenMPI that can be fixed.

>We came with a real problem we're having with code development in real-world
>applications, presented two solutions, and were essentially told to take a
>hike.  If this sounds like a lot of whining to the OFA community, than the
>OFA community shouldn't be surprised that the VERBS adoption rate is as poor
>as it is.

Of the solutions that have been presented so far, I think the kernel notifier approach would be the better one. Besides the kernel bloat and complexity of a memory registration cache in the kernel, I am not sure it would really be able to work the way you would want. For example, the kernel has no way of knowing when an application calls free(), because free() may not call into the kernel to release the memory. It often just puts the freed block on a free list within libc in user-space. So with a kernel memory registration cache, from the kernel's perspective this block would still appear to be in use and could not be evicted from the cache. The cache could end up filling with registrations for memory that has already been free()'d in user-space but is sitting on some free list in libc.

This is another reason why I think the caching should be done in user-space. If libc does not provide the hooks needed to intercept all of the appropriate routines, then perhaps you should ask the libc maintainers to add what you need, possibly combined with the kernel notifier design that Roland suggested.

Anyway, that would be my suggestion,

woody
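
P.S. To make the free-list point concrete, here is a toy model in Python. It is only a sketch: ToyAllocator, RegistrationCache and run_demo are hypothetical names invented for this illustration, not libc, verbs, or OpenMPI interfaces, and the allocator is deliberately simplified so that free() never hands memory back to the "kernel" (a real allocator does return some memory, for example large mmap'd blocks). It compares a cache that can only evict when the kernel sees a release with one that evicts from a user-space free() hook.

```python
class ToyAllocator:
    """Hypothetical stand-in for libc malloc/free with a user-space free list."""

    def __init__(self):
        self.next_addr = 0x1000    # next fresh address the "kernel" would hand out
        self.free_list = {}        # size -> addresses freed but kept in user-space
        self.sizes = {}            # addr -> size for blocks currently in use
        self.kernel_releases = 0   # would count releases to the kernel; never bumped
        self.free_hooks = []       # user-space callbacks run on every free()

    def malloc(self, size):
        bucket = self.free_list.get(size)
        if bucket:
            addr = bucket.pop()    # reuse a freed block, kernel not involved
        else:
            addr = self.next_addr  # "ask the kernel" for fresh memory
            self.next_addr += size
        self.sizes[addr] = size
        return addr

    def free(self, addr):
        size = self.sizes.pop(addr)
        for hook in self.free_hooks:
            hook(addr, size)       # only user-space observers learn about this
        # The block is parked on the free list; nothing is returned to the
        # kernel, so kernel_releases stays at zero.
        self.free_list.setdefault(size, []).append(addr)


class RegistrationCache:
    """Hypothetical registration cache keyed by buffer address."""

    def __init__(self):
        self.entries = {}          # addr -> registered length

    def register(self, addr, size):
        hit = self.entries.get(addr) == size
        self.entries[addr] = size
        return hit

    def evict(self, addr, size=None):
        self.entries.pop(addr, None)


def run_demo(n_buffers=4, size=4096):
    libc = ToyAllocator()
    kernel_cache = RegistrationCache()  # could only evict on a kernel-visible release
    user_cache = RegistrationCache()    # evicts from the free() hook
    libc.free_hooks.append(user_cache.evict)

    addrs = [libc.malloc(size) for _ in range(n_buffers)]
    for a in addrs:
        kernel_cache.register(a, size)
        user_cache.register(a, size)
    for a in addrs:
        libc.free(a)

    return {
        "kernel_releases": libc.kernel_releases,
        "kernel_cache_entries": len(kernel_cache.entries),
        "user_cache_entries": len(user_cache.entries),
        "reused_address": libc.malloc(size) in addrs,
    }


if __name__ == "__main__":
    print(run_demo())
```

In this model the kernel-side cache keeps one entry for every freed buffer because it never sees a release, while the hooked user-space cache ends up empty. The next malloc() of the same size also returns one of the freed addresses, which is the case where a cache entry left behind can go stale.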
