RE: [ofa-general] Memory registration redux
Hey Roland,

One question from my MPI guys. Looks like you have added the ability to have more than one version of the device to allow future versions, i.e., the .intf_version in the register call:

    struct ummunot_register_ioctl r = {
        .intf_version = UMMUNOT_INTF_VERSION,
        .start        = (unsigned long) buf,
        .end          = (unsigned long) buf + size,
        .user_cookie  = cookie,
    };

I assume there will be some ioctl to allow a program to discover at runtime the version(s) of the device that are supported on a particular system?

woody

-----Original Message-----
From: general-boun...@lists.openfabrics.org [mailto:general-boun...@lists.openfabrics.org] On Behalf Of Roland Dreier
Sent: Tuesday, May 26, 2009 4:14 PM
To: Jason Gunthorpe
Cc: Pavel Shamis; Hans Westgaard Ry; Dontje; Lenny Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General; Supalov, Alexander
Subject: Re: [ofa-general] Memory registration redux

Here's the test program:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/types.h>
#include <linux/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

#define UMMUNOT_INTF_VERSION 1

enum {
    UMMUNOT_EVENT_TYPE_INVAL = 0,
    UMMUNOT_EVENT_TYPE_LAST  = 1,
};

enum {
    UMMUNOT_EVENT_FLAG_HINT = 1 << 0,
};

/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)
 *
 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.
 */
struct ummunot_event {
    __u32 type;
    __u32 flags;
    __u64 hint_start;
    __u64 hint_end;
    __u64 user_cookie_counter;
};

struct ummunot_register_ioctl {
    __u32 intf_version;   /* in */
    __u32 reserved1;
    __u64 start;          /* in */
    __u64 end;            /* in */
    __u64 user_cookie;    /* in */
};

#define UMMUNOT_MAGIC 'U'

#define UMMUNOT_REGISTER_REGION   _IOWR(UMMUNOT_MAGIC, 1, \
                                        struct ummunot_register_ioctl)
#define UMMUNOT_UNREGISTER_REGION _IOW(UMMUNOT_MAGIC, 2, __u64)

static int umn_fd;
static volatile unsigned long long *umn_counter;

static int umn_init(void)
{
    umn_fd = open("/dev/ummunot", O_RDONLY);
    if (umn_fd < 0) {
        perror("open");
        return 1;
    }

    umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ,
                       MAP_SHARED, umn_fd, 0);
    if (umn_counter == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    return 0;
}

static int umn_register(void *buf, size_t size, __u64 cookie)
{
    struct ummunot_register_ioctl r = {
        .intf_version = UMMUNOT_INTF_VERSION,
        .start        = (unsigned long) buf,
        .end          = (unsigned long) buf + size,
        .user_cookie  = cookie,
    };

    if (ioctl(umn_fd, UMMUNOT_REGISTER_REGION, &r)) {
        perror("ioctl");
        return 1;
    }

    return 0;
}

static int umn_unregister(__u64 cookie)
{
    if (ioctl(umn_fd, UMMUNOT_UNREGISTER_REGION, cookie)) {
        perror("ioctl");
        return 1;
    }

    return 0;
}

int main(int argc, char *argv[])
{
    int page_size = sysconf(_SC_PAGESIZE);
    void *t;

    if (umn_init())
        return 1;

    if (*umn_counter != 0) {
        fprintf(stderr, "counter = %lld (expected 0)\n", *umn_counter);
        return 1;
    }

    t = mmap(NULL, 3 * page_size, PROT_READ,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    if (umn_register(t, 3 * page_size, 123))
        return 1;

    munmap(t + page_size, page_size);

    printf("ummunot events: %lld\n", *umn_counter);

    if (*umn_counter > 0) {
        struct ummunot_event ev[2];
        int len;
        int i;

        len = read(umn_fd, ev, sizeof ev);
        printf("read %d events (%d tot)\n", len / sizeof ev[0], len);

        for (i = 0; i < len / sizeof ev[0]; ++i) {
            switch (ev[i].type) {
            case UMMUNOT_EVENT_TYPE_INVAL:
                printf("[%3d]: inval cookie %lld\n", i,
                       ev[i].user_cookie_counter
Re: [ofa-general] Memory registration redux
I assume there will be some ioctl to allow a program to discover at runtime the version(s) of the device that are supported on a particular system?

Yeah, I guess. I haven't really thought through the forwards compat completely, I guess.

_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Memory registration redux
Hi,

Intel MPI developers are in principle OK with this proposal. What way of delivery is envisioned? Will this become a part of OFED or of the mainstream kernel? How fast will it spread? Are there any comparable Windows plans?

Best regards.

Alexander

-----Original Message-----
From: Supalov, Alexander
Sent: Wednesday, June 03, 2009 12:26 PM
To: 'Roland Dreier'
Cc: Jeff Squyres; Pavel Shamis; Hans Westgaard Ry; Dontje; Lenny Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General
Subject: RE: [ofa-general] Memory registration redux

Thanks. This is what I was looking for. Let me pass this by the key Intel MPI developers and get back to you.

-----Original Message-----
From: Roland Dreier [mailto:rdre...@cisco.com]
Sent: Tuesday, June 02, 2009 6:45 PM
To: Supalov, Alexander
Cc: Jeff Squyres; Pavel Shamis; Hans Westgaard Ry; Dontje; Lenny Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General
Subject: Re: [ofa-general] Memory registration redux

Sorry, it's kind of difficult to deduce, looking at this Q&A sequence, what works how and when. Is it possible to create a brief and direct description of the proposed solution?

Did you see the original patch description I sent:

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925 and follow-up messages, libraries using RDMA would like to track precisely when application code changes memory mapping via free(), munmap(), etc. Current pure-userspace solutions using malloc hooks and other tricks are not robust, and the feeling among experts is that the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email linked above but rather with a simpler and more generic interface, which may be useful in other contexts. Specifically, we implement a new character device driver, ummunot, that creates a /dev/ummunot node. A userspace process can open this node read-only and use the fd as follows:

1. ioctl() to register/unregister an address range to watch in the kernel (cf. struct ummunot_register_ioctl in <linux/ummunot.h>).

2. read() to retrieve events generated when a mapping in a watched address range is invalidated (cf. struct ummunot_event in <linux/ummunot.h>). select()/poll()/epoll() and SIGIO are handled for this IO.

3. mmap() one page at offset 0 to map a kernel page that contains a generation counter that is incremented each time an event is generated. This allows userspace to have a fast path that checks that no events have occurred without a system call.

-----
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456
Ust.-IdNr./VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
Re: [ofa-general] Memory registration redux
Supalov, Alexander wrote:

Hi, Intel MPI developers are in principle OK with this proposal. What way of delivery is envisioned? Will this become a part of OFED or of the mainstream kernel?

Roland is planning to push it to kernel 2.6.31, and OFED will take it from the kernel. We will check if we can do backports for distros. I assume it will be available only for distros that have the MMU notifiers in the kernel.

How fast will it spread? Are there any comparable Windows plans?

I cannot answer about Windows.

Tziporet
RE: [ofa-general] Memory registration redux
Are there any comparable Windows plans?

I believe that Windows already provides equivalent functionality as part of the OS (Windows 2008 / Vista). See SecureMemoryCacheCallback. There are no plans for WinOF to provide anything separately from this. (It's likely impossible without OS support anyway.)

- Sean
RE: [ofa-general] Memory registration redux
Thanks. This is what I was looking for. Let me pass this by the key Intel MPI developers and get back to you.

-----Original Message-----
From: Roland Dreier [mailto:rdre...@cisco.com]
Sent: Tuesday, June 02, 2009 6:45 PM
To: Supalov, Alexander
Cc: Jeff Squyres; Pavel Shamis; Hans Westgaard Ry; Dontje; Lenny Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General
Subject: Re: [ofa-general] Memory registration redux

Sorry, it's kind of difficult to deduce, looking at this Q&A sequence, what works how and when. Is it possible to create a brief and direct description of the proposed solution?

Did you see the original patch description I sent:

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925 and follow-up messages, libraries using RDMA would like to track precisely when application code changes memory mapping via free(), munmap(), etc. Current pure-userspace solutions using malloc hooks and other tricks are not robust, and the feeling among experts is that the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email linked above but rather with a simpler and more generic interface, which may be useful in other contexts. Specifically, we implement a new character device driver, ummunot, that creates a /dev/ummunot node. A userspace process can open this node read-only and use the fd as follows:

1. ioctl() to register/unregister an address range to watch in the kernel (cf. struct ummunot_register_ioctl in <linux/ummunot.h>).

2. read() to retrieve events generated when a mapping in a watched address range is invalidated (cf. struct ummunot_event in <linux/ummunot.h>). select()/poll()/epoll() and SIGIO are handled for this IO.

3. mmap() one page at offset 0 to map a kernel page that contains a generation counter that is incremented each time an event is generated. This allows userspace to have a fast path that checks that no events have occurred without a system call.
RE: [ofa-general] Memory registration redux
Hi,

Sorry, it's kind of difficult to deduce, looking at this Q&A sequence, what works how and when. Is it possible to create a brief and direct description of the proposed solution?

Best regards.

Alexander

-----Original Message-----
From: Jeff Squyres [mailto:jsquy...@cisco.com]
Sent: Wednesday, May 27, 2009 9:03 PM
To: Roland Dreier (rdreier)
Cc: Pavel Shamis; Hans Westgaard Ry; Dontje; Lenny Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General; Supalov, Alexander
Subject: Re: [ofa-general] Memory registration redux

Other MPI implementors -- what do you think of this scheme?

On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:

/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)

I don't quite grok this. Is the intent that HINT will only be set if an *entire* hint_start/hint_end range is invalidated by a single event? I.e., if only part of the hint_start/hint_end range is invalidated, you'll get the cookie back, but not what part of the range is invalid (because assumedly the entire IBV registration is now invalid anyway)?

Basically, I just keep one hint_start/hint_end. If multiple events hit the same registration then I just give up and don't give you a hint.

 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.

Just to be clear -- we're supposed to keep reading events until we get a LAST event?

Yes, that's probably the sanest use case.

1. Will it increase by 1 each time a page (or set of pages?) is removed from a user process?
As it stands it increases by 1 every time there is an MMU notification, even if that notification hits multiple registrations. It wouldn't be hard to change that to count the number of events generated, if that works better.

2. Does it change if pages are *added* to a user process? I.e., does the counter indicate *removals* or *changes* to the user process page table?

No, additions don't trigger any MMU notification -- that's inherent in the design of the MMU notifiers stuff. The idea is that you have a secondary MMU, and MMU notifications are the equivalent of TLB shootdowns; the secondary MMU is responsible for populating itself on faults etc.

Is the *umn_counter value guaranteed to have been changed by the time munmap() returns?

Yes.

Did you pick [2] here simply because you're only expecting an INVAL and a LAST event in this specific example? I'm assuming that we should normally loop over reading until we get LAST, correct?

Right.

What happens if I register multiple regions with the same cookie value?

You get in trouble -- I need to fix things to reject duplicated cookies, actually, because otherwise there's no way to unregister.

Is a process responsible for guaranteeing that it umn_unregister()s everything before exiting, or will all pending registrations be cleaned up/unregistered/whatever when a process exits?

The kernel cleans up, of course, to handle crashes etc.

- R.

--
Jeff Squyres
Cisco Systems
Re: [ofa-general] Memory registration redux
Sorry, it's kind of difficult to deduce, looking at this Q&A sequence, what works how and when. Is it possible to create a brief and direct description of the proposed solution?

Did you see the original patch description I sent:

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925 and follow-up messages, libraries using RDMA would like to track precisely when application code changes memory mapping via free(), munmap(), etc. Current pure-userspace solutions using malloc hooks and other tricks are not robust, and the feeling among experts is that the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email linked above but rather with a simpler and more generic interface, which may be useful in other contexts. Specifically, we implement a new character device driver, ummunot, that creates a /dev/ummunot node. A userspace process can open this node read-only and use the fd as follows:

1. ioctl() to register/unregister an address range to watch in the kernel (cf. struct ummunot_register_ioctl in <linux/ummunot.h>).

2. read() to retrieve events generated when a mapping in a watched address range is invalidated (cf. struct ummunot_event in <linux/ummunot.h>). select()/poll()/epoll() and SIGIO are handled for this IO.

3. mmap() one page at offset 0 to map a kernel page that contains a generation counter that is incremented each time an event is generated. This allows userspace to have a fast path that checks that no events have occurred without a system call.
Re: [ofa-general] Memory registration redux
The scheme looks fine to me!

Hans W. Ry

Jeff Squyres wrote:

Other MPI implementors -- what do you think of this scheme?

On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:

/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)

I don't quite grok this. Is the intent that HINT will only be set if an *entire* hint_start/hint_end range is invalidated by a single event? I.e., if only part of the hint_start/hint_end range is invalidated, you'll get the cookie back, but not what part of the range is invalid (because assumedly the entire IBV registration is now invalid anyway)?

Basically, I just keep one hint_start/hint_end. If multiple events hit the same registration then I just give up and don't give you a hint.

 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.

Just to be clear -- we're supposed to keep reading events until we get a LAST event?

Yes, that's probably the sanest use case.

1. Will it increase by 1 each time a page (or set of pages?) is removed from a user process?

As it stands it increases by 1 every time there is an MMU notification, even if that notification hits multiple registrations. It wouldn't be hard to change that to count the number of events generated, if that works better.

2. Does it change if pages are *added* to a user process? I.e., does the counter indicate *removals* or *changes* to the user process page table?

No, additions don't trigger any MMU notification -- that's inherent in the design of the MMU notifiers stuff.
The idea is that you have a secondary MMU, and MMU notifications are the equivalent of TLB shootdowns; the secondary MMU is responsible for populating itself on faults etc.

Is the *umn_counter value guaranteed to have been changed by the time munmap() returns?

Yes.

Did you pick [2] here simply because you're only expecting an INVAL and a LAST event in this specific example? I'm assuming that we should normally loop over reading until we get LAST, correct?

Right.

What happens if I register multiple regions with the same cookie value?

You get in trouble -- I need to fix things to reject duplicated cookies, actually, because otherwise there's no way to unregister.

Is a process responsible for guaranteeing that it umn_unregister()s everything before exiting, or will all pending registrations be cleaned up/unregistered/whatever when a process exits?

The kernel cleans up, of course, to handle crashes etc.

- R.
Re: [ofa-general] Memory registration redux
Sounds good to me.

Jeff Squyres wrote:

Other MPI implementors -- what do you think of this scheme?

On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:

/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)

I don't quite grok this. Is the intent that HINT will only be set if an *entire* hint_start/hint_end range is invalidated by a single event? I.e., if only part of the hint_start/hint_end range is invalidated, you'll get the cookie back, but not what part of the range is invalid (because assumedly the entire IBV registration is now invalid anyway)?

Basically, I just keep one hint_start/hint_end. If multiple events hit the same registration then I just give up and don't give you a hint.

 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.

Just to be clear -- we're supposed to keep reading events until we get a LAST event?

Yes, that's probably the sanest use case.

1. Will it increase by 1 each time a page (or set of pages?) is removed from a user process?

As it stands it increases by 1 every time there is an MMU notification, even if that notification hits multiple registrations. It wouldn't be hard to change that to count the number of events generated, if that works better.

2. Does it change if pages are *added* to a user process? I.e., does the counter indicate *removals* or *changes* to the user process page table?

No, additions don't trigger any MMU notification -- that's inherent in the design of the MMU notifiers stuff.
The idea is that you have a secondary MMU, and MMU notifications are the equivalent of TLB shootdowns; the secondary MMU is responsible for populating itself on faults etc.

Is the *umn_counter value guaranteed to have been changed by the time munmap() returns?

Yes.

Did you pick [2] here simply because you're only expecting an INVAL and a LAST event in this specific example? I'm assuming that we should normally loop over reading until we get LAST, correct?

Right.

What happens if I register multiple regions with the same cookie value?

You get in trouble -- I need to fix things to reject duplicated cookies, actually, because otherwise there's no way to unregister.

Is a process responsible for guaranteeing that it umn_unregister()s everything before exiting, or will all pending registrations be cleaned up/unregistered/whatever when a process exits?

The kernel cleans up, of course, to handle crashes etc.

- R.
Re: [ofa-general] Memory registration redux
On May 26, 2009, at 7:13 PM, Roland Dreier (rdreier) wrote:

/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)

I don't quite grok this. Is the intent that HINT will only be set if an *entire* hint_start/hint_end range is invalidated by a single event? I.e., if only part of the hint_start/hint_end range is invalidated, you'll get the cookie back, but not what part of the range is invalid (because assumedly the entire IBV registration is now invalid anyway)?

 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.

Just to be clear -- we're supposed to keep reading events until we get a LAST event?

    if (*umn_counter != 0) {
        fprintf(stderr, "counter = %lld (expected 0)\n", *umn_counter);
        return 1;
    }

Some clarification questions about umn_counter:

1. Will it increase by 1 each time a page (or set of pages?) is removed from a user process?

2. Does it change if pages are *added* to a user process? I.e., does the counter indicate *removals* or *changes* to the user process page table?

    t = mmap(NULL, 3 * page_size, PROT_READ,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    if (umn_register(t, 3 * page_size, 123))
        return 1;

    munmap(t + page_size, page_size);

    printf("ummunot events: %lld\n", *umn_counter);

    if (*umn_counter > 0) {

Is the *umn_counter value guaranteed to have been changed by the time munmap() returns?

        struct ummunot_event ev[2];

Did you pick [2] here simply because you're only expecting an INVAL and a LAST event in this specific example?
I'm assuming that we should normally loop over reading until we get LAST, correct?

        int len;
        int i;

        len = read(umn_fd, ev, sizeof ev);
        printf("read %d events (%d tot)\n", len / sizeof ev[0], len);

        for (i = 0; i < len / sizeof ev[0]; ++i) {
            switch (ev[i].type) {
            case UMMUNOT_EVENT_TYPE_INVAL:
                printf("[%3d]: inval cookie %lld\n", i,
                       ev[i].user_cookie_counter);
                if (ev[i].flags & UMMUNOT_EVENT_FLAG_HINT)
                    printf("  hint %llx...%llx\n",
                           ev[i].hint_start, ev[i].hint_end);
                break;
            case UMMUNOT_EVENT_TYPE_LAST:
                printf("[%3d]: empty up to %lld\n", i,
                       ev[i].user_cookie_counter);
                break;
            default:
                printf("[%3d]: unknown event type %d\n",
                       i, ev[i].type);
                break;
            }
        }
    }

    umn_unregister(123);

What happens if I register multiple regions with the same cookie value?

Is a process responsible for guaranteeing that it umn_unregister()s everything before exiting, or will all pending registrations be cleaned up/unregistered/whatever when a process exits?

--
Jeff Squyres
Cisco Systems
Re: [ofa-general] Memory registration redux
/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)

I don't quite grok this. Is the intent that HINT will only be set if an *entire* hint_start/hint_end range is invalidated by a single event? I.e., if only part of the hint_start/hint_end range is invalidated, you'll get the cookie back, but not what part of the range is invalid (because assumedly the entire IBV registration is now invalid anyway)?

Basically, I just keep one hint_start/hint_end. If multiple events hit the same registration then I just give up and don't give you a hint.

 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.

Just to be clear -- we're supposed to keep reading events until we get a LAST event?

Yes, that's probably the sanest use case.

1. Will it increase by 1 each time a page (or set of pages?) is removed from a user process?

As it stands it increases by 1 every time there is an MMU notification, even if that notification hits multiple registrations. It wouldn't be hard to change that to count the number of events generated, if that works better.

2. Does it change if pages are *added* to a user process? I.e., does the counter indicate *removals* or *changes* to the user process page table?

No, additions don't trigger any MMU notification -- that's inherent in the design of the MMU notifiers stuff. The idea is that you have a secondary MMU, and MMU notifications are the equivalent of TLB shootdowns; the secondary MMU is responsible for populating itself on faults etc.
Is the *umn_counter value guaranteed to have been changed by the time munmap() returns?

Yes.

Did you pick [2] here simply because you're only expecting an INVAL and a LAST event in this specific example? I'm assuming that we should normally loop over reading until we get LAST, correct?

Right.

What happens if I register multiple regions with the same cookie value?

You get in trouble -- I need to fix things to reject duplicated cookies, actually, because otherwise there's no way to unregister.

Is a process responsible for guaranteeing that it umn_unregister()s everything before exiting, or will all pending registrations be cleaned up/unregistered/whatever when a process exits?

The kernel cleans up, of course, to handle crashes etc.

- R.
Re: [ofa-general] Memory registration redux
Fixed version below -- returns EINVAL for an attempt to reuse a user cookie (since otherwise unregister would get confused).

===

ummunot: Userspace support for MMU notifications

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925 and follow-up messages, libraries using RDMA would like to track precisely when application code changes memory mapping via free(), munmap(), etc. Current pure-userspace solutions using malloc hooks and other tricks are not robust, and the feeling among experts is that the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email linked above but rather with a simpler and more generic interface, which may be useful in other contexts. Specifically, we implement a new character device driver, ummunot, that creates a /dev/ummunot node. A userspace process can open this node read-only and use the fd as follows:

1. ioctl() to register/unregister an address range to watch in the kernel (cf. struct ummunot_register_ioctl in <linux/ummunot.h>).

2. read() to retrieve events generated when a mapping in a watched address range is invalidated (cf. struct ummunot_event in <linux/ummunot.h>). select()/poll()/epoll() and SIGIO are handled for this IO.

3. mmap() one page at offset 0 to map a kernel page that contains a generation counter that is incremented each time an event is generated. This allows userspace to have a fast path that checks that no events have occurred without a system call.
NOT-YET-Signed-off-by: Roland Dreier <rola...@cisco.com>
---
 drivers/char/Kconfig    |  12 ++
 drivers/char/Makefile   |   1 +
 drivers/char/ummunot.c  | 457 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/ummunot.h |  85 ++++++++++
 4 files changed, 555 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunot.c
 create mode 100644 include/linux/ummunot.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..91fe068 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,18 @@ config DEVPORT
 	depends on ISA || PCI
 	default y

+config UMMUNOT
+	tristate "Userspace MMU notifications"
+	select MMU_NOTIFIER
+	help
+	  The ummunot (userspace MMU notification) driver creates a
+	  character device that can be used by userspace libraries to
+	  get notifications when an application's memory mapping
+	  changes.  This is used, for example, by RDMA libraries to
+	  improve the reliability of memory registration caching, since
+	  the kernel's MMU notifications can be used to know precisely
+	  when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"

 endmenu

diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..dcbcd7c 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO)	+= cs5535_gpio.o
 obj-$(CONFIG_GPIO_VR41XX)	+= vr41xx_giu.o
 obj-$(CONFIG_GPIO_TB0219)	+= tb0219.o
 obj-$(CONFIG_TELCLOCK)		+= tlclk.o
+obj-$(CONFIG_UMMUNOT)		+= ummunot.o

 obj-$(CONFIG_MWAVE)		+= mwave/
 obj-$(CONFIG_AGP)		+= agp/

diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c
new file mode 100644
index 0000000..1341edc
--- /dev/null
+++ b/drivers/char/ummunot.c
@@ -0,0 +1,457 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT.  IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include
Re: [ofa-general] Memory registration redux
Other MPI implementors -- what do you think of this scheme?

On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:

>>> /*
>>>  * If type field is INVAL, then user_cookie_counter holds the
>>>  * user_cookie for the region being reported; if the HINT flag is set
>>>  * then hint_start/hint_end hold the start and end of the mapping that
>>>  * was invalidated.  (If HINT is not set, then multiple events
>>>  * invalidated parts of the registered range and hint_start/hint_end
>>>  * should be ignored)
>>
>> I don't quite grok this.  Is the intent that HINT will only be set if
>> an *entire* hint_start/hint_end range is invalidated by a single
>> event?  I.e., if only part of the hint_start/hint_end range is
>> invalidated, you'll get the cookie back, but not what part of the
>> range is invalid (because assumedly the entire IBV registration is
>> now invalid anyway)?
>
> Basically, I just keep one hint_start/hint_end.  If multiple events
> hit the same registration then I just give up and don't give you a
> hint.
>
>>>  * If type is LAST, then the read operation has emptied the list of
>>>  * invalidated regions, and user_cookie_counter holds the value of the
>>>  * kernel's generation counter when the empty list occurred.  The
>>>  * other fields are not filled in for this event.
>>
>> Just to be clear -- we're supposed to keep reading events until we
>> get a LAST event?
>
> Yes, that's probably the sanest use case.
>
>> 1. Will it increase by 1 each time a page (or set of pages?) is
>> removed from a user process?
>
> As it stands it increases by 1 every time there is an MMU
> notification, even if that notification hits multiple registrations.
> It wouldn't be hard to change that to count the number of events
> generated if that works better.
>
>> 2. Does it change if pages are *added* to a user process?  I.e., does
>> the counter indicate *removals* or *changes* to the user process page
>> table?
>
> No, additions don't trigger any MMU notification -- that's inherent
> in the design of the MMU notifiers stuff.  The idea is that you have
> a secondary MMU and MMU notifications are the equivalent of TLB
> shootdowns; the secondary MMU is responsible for populating itself on
> faults etc.
>
>> Is the *umn_counter value guaranteed to have been changed by the time
>> munmap() returns?
>
> Yes.
>
>> Did you pick [2] here simply because you're only expecting an INVAL
>> and a LAST event in this specific example?  I'm assuming that we
>> should normally loop over reading until we get LAST, correct?
>
> Right.
>
>> What happens if I register multiple regions with the same cookie
>> value?
>
> You get in trouble -- I need to fix things to reject duplicated
> cookies actually, because otherwise there's no way to unregister.
>
>> Is a process responsible for guaranteeing that it umn_unregister()s
>> everything before exiting, or will all pending registrations be
>> cleaned up/unregistered/whatever when a process exits?
>
> The kernel cleans up of course to handle crashes etc.
>
>  - R.

-- 
Jeff Squyres
Cisco Systems

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Memory registration redux
Sigh... real version that returns EINVAL for an attempt to reuse a user
cookie (since otherwise unregister would get confused).  Previous
posting was the old patch, sorry.

===

ummunot: Userspace support for MMU notifications

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks and
other tricks are not robust, and the feeling among experts is that the
issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot node.
A userspace process can open this node read-only and use the fd as
follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunot_event in
    <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
    for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.
Signed-off-by: Roland Dreier rola...@cisco.com
---
 drivers/char/Kconfig    |   12 ++
 drivers/char/Makefile   |    1 +
 drivers/char/ummunot.c  |  469 +++
 include/linux/ummunot.h |   85 +
 4 files changed, 567 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunot.c
 create mode 100644 include/linux/ummunot.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..91fe068 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,18 @@ config DEVPORT
 	depends on ISA || PCI
 	default y
 
+config UMMUNOT
+	tristate "Userspace MMU notifications"
+	select MMU_NOTIFIER
+	help
+	  The ummunot (userspace MMU notification) driver creates a
+	  character device that can be used by userspace libraries to
+	  get notifications when an application's memory mapping
+	  changed.  This is used, for example, by RDMA libraries to
+	  improve the reliability of memory registration caching, since
+	  the kernel's MMU notifications can be used to know precisely
+	  when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu

diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..dcbcd7c 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO)	+= cs5535_gpio.o
 obj-$(CONFIG_GPIO_VR41XX)	+= vr41xx_giu.o
 obj-$(CONFIG_GPIO_TB0219)	+= tb0219.o
 obj-$(CONFIG_TELCLOCK)		+= tlclk.o
+obj-$(CONFIG_UMMUNOT)		+= ummunot.o
 obj-$(CONFIG_MWAVE)		+= mwave/
 obj-$(CONFIG_AGP)		+= agp/

diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c
new file mode 100644
index 000..ebfd038
--- /dev/null
+++ b/drivers/char/ummunot.c
@@ -0,0 +1,469 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include
Re: [ofa-general] Memory registration redux
>>> Or, ignore the overlapping problem, and use your original technique,
>>> slightly modified:
>>>
>>>  - Userspace registers a counter with the kernel.  Kernel pins the
>>>    page, sets up mmu notifiers and increments the counter when
>>>    invalidates intersect with registrations
>>>  - Kernel maintains a linked list of registrations that have been
>>>    invalidated via mmu notifiers using the registration structure
>>>    and a dirty bit
>>>  - Userspace checks the counter at every cache hit, if different it
>>>    calls into the kernel:
>>>
>>>        MR_Cookie *mrs[100];
>>>        int rc = ibv_get_invalid_mrs(mrs, 100);
>>>        invalidate_cache(mrs, rc);
>>>        // Repeat until drained
>>>
>>>    get_invalid_mrs traverses the linked list and returns an
>>>    identifying value to userspace, which looks it up in the cache,
>>>    calls unregister and removes it from the cache.
>>
>> What's the advantage of this?  I have to do the get_invalid_mrs()
>> call a bunch of times, rather than just reading which ones are
>> invalid from the cache directly?
>
> This is a trade off, the above is a more normal kernel API and lets
> the app get a list of changes it can scan.  Having the kernel update
> flags means if the app wants a list of changes it has to scan all
> registrations.

The more I thought about this, the more I liked the idea, until I liked
it so much that I actually went ahead and prototyped this.  A
preliminary version is below -- *very* lightly tested, and no doubt
there are obvious bugs that any real use or review will uncover.  But I
thought I'd throw it out and hope for comments and/or testing.

I'm actually pretty happy with how small and simple this ended up
being.  I'll reply to this message with a simple test program I've used
to sanity check this.

===

[PATCH] ummunot: Userspace support for MMU notifications

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.
Current pure-userspace solutions using malloc hooks and other tricks
are not robust, and the feeling among experts is that the issue is
unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot node.
A userspace process can open this node read-only and use the fd as
follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunot_event in
    <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
    for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.

NOT-Signed-off-by: Roland Dreier rola...@cisco.com
---
 drivers/char/Kconfig    |   12 ++
 drivers/char/Makefile   |    1 +
 drivers/char/ummunot.c  |  457 +++
 include/linux/ummunot.h |   85 +
 4 files changed, 555 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunot.c
 create mode 100644 include/linux/ummunot.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..91fe068 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,18 @@ config DEVPORT
 	depends on ISA || PCI
 	default y
 
+config UMMUNOT
+	tristate "Userspace MMU notifications"
+	select MMU_NOTIFIER
+	help
+	  The ummunot (userspace MMU notification) driver creates a
+	  character device that can be used by userspace libraries to
+	  get notifications when an application's memory mapping
+	  changed.  This is used, for example, by RDMA libraries to
+	  improve the reliability of memory registration caching, since
+	  the kernel's MMU notifications can be used to know precisely
+	  when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu

diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..dcbcd7c 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO)	+= cs5535_gpio.o
 obj-$(CONFIG_GPIO_VR41XX)	+= vr41xx_giu.o
 obj-$(CONFIG_GPIO_TB0219)	+= tb0219.o
 obj-$(CONFIG_TELCLOCK)		+= tlclk.o
+obj-$(CONFIG_UMMUNOT)		+= ummunot.o
 obj-$(CONFIG_MWAVE)		+= mwave/
 obj-$(CONFIG_AGP)		+= agp/

diff --git a/drivers/char/ummunot.c
Re: [ofa-general] Memory registration redux
Here's the test program:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/types.h>
#include <linux/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

#define UMMUNOT_INTF_VERSION 1

enum {
	UMMUNOT_EVENT_TYPE_INVAL = 0,
	UMMUNOT_EVENT_TYPE_LAST  = 1,
};

enum {
	UMMUNOT_EVENT_FLAG_HINT = 1 << 0,
};

/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)
 *
 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.
 */
struct ummunot_event {
	__u32 type;
	__u32 flags;
	__u64 hint_start;
	__u64 hint_end;
	__u64 user_cookie_counter;
};

struct ummunot_register_ioctl {
	__u32 intf_version;	/* in */
	__u32 reserved1;
	__u64 start;		/* in */
	__u64 end;		/* in */
	__u64 user_cookie;	/* in */
};

#define UMMUNOT_MAGIC			'U'
#define UMMUNOT_REGISTER_REGION		_IOWR(UMMUNOT_MAGIC, 1, \
					      struct ummunot_register_ioctl)
#define UMMUNOT_UNREGISTER_REGION	_IOW(UMMUNOT_MAGIC, 2, __u64)

static int umn_fd;
static volatile unsigned long long *umn_counter;

static int umn_init(void)
{
	umn_fd = open("/dev/ummunot", O_RDONLY);
	if (umn_fd < 0) {
		perror("open");
		return 1;
	}

	umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ,
			   MAP_SHARED, umn_fd, 0);
	if (umn_counter == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	return 0;
}

static int umn_register(void *buf, size_t size, __u64 cookie)
{
	struct ummunot_register_ioctl r = {
		.intf_version	= UMMUNOT_INTF_VERSION,
		.start		= (unsigned long) buf,
		.end		= (unsigned long) buf + size,
		.user_cookie	= cookie,
	};

	if (ioctl(umn_fd, UMMUNOT_REGISTER_REGION, &r)) {
		perror("ioctl");
		return 1;
	}

	return 0;
}

static int umn_unregister(__u64 cookie)
{
	if (ioctl(umn_fd, UMMUNOT_UNREGISTER_REGION, &cookie)) {
		perror("ioctl");
		return 1;
	}

	return 0;
}

int main(int argc, char *argv[])
{
	int page_size = sysconf(_SC_PAGESIZE);
	void *t;

	if (umn_init())
		return 1;

	if (*umn_counter != 0) {
		fprintf(stderr, "counter = %lld (expected 0)\n",
			*umn_counter);
		return 1;
	}

	t = mmap(NULL, 3 * page_size, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (umn_register(t, 3 * page_size, 123))
		return 1;

	munmap(t + page_size, page_size);

	printf("ummunot events: %lld\n", *umn_counter);

	if (*umn_counter > 0) {
		struct ummunot_event ev[2];
		int len;
		int i;

		len = read(umn_fd, ev, sizeof ev);
		printf("read %d events (%d tot)\n",
		       (int) (len / sizeof ev[0]), len);

		for (i = 0; i < (int) (len / sizeof ev[0]); ++i) {
			switch (ev[i].type) {
			case UMMUNOT_EVENT_TYPE_INVAL:
				printf("[%3d]: inval cookie %lld\n",
				       i, ev[i].user_cookie_counter);
				if (ev[i].flags & UMMUNOT_EVENT_FLAG_HINT)
					printf("       hint %llx...%llx\n",
					       ev[i].hint_start,
					       ev[i].hint_end);
				break;
			case UMMUNOT_EVENT_TYPE_LAST:
				printf("[%3d]: empty up to %lld\n",
				       i, ev[i].user_cookie_counter);
				break;
			default:
				printf("[%3d]: unknown event type %d\n",
				       i, ev[i].type);
				break;
			}
		}
	}

	umn_unregister(123);
	munmap(t, page_size);
	printf("ummunot events: %lld\n", *umn_counter);

	return 0;
}
Re: [ofa-general] Memory registration redux
On Tue, May 26, 2009 at 04:13:08PM -0700, Roland Dreier wrote:

>>>> Or, ignore the overlapping problem, and use your original
>>>> technique, slightly modified:
>>>>
>>>>  - Userspace registers a counter with the kernel.  Kernel pins the
>>>>    page, sets up mmu notifiers and increments the counter when
>>>>    invalidates intersect with registrations
>>>>  - Kernel maintains a linked list of registrations that have been
>>>>    invalidated via mmu notifiers using the registration structure
>>>>    and a dirty bit
>>>>  - Userspace checks the counter at every cache hit, if different
>>>>    it calls into the kernel:
>>>>
>>>>        MR_Cookie *mrs[100];
>>>>        int rc = ibv_get_invalid_mrs(mrs, 100);
>>>>        invalidate_cache(mrs, rc);
>>>>        // Repeat until drained
>>>>
>>>>    get_invalid_mrs traverses the linked list and returns an
>>>>    identifying value to userspace, which looks it up in the cache,
>>>>    calls unregister and removes it from the cache.
>>>
>>> What's the advantage of this?  I have to do the get_invalid_mrs()
>>> call a bunch of times, rather than just reading which ones are
>>> invalid from the cache directly?
>>
>> This is a trade off, the above is a more normal kernel API and lets
>> the app get a list of changes it can scan.  Having the kernel update
>> flags means if the app wants a list of changes it has to scan all
>> registrations.
>
> The more I thought about this, the more I liked the idea, until I
> liked it so much that I actually went ahead and prototyped this.  A
> preliminary version is below -- *very* lightly tested, and no doubt
> there are obvious bugs that any real use or review will uncover.  But
> I thought I'd throw it out and hope for comments and/or testing.
>
> I'm actually pretty happy with how small and simple this ended up
> being.

Seems reasonable to me.  This doesn't catch all mmap cases, ie this
kind of stuff:

	t = mmap(NULL, 3 * page_size, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (umn_register(t, 3 * page_size, 123))
		return 1;

	t = mmap(t, page_size, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	// Event?  Probably

	munmap(t, page_size);
	// Event?  No, no MAP_POPULATE

	t = mmap(t, page_size, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	// Event?  No

And I guess the use of MAP_POPULATE is deliberate as that's how the mmu
notifier works..

So the use model for an MPI would be to call ibv_register/umn_register
and watch for events.  Any event at all means the entire region is
toast and must be re-registered the next time someone calls with that
address.  ibv_register does the same as MAP_POPULATE internally..  The
MPI library uses the result of this to build a list of invalidated
regions.  From time to time the MPI library should unregister those
regions.  If that is the use then the kernel side should probably also
be a one-shot type of interface..

I'm also trying to think of a use case outside of RDMA and failing --
if the kernel hasn't pinned the pages being watched through some other
means it seems useless as a general feature??

Jason
Re: [ofa-general] Memory registration redux
On May 18, 2009, at 5:15 PM, Roland Dreier (rdreier) wrote:

> So you want the registration cache to be reference counted per-page?
> Seems like potentially a lot of overhead -- if someone registers a
> million pages, then to check for a cache hit, you have to potentially
> check millions of reference counts.

Our caches are hash tables of balanced red-black trees.  So in
practice, we won't be trolling through anywhere near a million
reference counts to find a hit.

> Hang on.  The whole point of MR caching is exactly that you don't
> unregister a memory region, even after you're done using the memory
> it covers, in the hope that you'll want to reuse that registration.
> And the whole point of this thread is that an application can then
> free() some of the memory that is still registered in the cache.

Sorry -- the implication that I took from Caitlin's text was that the
memory was *used* after it was freed.  That is clearly erroneous.

What OMPI does (and apparently other MPI's do) is simply invalidate any
registration for free'd memory.  Additionally, we won't unregister
memory while there is at least one use of it outstanding (that MPI
knows about, such as a pending non-blocking communication).  We lazily
unregister just for exactly the case you're talking about (might want
to use it for verbs communication again later).

>> Per my prior mail, Open MPI registers chunks at a time.  Each chunk
>> is potentially a multiple of pages.  So yes, you could end up having
>> a single registration that spans the buffers used in multiple,
>> distinct MPI sends.  We reference count by page to ensure that
>> deregistrations do not occur prematurely.
>
> Hmm, I'm worried that the exact semantics of the memory cache seem to
> be tied into how the MPI implementation is registering memory.  Open
> MPI happens to work in small chunks (I guess) and so your cache is
> tailored for that use case.  I know the original proposal was an
> attempt to come up with something that all the MPIs can agree on, but
> it didn't cover the full semantics, at least not for cases like the
> overlapping sub-registrations that we're discussing here.  Is there
> still one set of semantics everyone can agree on?

So just to be clear -- let's separate the two issues that are evolving
from this thread:

1. fix the hole where memory returned to the OS cannot be guaranteed to
   be caught by userspace (and therefore may still stay registered
   and/or invalidate userspace registration cache entries)

2. have libibverbs include some form of memory registration caching
   (potentially using the solution to #1 to know when to invalidate
   reg. cache entries)

Personally, I would prioritize the issues in this order.  Did a
solution for #1 get agreed upon?  I admit that I got lost in the kernel
discussion of issues between you, Jason, etc.  Agreeing on registration
caching semantics may take a little more discussion (although, as
someone pointed out earlier, if libibverbs' reg caching is optional,
then the verbs-based app can choose to use it or their own scheme).

-- 
Jeff Squyres
Cisco Systems
Re: [ofa-general] Memory registration redux
On May 7, 2009, at 5:58 PM, Roland Dreier (rdreier) wrote:

>> Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
>> releasing 0x2000-0x2fff.
>
> If everyone is doing this, how do you handle the case that Jason
> pointed out, namely:
>
>  * you register 0x1000 ... 0x3fff
>  * you want to register 0x2000 ... 0x2fff and have a cache hit
>  * you finish up with 0x1000 ... 0x3fff
>  * app does something (which is valid since you finished up with the
>    bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg
>    free() that leads to munmap() or whatever), and your hooks tell
>    you so.
>  * app reallocates a mapping in 0x3000 ... 0x3fff
>  * you want to re-register 0x1000 ... 0x3fff -- but it has to be
>    marked both invalid and in-use in the cache at this point !?

Sorry; this mail slipped by me and I just saw it now.

If this can actually happen -- that the mapping of 0x1000 ... 0x3fff
can change even though it is still registered, then we're screwed -- we
have no way of knowing that this is now invalid (Open MPI, at least --
can't speak for others).  Is there a way to detect this condition in
userspace?

-- 
Jeff Squyres
Cisco Systems
Re: [ofa-general] Memory registration redux
On Mon, May 18, 2009 at 9:24 AM, Jeff Squyres jsquy...@cisco.com wrote:

> On May 7, 2009, at 5:58 PM, Roland Dreier (rdreier) wrote:
>
>>> Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
>>> releasing 0x2000-0x2fff.
>>
>> If everyone is doing this, how do you handle the case that Jason
>> pointed out, namely:
>>
>>  * you register 0x1000 ... 0x3fff
>>  * you want to register 0x2000 ... 0x2fff and have a cache hit
>>  * you finish up with 0x1000 ... 0x3fff
>>  * app does something (which is valid since you finished up with the
>>    bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg
>>    free() that leads to munmap() or whatever), and your hooks tell
>>    you so.
>>  * app reallocates a mapping in 0x3000 ... 0x3fff
>>  * you want to re-register 0x1000 ... 0x3fff -- but it has to be
>>    marked both invalid and in-use in the cache at this point !?
>
> Sorry; this mail slipped by me and I just saw it now.  If this can
> actually happen -- that the mapping of 0x1000 ... 0x3fff can change
> even though it is still registered, then we're screwed -- we have no
> way of knowing that this is now invalid (Open MPI, at least -- can't
> speak for others).  Is there a way to detect this condition in
> userspace?

How does 0x1000 to 0x3fff get registered as a single Memory Region?  If
it is legitimate to free() 0x3000..0x3fff then how can there ever be a
legitimate reference to 0x1000..0x3fff?  If there is no such single
reference, I don't see how a Memory Region is ever created covering
that range.  If the user creates the Memory Region, then they are
responsible for not free()ing a portion of it.

Would the MPI library ever create a single large memory region based on
two distinct Sends?
Re: [ofa-general] Memory registration redux
On May 18, 2009, at 2:02 PM, Caitlin Bestler wrote:

>>>> Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
>>>> releasing 0x2000-0x2fff.
>>>
>>> If everyone is doing this, how do you handle the case that Jason
>>> pointed out, namely:
>>>
>>>  * you register 0x1000 ... 0x3fff
>>>  * you want to register 0x2000 ... 0x2fff and have a cache hit
>>>  * you finish up with 0x1000 ... 0x3fff
>>>  * app does something (which is valid since you finished up with
>>>    the bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg
>>>    free() that leads to munmap() or whatever), and your hooks tell
>>>    you so.
>>>  * app reallocates a mapping in 0x3000 ... 0x3fff
>>>  * you want to re-register 0x1000 ... 0x3fff -- but it has to be
>>>    marked both invalid and in-use in the cache at this point !?

I think I mis-parsed the above scenario in my previous response.  When
our memory hooks tell us that memory is about to be removed from the
process, we unregister all pages in the relevant region and remove
those entries from the cache.  So the next time you look in the cache
for 0x3000-0x3fff, it won't be there -- it'll be treated as cache-cold.

> How does 0x1000 to 0x3fff get registered as a single Memory Region?
> If it is legitimate to free() 0x3000..0x3fff then how can there ever
> be a legitimate reference to 0x1000..0x3fff?  If there is no such
> single reference, I don't see how a Memory Region is ever created
> covering that range.  If the user creates the Memory Region, then
> they are responsible for not free()ing a portion of it.

Agreed.  If an application does that, it deserves what it gets.

> Would the MPI library ever create a single large memory region based
> on two distinct Sends?

Per my prior mail, Open MPI registers chunks at a time.  Each chunk is
potentially a multiple of pages.  So yes, you could end up having a
single registration that spans the buffers used in multiple, distinct
MPI sends.  We reference count by page to ensure that deregistrations
do not occur prematurely.

For example, if page X contains the end of one large buffer and the
beginning of another, both of which are being used in ongoing
non-blocking MPI communications, then page X's entry in our cache will
have a refcount == 2.  OMPI won't allow the registration containing
that page to become eligible for deregistering until the cache entry's
refcount goes down to 0.  See my prior mail for a more complex example
of our cache's behavior.

-- 
Jeff Squyres
Cisco Systems
Re: [ofa-general] Memory registration redux
> When our memory hooks tell us that memory is about to be removed from
> the process, we unregister all pages in the relevant region and
> remove those entries from the cache.  So the next time you look in
> the cache for 0x3000-0x3fff, it won't be there -- it'll be treated as
> cache-cold.

So you want the registration cache to be reference counted per-page?
Seems like potentially a lot of overhead -- if someone registers a
million pages, then to check for a cache hit, you have to potentially
check millions of reference counts.

>> How does 0x1000 to 0x3fff get registered as a single Memory Region?
>> If it is legitimate to free() 0x3000..0x3fff then how can there ever
>> be a legitimate reference to 0x1000..0x3fff?  If there is no such
>> single reference, I don't see how a Memory Region is ever created
>> covering that range.  If the user creates the Memory Region, then
>> they are responsible for not free()ing a portion of it.
>
> Agreed.  If an application does that, it deserves what it gets.

Hang on.  The whole point of MR caching is exactly that you don't
unregister a memory region, even after you're done using the memory it
covers, in the hope that you'll want to reuse that registration.  And
the whole point of this thread is that an application can then free()
some of the memory that is still registered in the cache.

> Per my prior mail, Open MPI registers chunks at a time.  Each chunk
> is potentially a multiple of pages.  So yes, you could end up having
> a single registration that spans the buffers used in multiple,
> distinct MPI sends.  We reference count by page to ensure that
> deregistrations do not occur prematurely.

Hmm, I'm worried that the exact semantics of the memory cache seem to
be tied into how the MPI implementation is registering memory.  Open
MPI happens to work in small chunks (I guess) and so your cache is
tailored for that use case.  I know the original proposal was an
attempt to come up with something that all the MPIs can agree on, but
it didn't cover the full semantics, at least not for cases like the
overlapping sub-registrations that we're discussing here.  Is there
still one set of semantics everyone can agree on?

 - R.
Re: [ofa-general] Memory registration redux
On Tue, May 05, 2009 at 04:57:09PM -0400, Jeff Squyres wrote:

> Roland and I chatted on the phone today; I think I now understand
> Roland's counter-proposal (I clearly didn't before).  Let me try to
> summarize:
>
> 1. Add a new verb for "set this userspace flag to 1 if mr X ever
>    becomes invalid"
> 2. Add a new verb for "no longer tell me if mr X ever becomes
>    invalid" (i.e., remove the effects of #1)
> 3. Add run-time query indicating whether #1 works
> 4. Add [optional] memory registration caching to libibverbs
>
> Prior to talking to Roland, I had envisioned *one* flag in userspace
> that indicated whether any memory registrations had become invalid.
> Roland's idea is that there is one flag *per registration* -- you can
> instantly tell whether a specific registration is valid.
>
> Given this, let's keep the discussion going here in email -- perhaps
> the teleconference next Monday may become moot.

It looks like there has been more discussion on how to implement this
idea.  Are we still planning on having this teleconference today?

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
Re: [ofa-general] Memory registration redux
On Thu, May 7, 2009 at 3:48 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: Right, I was only thinking of a new driver call that was along the lines of update_mr_pages() that just updates the HCA's mapping with new page table entries atomically. It really would be device specific. If there is no call available then unregister/register + printk log is a fair generic implementation. To be clear, what I'm thinking is that this would only be invoked if Both the IBTA and RDMAC verbs were defined so that the meaning of L-Key/R-Key/STag + Address could not instantly change from X to Y, only from X to NULL and then NULL to Y. There are a lot of good reasons for this, especially for R-Keys or remotely accessible STags. It ensures that all operations that started when the translation was X are completed before any that will use the Y translation can commence. That is not something we want to accidentally undermine. There really isn't a reason why this rule needed to apply to entire Memory Regions. So I don't see a problem with allowing an update_mr_pages() verb that changes a portion of an MR map, perhaps by optimal machine specific hooks when available, without requiring the entire MR be specified. But it must preserve the guarantee that all operations initiated with translation X are completed before any operations for translation Y can be initiated. Preserving this guarantee should not be a problem for the free() then reallocate scenarios that have been discussed.
Re: [ofa-general] Memory registration redux
On Mon, May 11, 2009 at 02:23:58PM -0700, Caitlin Bestler wrote: On Thu, May 7, 2009 at 3:48 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: Right, I was only thinking of a new driver call that was along the lines of update_mr_pages() that just updates the HCA's mapping with new page table entries atomically. It really would be device specific. If there is no call available then unregister/register + printk log is a fair generic implementation. To be clear, what I'm thinking is that this would only be invoked if Both the IBTA and RDMAC verbs were defined so that the meaning of L-Key/R-Key/STag + Address could not instantly change from X to Y, only from X to NULL and then NULL to Y. Well, this is sort of a grey area, in one sense the meaning isn't changing, just the underlying physical memory is being moved around by the OS. The notion that the verbs refer to some sort of invisible underlying VM object is nice for an implementation but pretty useless for MPI.. There are a lot of good reasons for this, especially for R-Keys or remotely accessible STags. It ensures that all operations that started when the translation was X are completed before any that will use the Y translation can commence. That is not something we want to accidentally undermine. I'm not sure I see how this helps, synchronizing all this is the responsibility of the application, if it wants to change the mapping then it should be able to, and if it does so with poor timing then it will have races and lose data shrug. As it stands today there are already races where apps can lose data transferred after an unmap() or transfer the wrong data after a mmap() so the current model is already broken from that perspective.
Of course an update verb has to operate with similar ordering guarantees to register/unregister relative to the local work request queue - that is to say if the verb is done out-of-line with the WR queue then it must wait for the queue to flush before issuing the update to the HCA - just like unregister - and then wait for the verb to complete before returning to the app - just like register. And we all wish for userspace FRMRs... Jason
Re: [ofa-general] Memory registration redux
On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote: By the way, what's the desired behavior of the cache if a process registers, say, address range 0x1000 ... 0x3fff, and then the same process registers address range 0x2000 ... 0x2fff (with all the same permissions, etc)? The initial registration creates an MR that is still valid for the smaller virtual address range, so the second registration is much cheaper if we used the cached registration; but if we use the cache for the second registration, and then deregister the first one, we're stuck with a too-big range pinned in the cache because of the second registration. I don't know what the other MPI's do in this scenario, but here's what OMPI will do: 1. lookup 0x1000-0x3fff in the cache; not find any of it, and therefore register - add each page to our cache with a refcount of 1 2. lookup 0x2000-0x2fff in the cache, find that all the pages are already registered - refcount++ on each page in the cache 3. when we go to dereg 0x1000-0x3fff - refcount-- on each page in the cache - since some pages in the range still have refcount > 0, don't do anything further Specifically: the actual dereg of 0x1000-0x3fff is blocked on also releasing 0x2000-0x2fff. Note that OMPI will only register a max of X bytes at a time (where X defaults to 2MB). So even if a user calls MPI_SEND(...) with an enormous buffer, we'll register it X/page_size pages at a time, not the entire buffer at once. Hence, the buffer A is blocked from dereg'ing by buffer B scenario is *somewhat* mitigated -- it's less wasteful than if we had registered/cached the entire huge buffer at once. Finally, note that if 0x2000-0x2fff had not been registered, the 0x1000-0x3fff pages are not actually deregistered when all the pages' refcounts go to 0 -- they are just moved to the able to be dereg'ed list. We don't actually dereg it until we later try to reg new memory and fail due to lack of resources.
Then we take entries off the able to be dereg'ed list and dereg them, then try reg'ing the new memory again. MVAPICH: do you guys do similar things? (I don't know if HP/Scali/Intel will comment on their registration cache schemes) -- Jeff Squyres Cisco Systems
RE: [ofa-general] Memory registration redux
HP-MPI is doing pretty much the same thing. --CQ -Original Message- From: general-boun...@lists.openfabrics.org [mailto:general-boun...@lists.openfabrics.org] On Behalf Of Jeff Squyres Sent: Thursday, May 07, 2009 8:54 AM To: Roland Dreier Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General; Alexander Supalov Subject: Re: [ofa-general] Memory registration redux
RE: [ofa-general] Memory registration redux
MVAPICH is also doing pretty much the same thing. Matt On Thu, 7 May 2009, Tang, Changqing wrote: HP-MPI is doing pretty much the same thing. --CQ
Re: [ofa-general] Memory registration redux
No... every HCA just needs to support register and unregister. It doesn't have to support changing the mapping without full unregister and reregister. Well, I would imagine this entire process to be a HCA specific operation, so HW that supports a better method can use it, otherwise it has to register/unregister. Is this a concern today with existing HCAs? Using register/unregister exposes a race for the original case you brought up - but that race is completely unfixable without hardware support. At least it now becomes a hw specific race that can be printk'd and someday fixed in new HW rather than an unsolvable API problem.. We definitely don't want to duplicate all this logic in every hardware device driver, so most of it needs to be generic. If we're adding new low-level driver methods to handle this, that definitely raises the cost of implementing all this. But I guess if we start with a generic register/unregister fallback that drivers can override for better performance, then I think we're in good shape. Also this requires potentially walking the page tables of the entire process, checking to see if any mappings have changed. We really want to keep the information that the MMU notifiers give us, namely which virtual address range is changing. Walking the page tables of every registration in the process, not the entire process. Yes... but there are bugs in the bugzilla about mthca being limited to only 8 GB of registration by default or something like that, and having that break Intel MPI in some cases. So some MPI jobs want to have 10s of GBs of registered memory -- walking millions of page table entries for every resync operation seems like a big problem to me. Which means that the MMU notifier has to walk the list of memory registrations and mark any affected ones as dirty (possibly with a hint about which pages were invalidated) as you suggest below. Falling back to the check every registration ultra-slow-path I think should never ever happen. 
I was thinking more along the lines of having the mmu notifiers put affected registrations on a per-process (or PD?) dirty linked list, with the link pointers as part of the registration structure. Set a dirty flag in the registration too. An extra pointer per registration and a minor incremental cost to the existing work the mmu notifier would have to do. Yes, makes sense. Only part I don't immediately see is how to trap creation of new VM (ie mmap), mmu notifiers seem focused on invalidating, ie munmap().. Why do we care? The initial faulting in of mappings occurs when an MR is created. Well, exactly, that's the problem. If you can't trap mmap you cannot do the initial faulting and mapping for a new object that is being mapped into an existing MR. Consider: void *a = mmap(0,PAGE_SIZE..); ibv_register(); // [..] munmap(a); ibv_synchronize(); // At this point we want the HCA mapping to point to oblivion mmap(a,PAGE_SIZE,MAP_FIXED); ibv_synchronize(); // And now we want it to point to the new allocation I use MAP_FIXED to illustrate the point, but Jeff has said the same address re-use happens randomly in real apps. This can be handled I think, although at some cost. Just have the kernel keep track of which MMU sequence number actually invalidated each MR, and return (via ibv_synchronize()) the MMU change sequence number that userspace is in sync with. So in the example above, the first synchronize after munmap() will fail to fix up the first registration, since it is pointing to an unmapped virtual address, and hence it will leave that MR on the dirty list, and return that sequence number as not being synced up yet. And then the second synchronize will see that MR still on the dirty list, and try again to find the pages. Passing the sequence number back to userspace makes it possible for userspace to know that it still has to call ibv_synchronize() again.
There is the possibility that a 1GB MR will have its last page unmapped, and end up having 100s of thousands of pages walked again and again in every synchronize operation. This method avoids the problem you noticed, but there is extra work to fix up a registration that may never be used again. I strongly suspect that in the majority of cases this extra work should be about on the same order as userspace calling unregister on the MR. Yes, also it doesn't match the current MPI way of lazily unregistering things, and only garbage collecting the refcnt == 0 cache entries when a registration fails. With this method, if userspace unregisters something, it really is gone, and if it doesn't unregister it, then it really uses up space until userspace explicitly unregisters it. Not sure how MPI implementers feel about that. Or, ignore the overlapping problem, and use your original technique, slightly modified: - Userspace
Re: [ofa-general] Memory registration redux
I don't know what the other MPI's do in this scenario, but here's what OMPI will do: 1. lookup 0x1000-0x3fff in the cache; not find any of it, and therefore register - add each page to our cache with a refcount of 1 2. lookup 0x2000-0x2fff in the cache, find that all the pages are already registered - refcount++ on each page in the cache 3. when we go to dereg 0x1000-0x3fff - refcount-- on each page in the cache - since some pages in the range still have refcount > 0, don't do anything further Specifically: the actual dereg of 0x1000-0x3fff is blocked on also releasing 0x2000-0x2fff. If everyone is doing this, how do you handle the case that Jason pointed out, namely: * you register 0x1000 ... 0x3fff * you want to register 0x2000 ... 0x2fff and have a cache hit * you finish up with 0x1000 ... 0x3fff * app does something (which is valid since you finished up with the bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free() that leads to munmap() or whatever), and your hooks tell you so. * app reallocates a mapping in 0x3000 ... 0x3fff * you want to re-register 0x1000 ... 0x3fff -- but it has to be marked both invalid and in-use in the cache at this point !? - R.
Re: [ofa-general] Memory registration redux
On Thu, May 07, 2009 at 02:46:55PM -0700, Roland Dreier wrote: Using register/unregister exposes a race for the original case you brought up - but that race is completely unfixable without hardware support. At least it now becomes a hw specific race that can be printk'd and someday fixed in new HW rather than an unsolvable API problem.. We definitely don't want to duplicate all this logic in every hardware device driver, so most of it needs to be generic. If we're adding new low-level driver methods to handle this, that definitely raises the cost of implementing all this. But I guess if we start with a generic register/unregister fallback that drivers can override for better performance, then I think we're in good shape. Right, I was only thinking of a new driver call that was along the lines of update_mr_pages() that just updates the HCA's mapping with new page table entries atomically. It really would be device specific. If there is no call available then unregister/register + printk log is a fair generic implementation. To be clear, what I'm thinking is that this would only be invoked if the VM is being *replaced*. Simply unmaping VM should do nothing. Which means that the MMU notifier has to walk the list of memory registrations and mark any affected ones as dirty (possibly with a hint about which pages were invalidated) as you suggest below. Falling back to the check every registration ultra-slow-path I think should never ever happen. Yikes, yes, that makes sense. And hearing that at least openmpi caps the registration size makes me think per-page granularity is probably unnecessary to track. Well, exactly, that's the problem. If you can't trap mmap you cannot do the initial faulting and mapping for a new object that is being mapped into an existing MR. Consider: void *a = mmap(0,PAGE_SIZE..); ibv_register(); // [..]
munmap(a); ibv_synchronize(); // At this point we want the HCA mapping to point to oblivion mmap(a,PAGE_SIZE,MAP_FIXED); ibv_synchronize(); // And now we want it to point to the new allocation I use MAP_FIXED to illustrate the point, but Jeff has said the same address re-use happens randomly in real apps. This can be handled I think, although at some cost. Just have the kernel keep track of which MMU sequence number actually invalidated each MR, and return (via ibv_synchronize()) the MMU change sequence number that userspace is in sync with. So in the example above, the first synchronize after munmap() will fail to fix up the first registration, since it is pointing to an unmapped virtual address, and hence it will leave that MR on the dirty list, and return that sequence number as not being synced up yet. And then the second synchronize will see that MR still on the dirty list, and try again to find the pages. I agree some kind of kernel/userspace exchange of the sequence number is necessary to make all the locking and race conditions work out. But the problem I'm seeing is how does the sequence number get incremented by the kernel after the mmap() call in the above sequence? Which mmu_notifier/etc call back do you hook for that? The *very best* hook would be one that is called when a mm has new virtual address space allocated and the verbs layer would then take the allocated address range and intersect it with the registration list. Any registrations that have pages in the allocated region are marked invalid. Imagine every call to ibv_synchronize was prefixed with a check that the sequence number is changed. This method avoids the problem you noticed, but there is extra work to fix up a registration that may never be used again. I strongly suspect that in the majority of cases this extra work should be about on the same order as userspace calling unregister on the MR.
Yes, also it doesn't match the current MPI way of lazily unregistering things, and only garbage collecting the refcnt 0 cache entries when a registration fails. With this method, if userspace unregisters something, it really is gone, and if it doesn't unregister it, then it really uses up space until userspace explicitly unregisters it. Not sure how MPI implementers feel about that. Well, mixing the lazy unregister in is not a significant change, just don't increment the sequence number on munmap and have the kernel do nothing until pages are mapped into an existing registration. With a flag both behaviors are possible. All of this work is mainly to close the hole where mapping new memory over already registered VM results in RDMA to the wrong pages. Fixing this hole removes the need to trap memory management syscalls and solves that data corruption problem. From there various optimizations can be done, like lazy garbage collecting registrations that no longer point to mapped memory. Or, ignore the overlapping problem, and use your original technique,
Re: [ofa-general] Memory registration redux
Jeff Squyres wrote: Roland and I chatted on the phone today; I think I now understand Roland's counter-proposal (I clearly didn't before). Let me try to summarize: 1. Add a new verb for set this userspace flag to 1 if mr X ever becomes invalid 2. Add a new verb for no longer tell me if mr X ever becomes invalid (i.e., remove the effects of #1) 3. Add run-time query indicating whether #1 works 4. Add [optional] memory registration caching to libibverbs Prior to talking to Roland, I had envisioned *one* flag in userspace that indicated whether any memory registrations had become invalid. Roland's idea is that there is one flag *per registration* -- you can instantly tell whether a specific registration is valid. Given this, let's keep the discussion going here in email -- perhaps the teleconference next Monday may become moot. I think the new proposal is good (but I am not MPI expert). If we implement it soon we will be able to enable it in OFED 1.5 too. I think the cache in libibverbs can be delayed since it can be added after the API in the kernel is available. Tziporet
Re: [ofa-general] Memory registration redux
On May 6, 2009, at 10:09 AM, Tziporet Koren wrote: I think the new proposal is good (but I am not MPI expert) If we implement it soon we will be able to enable it in OFED 1.5 too That sounds good, as long as we don't diverge from upstream (like what happened with XRC). I think the cache in libibverbs can be delayed since it can be added after the API in the kernel is available Fair enough. -- Jeff Squyres Cisco Systems
Re: [ofa-general] Memory registration redux
Roland and I chatted on the phone today; I think I now understand Roland's counter-proposal (I clearly didn't before). Let me try to summarize: 1. Add a new verb for set this userspace flag to 1 if mr X ever becomes invalid 2. Add a new verb for no longer tell me if mr X ever becomes invalid (i.e., remove the effects of #1) 3. Add run-time query indicating whether #1 works 4. Add [optional] memory registration caching to libibverbs Looking closer at how to actually implement this, I see that the MMU notifiers (cf linux/mmu_notifier.h) may be called with locks held, so the kernel can't do a put_user() or the equivalent from the notifier. Therefore I think the interface we would expose to userspace would be something more like mmap() on some special file to get some kernel memory mapped into userspace, and then ioctl() to register/unregister a set this flag if address range X...Y is affected. To be honest I don't really love this idea -- the kernel still needs a fairly complicated data structure to efficiently track the address ranges being tracked, the size of the mmap() limits the number of ranges being tracked based on a static limit set at initialization time (or handling multiple maps gets still more complex), and there is some careful thinking required to make sure there are no memory ordering or cache aliasing issues. So then I thought some about how to implement the full MR cache in the kernel. And that fairly quickly gets into some complex stuff as well -- for example, since we can't take sleeping locks from MMU notifiers, but we can't hold non-sleeping locks across MR register operations, we need to drop our MR cache lock while registering things, which forces us to deal with rolling back registrations if we miss the cache initially but then find that another thread has already added a registration to the cache while we were trying to register the same memory. 
Keeping the actual MR caching in userspace does seem to make things simpler because the locking is much easier without having to worry about sleeping vs. non-sleeping locks. Also doing the cache in userspace with my flag idea above has the nice property that the fast path of hitting the cache on memory registration has no system call and in fact testing the flag may even be a CPU cache hit if memory registration is a hot enough path. Doing it in the kernel means even the best case has a system call -- which is very cheap with current CPUs but still a non-zero cost. So I'm really not sure what the right way to go is yet. Further opinions would be helpful. - R.
Re: [ofa-general] Memory registration redux
By the way, what's the desired behavior of the cache if a process registers, say, address range 0x1000 ... 0x3fff, and then the same process registers address range 0x2000 ... 0x2fff (with all the same permissions, etc)? The initial registration creates an MR that is still valid for the smaller virtual address range, so the second registration is much cheaper if we used the cached registration; but if we use the cache for the second registration, and then deregister the first one, we're stuck with a too-big range pinned in the cache because of the second registration. - R.
Re: [ofa-general] Memory registration redux
On Wed, May 06, 2009 at 01:10:47PM -0700, Roland Dreier wrote: By the way, what's the desired behavior of the cache if a process registers, say, address range 0x1000 ... 0x3fff, and then the same process registers address range 0x2000 ... 0x2fff (with all the same permissions, etc)? The initial registration creates an MR that is still valid for the smaller virtual address range, so the second registration is much cheaper if we used the cached registration; but if we use the cache for the second registration, and then deregister the first one, we're stuck with a too-big range pinned in the cache because of the second registration. Yuk, doesn't this problem pretty much doom this method entirely? You can't tear down the entire registration of 0x1000 ... 0x3fff if the app does something to change 0x2000 .. 0x2fff because it may have active RDMAs going on in 0x1000 ... 0x1fff. The above could happen through strange use of brk. What about a slightly different twist.. Instead of trying to make everything synchronous in the mmu_notifier, just have a counter mapped to user space. Increment the counter whenever the mms change from the notifier. Pin the user page that contains the single counter upon starting the process so access is lockless. In user space, check the counter before every cache lookup and if it has changed call back into the kernel to resynchronize the MR tables in the HCA to the current VM. Avoids the locking and racing problems, keeps the fast path in the user space and avoids the above question about how to deal with arbitrary actions? Jason
Re: [ofa-general] Memory registration redux
Yuk, doesn't this problem pretty much doom this method entirely? You can't tear down the entire registration of 0x1000 ... 0x3fff if the app does something to change 0x2000 .. 0x2fff because it may have active RDMAs going on in 0x1000 ... 0x1fff. Yes, I guess if we try to reuse registrations like this then we run into trouble. I think your example points to a problem if an app registers 0x1000...0x3fff and then we reuse that registration for 0x2000...0x2fff and also for 0x1000...0x1fff, and then the app unregisters 0x1000...0x3fff. But we can get around this just by not ever reusing registrations that way -- only treat something as a cache hit if it matches the start and length exactly. What about a slightly different twist.. Instead of trying to make everything synchronous in the mmu_notifier, just have a counter mapped to user space. Increment the counter whenever the mms change from the notifier. Pin the user page that contains the single counter upon starting the process so access is lockless. In user space, check the counter before every cache lookup and if it has changed call back into the kernel to resynchronize the MR tables in the HCA to the current VM. Avoids the locking and racing problems, keeps the fast path in the user space and avoids the above question about how to deal with arbitrary actions? I like the simplicity of the fast path. But it seems the slow path would be hard -- how exactly did you envision resynchronizing the MR tables? (Considering that RDMAs might be in flight for MRs that weren't changed by the MM operations) - R.
Re: [ofa-general] Memory registration redux
On Wed, May 06, 2009 at 02:56:25PM -0700, Roland Dreier wrote:

> Yes, I guess if we try to reuse registrations like this then we run
> into trouble. [...] But we can get around this just by never reusing
> registrations that way -- only treat something as a cache hit if it
> matches the start and length exactly.

I can't comment on that, but it feels to me like a reasonable MPI use model would be to do small IOs randomly from the same allocation, and to pre-hint to the library that it wants that whole area cached in one shot.

> I like the simplicity of the fast path. But it seems the slow path
> would be hard -- how exactly did you envision resynchronizing the MR
> tables? (Considering that RDMAs might be in flight for MRs that
> weren't changed by the MM operations.)

Well, this conceptually doesn't seem hard. Go through all the pages in the MR; if any have changed, pin the new page and replace the page's physical address in the HCA's page table. Once done, synchronize with the hardware, then run through again and un-pin and release all the replaced pages. Every HCA must already have the necessary primitives for this, since they are needed to support register and unregister...

An RDMA that is in progress to any page that is replaced is a 'use after free' type programming error. (And this means certain wacky uses, like using MAP_FIXED on memory that is under active RDMA, would be unsupported without an additional call.)

Doing this on a page-by-page basis rather than on a registration-by-registration basis is granular enough to avoid the problem you noticed. The mmu notifiers can optionally make note of the affected pages during the callback to reduce the workload of the syscall.

The only part I don't immediately see is how to trap creation of new VM (i.e. mmap); mmu notifiers seem focused on invalidating, i.e. munmap().

Jason
Re: [ofa-general] Memory registration redux
> Well, this conceptually doesn't seem hard. Go through all the pages in
> the MR; if any have changed, pin the new page and replace the page's
> physical address in the HCA's page table. Once done, synchronize with
> the hardware, then run through again and un-pin and release all the
> replaced pages. Every HCA must already have the necessary primitives
> for this, since they are needed to support register and unregister...

No... every HCA just needs to support register and unregister. It doesn't have to support changing the mapping without a full unregister and reregister. Also, this potentially requires walking the page tables of the entire process, checking to see whether any mappings have changed. We really want to keep the information that the MMU notifiers give us, namely which virtual address range is changing.

> The mmu notifiers can optionally make note of the affected pages
> during the callback to reduce the workload of the syscall.

Naively, this requires an unbounded number of events to be queued up in the kernel. (If we lose some events then we have to go back to the full page table scan, which I don't think is feasible.)

> The only part I don't immediately see is how to trap creation of new
> VM (i.e. mmap); mmu notifiers seem focused on invalidating, i.e.
> munmap().

Why do we care? The initial faulting in of mappings occurs when an MR is created.

 - R.
Re: [ofa-general] Memory registration redux
On Wed, May 06, 2009 at 03:39:54PM -0700, Roland Dreier wrote:

> No... every HCA just needs to support register and unregister. It
> doesn't have to support changing the mapping without a full unregister
> and reregister.

Well, I would imagine this entire process to be an HCA-specific operation, so hardware that supports a better method can use it; otherwise it has to register/unregister. Is this a concern today with existing HCAs? Using register/unregister exposes a race for the original case you brought up -- but that race is completely unfixable without hardware support. At least it now becomes a hardware-specific race that can be printk'd and someday fixed in new hardware, rather than an unsolvable API problem.

> Also, this potentially requires walking the page tables of the entire
> process, checking to see whether any mappings have changed. We really
> want to keep the information that the MMU notifiers give us, namely
> which virtual address range is changing.

Walking the page tables of every registration in the process, not the entire process.

> Naively, this requires an unbounded number of events to be queued up
> in the kernel. (If we lose some events then we have to go back to the
> full page table scan, which I don't think is feasible.)

I was thinking more along the lines of having the mmu notifiers put affected registrations on a per-process (or per-PD?) dirty linked list, with the link pointers as part of the registration structure. Set a dirty flag in the registration too. That's an extra pointer per registration and a minor incremental cost to the existing work the mmu notifier would have to do.

> Why do we care? The initial faulting in of mappings occurs when an MR
> is created.

Well, exactly -- that's the problem. If you can't trap mmap, you cannot do the initial faulting and mapping for a new object that is being mapped into an existing MR. Consider:

	void *a = mmap(0, PAGE_SIZE, ...);
	ibv_register();
	/* [..] */

	munmap(a);
	ibv_synchronize();
	/* At this point we want the HCA mapping to point to oblivion. */

	mmap(a, PAGE_SIZE, ..., MAP_FIXED, ...);
	ibv_synchronize();
	/* And now we want it to point to the new allocation. */

I use MAP_FIXED to illustrate the point, but Jeff has said the same address reuse happens randomly in real apps.

This is the main deviation from your original idea: instead of a granular notification to userspace to unregister a region, the kernel just goes and fixes things up so the existing registration still works. This method avoids the problem you noticed, but there is extra work to fix up a registration that may never be used again. I strongly suspect that in the majority of cases this extra work is about on the same order as userspace calling unregister on the MR.

Or, ignore the overlapping problem and use your original technique, slightly modified:

- Userspace registers a counter with the kernel. The kernel pins the page, sets up mmu notifiers, and increments the counter when invalidates intersect with registrations.
- The kernel maintains a linked list of registrations that have been invalidated via mmu notifiers, using the registration structure and a dirty bit.
- Userspace checks the counter at every cache hit; if it has changed, it calls into the kernel:

	MR_Cookie *mrs[100];
	int rc = ibv_get_invalid_mrs(mrs, 100);
	invalidate_cache(mrs, rc);
	/* Repeat until drained. */

get_invalid_mrs traverses the linked list and returns an identifying value to userspace, which looks it up in the cache, calls unregister, and removes it from the cache.

Jason