RE: [ofa-general] Memory registration redux

2009-06-16 Thread Woodruff, Robert J
Hey Roland,

One question from my MPI guys.  Looks like you have added the ability
to have more than one version of the device to allow future versions,
i.e., the .intf_version in the register call. 

struct ummunot_register_ioctl r = {
.intf_version = UMMUNOT_INTF_VERSION,
.start        = (unsigned long) buf,
.end          = (unsigned long) buf + size,
.user_cookie  = cookie,
};


I assume there will be some ioctl to allow a program to discover at runtime
the version(s) of the device that are supported on a particular system?

woody
 

-Original Message-
From: general-boun...@lists.openfabrics.org 
[mailto:general-boun...@lists.openfabrics.org] On Behalf Of Roland Dreier
Sent: Tuesday, May 26, 2009 4:14 PM
To: Jason Gunthorpe
Cc: Pavel Shamis; Hans Westgaard Ry; Dontje; Lenny Verkhovsky; Håkon Bugge; 
Donald Kerr; OpenFabrics General; Supalov, Alexander
Subject: Re: [ofa-general] Memory registration redux

Here's the test program:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/types.h>
#include <linux/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

#define UMMUNOT_INTF_VERSION 1

enum {
UMMUNOT_EVENT_TYPE_INVAL = 0,
UMMUNOT_EVENT_TYPE_LAST  = 1,
};

enum {
UMMUNOT_EVENT_FLAG_HINT = 1 << 0,
};

/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)
 *
 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.
 */
struct ummunot_event {
__u32   type;
__u32   flags;
__u64   hint_start;
__u64   hint_end;
__u64   user_cookie_counter;
};

struct ummunot_register_ioctl {
__u32   intf_version;   /* in */
__u32   reserved1;
__u64   start;  /* in */
__u64   end;    /* in */
__u64   user_cookie;/* in */
};

#define UMMUNOT_MAGIC   'U'

#define UMMUNOT_REGISTER_REGION _IOWR(UMMUNOT_MAGIC, 1, \
  struct ummunot_register_ioctl)
#define UMMUNOT_UNREGISTER_REGION   _IOW(UMMUNOT_MAGIC, 2, __u64)

static int umn_fd;
static volatile unsigned long long *umn_counter;

static int umn_init(void)
{
umn_fd = open("/dev/ummunot", O_RDONLY);
if (umn_fd < 0) {
perror("open");
return 1;
}

umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ,
   MAP_SHARED, umn_fd, 0);
if (umn_counter == MAP_FAILED) {
perror("mmap");
return 1;
}

return 0;
}

static int umn_register(void *buf, size_t size, __u64 cookie)
{
struct ummunot_register_ioctl r = {
.intf_version = UMMUNOT_INTF_VERSION,
.start        = (unsigned long) buf,
.end          = (unsigned long) buf + size,
.user_cookie  = cookie,
};

if (ioctl(umn_fd, UMMUNOT_REGISTER_REGION, &r)) {
perror("ioctl");
return 1;
}

return 0;
}

static int umn_unregister(__u64 cookie)
{
if (ioctl(umn_fd, UMMUNOT_UNREGISTER_REGION, &cookie)) {
perror("ioctl");
return 1;
}

return 0;
}

int main(int argc, char *argv[])
{
int page_size = sysconf(_SC_PAGESIZE);
void *t;

if (umn_init())
return 1;

if (*umn_counter != 0) {
fprintf(stderr, "counter = %lld (expected 0)\n", *umn_counter);
return 1;
}

t = mmap(NULL, 3 * page_size, PROT_READ,
 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

if (umn_register(t, 3 * page_size, 123))
return 1;

munmap(t + page_size, page_size);

printf("ummunot events: %lld\n", *umn_counter);

if (*umn_counter > 0) {
struct ummunot_event ev[2];
int len;
int i;

len = read(umn_fd, ev, sizeof ev);
printf("read %d events (%d tot)\n", (int) (len / sizeof ev[0]), len);

for (i = 0; i < len / sizeof ev[0]; ++i) {
switch (ev[i].type) {
case UMMUNOT_EVENT_TYPE_INVAL:
printf("[%3d]: inval cookie %lld\n",
       i, ev[i].user_cookie_counter);
if (ev[i].flags & UMMUNOT_EVENT_FLAG_HINT)
printf("  hint %llx...%llx\n",
       ev[i].hint_start, ev[i].hint_end);
break;
case UMMUNOT_EVENT_TYPE_LAST:
printf("[%3d]: empty up to %lld\n",
       i, ev[i].user_cookie_counter);
break;
default:
printf("[%3d]: unknown event type %d\n",
       i, ev[i].type);
break;
}
}
}

umn_unregister(123);

return 0;
}

Re: [ofa-general] Memory registration redux

2009-06-16 Thread Roland Dreier

  I assume there will be some ioctl to allow a program to discover at runtime
  the version(s) of the device that are supported on a particular system?

Yeah, I guess.  I haven't really thought the forward-compat story
through completely yet.
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [ofa-general] Memory registration redux

2009-06-08 Thread Supalov, Alexander
Hi,

Intel MPI developers are in principle OK with this proposal. What way of 
delivery is envisioned? Will this become a part of OFED or of the mainstream 
kernel? How fast will it spread? Are there any comparable Windows plans?

Best regards.

Alexander 

-Original Message-
From: Supalov, Alexander 
Sent: Wednesday, June 03, 2009 12:26 PM
To: 'Roland Dreier'
Cc: Jeff Squyres; Pavel Shamis; Hans Westgaard Ry; Dontje; Lenny Verkhovsky; 
Håkon Bugge; Donald Kerr; OpenFabrics General
Subject: RE: [ofa-general] Memory registration redux

Thanks. This is what I was looking for. Let me pass this by the key Intel MPI 
developers and get back to you.

-Original Message-
From: Roland Dreier [mailto:rdre...@cisco.com] 
Sent: Tuesday, June 02, 2009 6:45 PM
To: Supalov, Alexander
Cc: Jeff Squyres; Pavel Shamis; Hans Westgaard Ry; Dontje; Lenny Verkhovsky; 
Håkon Bugge; Donald Kerr; OpenFabrics General
Subject: Re: [ofa-general] Memory registration redux


  Sorry, it's kind of difficult to deduce looking at this Q&A sequence
  what works how and when. Is it possible to create a brief and direct
  description of the proposed solution?

Did you see the original patch description I sent:

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
kernel (cf struct ummunot_register_ioctl in linux/ummunot.h).

 2. read() to retrieve events generated when a mapping in a watched
address range is invalidated (cf struct ummunot_event in
linux/ummunot.h).  select()/poll()/epoll() and SIGIO are handled
for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
generation counter that is incremented each time an event is
generated.  This allows userspace to have a fast path that checks
that no events have occurred without a system call.
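The step-3 fast path can be sketched as below.  The helper name and the
caching policy are illustrative, not part of the proposed interface; the
only assumption is that "counter" points at the read-only page mmap()ed
at offset 0 of the /dev/ummunot fd:

```c
/* Illustrative fast-path check for a registration cache.
 * umn_cache_is_valid() and last_seen are hypothetical names. */
typedef unsigned long long u64;

/* Returns 1 if no events have occurred since *last_seen was recorded,
 * so cached registrations may be trusted without any system call.
 * Returns 0 if the caller must read() and process pending events;
 * *last_seen is then updated here for simplicity (a real consumer
 * would instead take the generation from the LAST event it reads, to
 * avoid racing with events arriving between check and drain). */
static int umn_cache_is_valid(const volatile u64 *counter, u64 *last_seen)
{
        u64 now = *counter;     /* one read of the shared kernel page */

        if (now == *last_seen)
                return 1;       /* fast path: mappings unchanged */

        *last_seen = now;       /* slow path: caller drains events */
        return 0;
}
```

On the common path this is a single memory read, which is the whole
point of exposing the generation counter via mmap().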
-
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.



Re: [ofa-general] Memory registration redux

2009-06-08 Thread Tziporet Koren

Supalov, Alexander wrote:

Hi,

Intel MPI developers are in principle OK with this proposal. What way of delivery is envisioned? Will this become a part of OFED or of the mainstream kernel? 

Roland is planning to push it into kernel 2.6.31,
and OFED will take it from the kernel.
We will check if we can do backports for distros. I assume it will be 
available only for distros that have the MMU notifiers in the kernel.

How fast will it spread? Are there any comparable Windows plans?

  

I can't answer for Windows.

Tziporet



RE: [ofa-general] Memory registration redux

2009-06-08 Thread Sean Hefty
Are there any comparable Windows plans?

I believe that Windows already provides equivalent functionality as part of
the OS (Windows 2008 / Vista).  See SecureMemoryCacheCallback.  There are no
plans for WinOF to provide anything separately from this.  (It's likely
impossible without OS support anyway.)

- Sean



RE: [ofa-general] Memory registration redux

2009-06-03 Thread Supalov, Alexander
Thanks. This is what I was looking for. Let me pass this by the key Intel MPI 
developers and get back to you.

-Original Message-
From: Roland Dreier [mailto:rdre...@cisco.com] 
Sent: Tuesday, June 02, 2009 6:45 PM
To: Supalov, Alexander
Cc: Jeff Squyres; Pavel Shamis; Hans Westgaard Ry; Dontje; Lenny Verkhovsky; 
Håkon Bugge; Donald Kerr; OpenFabrics General
Subject: Re: [ofa-general] Memory registration redux


  Sorry, it's kind of difficult to deduce looking at this Q&A sequence
  what works how and when. Is it possible to create a brief and direct
  description of the proposed solution?

Did you see the original patch description I sent:

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
kernel (cf struct ummunot_register_ioctl in linux/ummunot.h).

 2. read() to retrieve events generated when a mapping in a watched
address range is invalidated (cf struct ummunot_event in
linux/ummunot.h).  select()/poll()/epoll() and SIGIO are handled
for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
generation counter that is incremented each time an event is
generated.  This allows userspace to have a fast path that checks
that no events have occurred without a system call.



RE: [ofa-general] Memory registration redux

2009-06-02 Thread Supalov, Alexander
Hi,

Sorry, it's kind of difficult to deduce looking at this Q&A sequence what works 
how and when. Is it possible to create a brief and direct description of the 
proposed solution?

Best regards.

Alexander

-Original Message-
From: Jeff Squyres [mailto:jsquy...@cisco.com] 
Sent: Wednesday, May 27, 2009 9:03 PM
To: Roland Dreier (rdreier)
Cc: Pavel Shamis; Hans Westgaard Ry; Dontje; Lenny Verkhovsky; Håkon Bugge; 
Donald Kerr; OpenFabrics General; Supalov, Alexander
Subject: Re: [ofa-general] Memory registration redux

Other MPI implementors -- what do you think of this scheme?


On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:


/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)

   I don't quite grok this.  Is the intent that HINT will only be set if
   an *entire* hint_start/hint_end range is invalidated by a single
   event?  I.e., if only part of the hint_start/hint_end range is
   invalidated, you'll get the cookie back, but not what part of the
   range is invalid (because assumedly the entire IBV registration is now
   invalid anyway)?

Basically, I just keep one hint_start/hint_end.  If multiple events hit
the same registration then I just give up and don't give you a hint.

 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.

   Just to be clear -- we're supposed to keep reading events until we get
   a LAST event?

Yes, that's probably the sanest use case.

   1. Will it increase by 1 each time a page (or set of pages?) is
   removed from a user process?

As it stands it increases by 1 every time there is an MMU notification,
even if that notification hits multiple registrations.  It wouldn't be
hard to change that to count the number of events generated if that
works better.

   2. Does it change if pages are *added* to a user process?  I.e., does
   the counter indicate *removals* or *changes* to the user process page
   table?

No, additions don't trigger any MMU notification -- that's inherent in
the design of the MMU notifiers stuff.  The idea is that you have a
secondary MMU and MMU notifications are the equivalent of TLB
shootdowns; the secondary MMU is responsible for populating itself on
faults etc.

   Is the *umn_counter value guaranteed to have been changed by the time
   munmap() returns?

Yes.

   Did you pick [2] here simply because you're only expecting an INVAL
   and a LAST event in this specific example?  I'm assuming that we
   should normally loop over reading until we get LAST, correct?

Right.

   What happens if I register multiple regions with the same cookie value?

You get in trouble -- I need to fix things to reject duplicated cookies
actually, because otherwise there's no way to unregister.

   Is a process responsible for guaranteeing that it umn_unregister()s
   everything before exiting, or will all pending registrations be
   cleaned up/unregistered/whatever when a process exits?

The kernel cleans up of course to handle crashes etc.

 - R.



-- 
Jeff Squyres
Cisco Systems




Re: [ofa-general] Memory registration redux

2009-06-02 Thread Roland Dreier

  Sorry, it's kind of difficult to deduce looking at this Q&A sequence
  what works how and when. Is it possible to create a brief and direct
  description of the proposed solution?

Did you see the original patch description I sent:

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
kernel (cf struct ummunot_register_ioctl in linux/ummunot.h).

 2. read() to retrieve events generated when a mapping in a watched
address range is invalidated (cf struct ummunot_event in
linux/ummunot.h).  select()/poll()/epoll() and SIGIO are handled
for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
generation counter that is incremented each time an event is
generated.  This allows userspace to have a fast path that checks
that no events have occurred without a system call.


Re: [ofa-general] Memory registration redux

2009-05-29 Thread Hans Westgaard Ry

The scheme looks fine to me!

Hans W. Ry

Jeff Squyres skrev:

Other MPI implementors -- what do you think of this scheme?


On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:



/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)

   I don't quite grok this.  Is the intent that HINT will only be set if
   an *entire* hint_start/hint_end range is invalidated by a single
   event?  I.e., if only part of the hint_start/hint_end range is
   invalidated, you'll get the cookie back, but not what part of the
   range is invalid (because assumedly the entire IBV registration is now
   invalid anyway)?

Basically, I just keep one hint_start/hint_end.  If multiple events hit
the same registration then I just give up and don't give you a hint.

 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.

   Just to be clear -- we're supposed to keep reading events until we get
   a LAST event?

Yes, that's probably the sanest use case.

   1. Will it increase by 1 each time a page (or set of pages?) is
   removed from a user process?

As it stands it increases by 1 every time there is an MMU notification,
even if that notification hits multiple registrations.  It wouldn't be
hard to change that to count the number of events generated if that
works better.

   2. Does it change if pages are *added* to a user process?  I.e., does
   the counter indicate *removals* or *changes* to the user process page
   table?

No, additions don't trigger any MMU notification -- that's inherent in
the design of the MMU notifiers stuff.  The idea is that you have a
secondary MMU and MMU notifications are the equivalent of TLB
shootdowns; the secondary MMU is responsible for populating itself on
faults etc.

   Is the *umn_counter value guaranteed to have been changed by the time
   munmap() returns?

Yes.

   Did you pick [2] here simply because you're only expecting an INVAL
   and a LAST event in this specific example?  I'm assuming that we
   should normally loop over reading until we get LAST, correct?

Right.

   What happens if I register multiple regions with the same cookie value?

You get in trouble -- I need to fix things to reject duplicated cookies
actually, because otherwise there's no way to unregister.

   Is a process responsible for guaranteeing that it umn_unregister()s
   everything before exiting, or will all pending registrations be
   cleaned up/unregistered/whatever when a process exits?

The kernel cleans up of course to handle crashes etc.

 - R.








Re: [ofa-general] Memory registration redux

2009-05-28 Thread Pavel Shamis (Pasha)

Sounds good to me.

Jeff Squyres wrote:

Other MPI implementors -- what do you think of this scheme?


On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:



/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)

   I don't quite grok this.  Is the intent that HINT will only be set if
   an *entire* hint_start/hint_end range is invalidated by a single
   event?  I.e., if only part of the hint_start/hint_end range is
   invalidated, you'll get the cookie back, but not what part of the
   range is invalid (because assumedly the entire IBV registration is now
   invalid anyway)?

Basically, I just keep one hint_start/hint_end.  If multiple events hit
the same registration then I just give up and don't give you a hint.

 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.

   Just to be clear -- we're supposed to keep reading events until we get
   a LAST event?

Yes, that's probably the sanest use case.

   1. Will it increase by 1 each time a page (or set of pages?) is
   removed from a user process?

As it stands it increases by 1 every time there is an MMU notification,
even if that notification hits multiple registrations.  It wouldn't be
hard to change that to count the number of events generated if that
works better.

   2. Does it change if pages are *added* to a user process?  I.e., does
   the counter indicate *removals* or *changes* to the user process page
   table?

No, additions don't trigger any MMU notification -- that's inherent in
the design of the MMU notifiers stuff.  The idea is that you have a
secondary MMU and MMU notifications are the equivalent of TLB
shootdowns; the secondary MMU is responsible for populating itself on
faults etc.

   Is the *umn_counter value guaranteed to have been changed by the time
   munmap() returns?

Yes.

   Did you pick [2] here simply because you're only expecting an INVAL
   and a LAST event in this specific example?  I'm assuming that we
   should normally loop over reading until we get LAST, correct?

Right.

   What happens if I register multiple regions with the same cookie value?

You get in trouble -- I need to fix things to reject duplicated cookies
actually, because otherwise there's no way to unregister.

   Is a process responsible for guaranteeing that it umn_unregister()s
   everything before exiting, or will all pending registrations be
   cleaned up/unregistered/whatever when a process exits?

The kernel cleans up of course to handle crashes etc.

 - R.








Re: [ofa-general] Memory registration redux

2009-05-27 Thread Jeff Squyres

On May 26, 2009, at 7:13 PM, Roland Dreier (rdreier) wrote:


/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)



I don't quite grok this.  Is the intent that HINT will only be set if  
an *entire* hint_start/hint_end range is invalidated by a single  
event?  I.e., if only part of the hint_start/hint_end range is  
invalidated, you'll get the cookie back, but not what part of the  
range is invalid (because assumedly the entire IBV registration is now  
invalid anyway)?



 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.



Just to be clear -- we're supposed to keep reading events until we get  
a LAST event?



if (*umn_counter != 0) {
fprintf(stderr, "counter = %lld (expected 0)\n", *umn_counter);

return 1;
}



Some clarification questions about umn_counter:

1. Will it increase by 1 each time a page (or set of pages?) is  
removed from a user process?


2. Does it change if pages are *added* to a user process?  I.e., does  
the counter indicate *removals* or *changes* to the user process page  
table?



t = mmap(NULL, 3 * page_size, PROT_READ,
 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

if (umn_register(t, 3 * page_size, 123))
return 1;

munmap(t + page_size, page_size);

printf("ummunot events: %lld\n", *umn_counter);

if (*umn_counter > 0) {



Is the *umn_counter value guaranteed to have been changed by the time  
munmap() returns?



struct ummunot_event ev[2];



Did you pick [2] here simply because you're only expecting an INVAL  
and a LAST event in this specific example?  I'm assuming that we  
should normally loop over reading until we get LAST, correct?




int len;
int i;

len = read(umn_fd, ev, sizeof ev);
printf("read %d events (%d tot)\n", (int) (len / sizeof ev[0]), len);

for (i = 0; i < len / sizeof ev[0]; ++i) {
        switch (ev[i].type) {
        case UMMUNOT_EVENT_TYPE_INVAL:
                printf("[%3d]: inval cookie %lld\n",
                       i, ev[i].user_cookie_counter);
                if (ev[i].flags & UMMUNOT_EVENT_FLAG_HINT)
                        printf("  hint %llx...%llx\n",
                               ev[i].hint_start, ev[i].hint_end);
                break;
        case UMMUNOT_EVENT_TYPE_LAST:
                printf("[%3d]: empty up to %lld\n",
                       i, ev[i].user_cookie_counter);
                break;
        default:
                printf("[%3d]: unknown event type %d\n",
                       i, ev[i].type);
                break;
        }
}
}

umn_unregister(123);



What happens if I register multiple regions with the same cookie value?

Is a process responsible for guaranteeing that it umn_unregister()s  
everything before exiting, or will all pending registrations be  
cleaned up/unregistered/whatever when a process exits?


--
Jeff Squyres
Cisco Systems



Re: [ofa-general] Memory registration redux

2009-05-27 Thread Roland Dreier
   /*
* If type field is INVAL, then user_cookie_counter holds the
* user_cookie for the region being reported; if the HINT flag is set
* then hint_start/hint_end hold the start and end of the mapping that
* was invalidated.  (If HINT is not set, then multiple events
* invalidated parts of the registered range and hint_start/hint_end
* should be ignored)

  I don't quite grok this.  Is the intent that HINT will only be set if
  an *entire* hint_start/hint_end range is invalidated by a single
  event?  I.e., if only part of the hint_start/hint_end range is
  invalidated, you'll get the cookie back, but not what part of the
  range is invalid (because assumedly the entire IBV registration is now
  invalid anyway)?

Basically, I just keep one hint_start/hint_end.  If multiple events hit
the same registration then I just give up and don't give you a hint.

* If type is LAST, then the read operation has emptied the list of
* invalidated regions, and user_cookie_counter holds the value of the
* kernel's generation counter when the empty list occurred.  The
* other fields are not filled in for this event.

  Just to be clear -- we're supposed to keep reading events until we get
  a LAST event?

Yes, that's probably the sanest use case.
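That drain-until-LAST loop can be sketched as follows.  The batch
processing is factored into a function so it can be exercised without
the real /dev/ummunot device; umn_process_batch() and demo_inval() are
illustrative names, not part of the proposed interface (the struct
layout matches the test program earlier in the thread):

```c
/* Sketch of the "keep reading until LAST" event protocol. */
typedef unsigned int       u32;
typedef unsigned long long u64;

enum {
        UMMUNOT_EVENT_TYPE_INVAL = 0,
        UMMUNOT_EVENT_TYPE_LAST  = 1,
};

struct ummunot_event {
        u32 type;
        u32 flags;
        u64 hint_start;
        u64 hint_end;
        u64 user_cookie_counter;
};

static u64 last_cookie;                 /* demo handler just records it */
static void demo_inval(u64 cookie) { last_cookie = cookie; }

/* Process one read() batch of n events.  Returns 1 once a LAST event
 * is seen, storing the kernel's generation counter in *gen (the
 * kernel's list is now drained); returns 0 if the caller should
 * read() another batch.  handle_inval() stands in for whatever cache
 * shoot-down the library performs per invalidated cookie. */
static int umn_process_batch(const struct ummunot_event *ev, int n,
                             u64 *gen, void (*handle_inval)(u64))
{
        int i;

        for (i = 0; i < n; ++i) {
                if (ev[i].type == UMMUNOT_EVENT_TYPE_INVAL)
                        handle_inval(ev[i].user_cookie_counter);
                else if (ev[i].type == UMMUNOT_EVENT_TYPE_LAST) {
                        *gen = ev[i].user_cookie_counter;
                        return 1;
                }
        }
        return 0;
}
```

A consumer would loop read()ing batches into ev[] and calling
umn_process_batch() until it returns 1, then record *gen as its
last-seen generation for the mmap()ed fast-path check.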

  1. Will it increase by 1 each time a page (or set of pages?) is
  removed from a user process?

As it stands it increases by 1 every time there is an MMU notification,
even if that notification hits multiple registrations.  It wouldn't be
hard to change that to count the number of events generated if that
works better.

  2. Does it change if pages are *added* to a user process?  I.e., does
  the counter indicate *removals* or *changes* to the user process page
  table?

No, additions don't trigger any MMU notification -- that's inherent in
the design of the MMU notifiers stuff.  The idea is that you have a
secondary MMU and MMU notifications are the equivalent of TLB
shootdowns; the secondary MMU is responsible for populating itself on
faults etc.

  Is the *umn_counter value guaranteed to have been changed by the time
  munmap() returns?

Yes.

  Did you pick [2] here simply because you're only expecting an INVAL
  and a LAST event in this specific example?  I'm assuming that we
  should normally loop over reading until we get LAST, correct?

Right.

  What happens if I register multiple regions with the same cookie value?

You get in trouble -- I need to fix things to reject duplicated cookies
actually, because otherwise there's no way to unregister.

  Is a process responsible for guaranteeing that it umn_unregister()s
  everything before exiting, or will all pending registrations be
  cleaned up/unregistered/whatever when a process exits?

The kernel cleans up of course to handle crashes etc.

 - R.


Re: [ofa-general] Memory registration redux

2009-05-27 Thread Roland Dreier
Fixed version below -- returns EINVAL for an attempt to reuse a user
cookie (since otherwise unregister would get confused).

===

ummunot: Userspace support for MMU notifications

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
kernel (cf struct ummunot_register_ioctl in linux/ummunot.h).

 2. read() to retrieve events generated when a mapping in a watched
address range is invalidated (cf struct ummunot_event in
linux/ummunot.h).  select()/poll()/epoll() and SIGIO are handled
for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
generation counter that is incremented each time an event is
generated.  This allows userspace to have a fast path that checks
that no events have occurred without a system call.
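For illustration only (not part of the patch): the fast path described in step 3 might look like the sketch below. The reg_cache struct and both function names are hypothetical; the counter pointer stands for the page mmap()ed from /dev/ummunot.

```c
/* Illustrative userspace fast path (hypothetical names, not patch code). */
struct reg_cache {
	unsigned long long synced_counter;	/* generation seen at last drain */
};

/* Cheap check done on every cache hit: if the kernel's generation
 * counter still equals the value recorded when we last drained the
 * event list, none of our watched mappings have been invalidated and
 * the cached registration can be trusted without any system call. */
static int cache_is_current(const struct reg_cache *cache,
			    const volatile unsigned long long *umn_counter)
{
	return *umn_counter == cache->synced_counter;
}

/* After read()ing events until the LAST event, record the counter
 * value that LAST reported so the fast path is valid again. */
static void cache_mark_synced(struct reg_cache *cache,
			      unsigned long long last_event_counter)
{
	cache->synced_counter = last_event_counter;
}
```

This is exactly what the mmap()ed page buys: the common case (no invalidations since the last drain) costs one volatile load instead of a read() system call.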

NOT-YET-Signed-off-by: Roland Dreier rola...@cisco.com
---
 drivers/char/Kconfig|   12 ++
 drivers/char/Makefile   |1 +
 drivers/char/ummunot.c  |  457 +++
 include/linux/ummunot.h |   85 +
 4 files changed, 555 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunot.c
 create mode 100644 include/linux/ummunot.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..91fe068 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,18 @@ config DEVPORT
depends on ISA || PCI
default y
 
+config UMMUNOT
+   tristate "Userspace MMU notifications"
+   select MMU_NOTIFIER
+   help
+ The ummunot (userspace MMU notification) driver creates a
+ character device that can be used by userspace libraries to
+ get notifications when an application's memory mapping
+ changes.  This is used, for example, by RDMA libraries to
+ improve the reliability of memory registration caching, since
+ the kernel's MMU notifications can be used to know precisely
+ when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..dcbcd7c 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO) += cs5535_gpio.o
 obj-$(CONFIG_GPIO_VR41XX)  += vr41xx_giu.o
 obj-$(CONFIG_GPIO_TB0219)  += tb0219.o
 obj-$(CONFIG_TELCLOCK) += tlclk.o
+obj-$(CONFIG_UMMUNOT)  += ummunot.o
 
 obj-$(CONFIG_MWAVE)+= mwave/
 obj-$(CONFIG_AGP)  += agp/
diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c
new file mode 100644
index 000..1341edc
--- /dev/null
+++ b/drivers/char/ummunot.c
@@ -0,0 +1,457 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include 

Re: [ofa-general] Memory registration redux

2009-05-27 Thread Jeff Squyres

Other MPI implementors -- what do you think of this scheme?


On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:



   /*
    * If type field is INVAL, then user_cookie_counter holds the
    * user_cookie for the region being reported; if the HINT flag is set
    * then hint_start/hint_end hold the start and end of the mapping that
    * was invalidated.  (If HINT is not set, then multiple events
    * invalidated parts of the registered range and hint_start/hint_end
    * should be ignored)

  I don't quite grok this.  Is the intent that HINT will only be set if
  an *entire* hint_start/hint_end range is invalidated by a single
  event?  I.e., if only part of the hint_start/hint_end range is
  invalidated, you'll get the cookie back, but not what part of the
  range is invalid (because assumedly the entire IBV registration is now
  invalid anyway)?

Basically, I just keep one hint_start/hint_end.  If multiple events hit
the same registration then I just give up and don't give you a hint.

    * If type is LAST, then the read operation has emptied the list of
    * invalidated regions, and user_cookie_counter holds the value of the
    * kernel's generation counter when the empty list occurred.  The
    * other fields are not filled in for this event.

  Just to be clear -- we're supposed to keep reading events until we get
  a LAST event?

Yes, that's probably the sanest use case.

  1. Will it increase by 1 each time a page (or set of pages?) is
  removed from a user process?

As it stands it increases by 1 every time there is an MMU notification,
even if that notification hits multiple registrations.  It wouldn't be
hard to change that to count the number of events generated if that
works better.

  2. Does it change if pages are *added* to a user process?  I.e., does
  the counter indicate *removals* or *changes* to the user process page
  table?

No, additions don't trigger any MMU notification -- that's inherent in
the design of the MMU notifiers stuff.  The idea is that you have a
secondary MMU and MMU notifications are the equivalent of TLB
shootdowns; the secondary MMU is responsible for populating itself on
faults etc.

  Is the *unm_counter value guaranteed to have been changed by the time
  munmap() returns?

Yes.

  Did you pick [2] here simply because you're only expecting an INVAL
  and a LAST event in this specific example?  I'm assuming that we
  should normally loop over reading until we get LAST, correct?

Right.

  What happens if I register multiple regions with the same cookie value?


You get in trouble -- I need to fix things to reject duplicated cookies
actually, because otherwise there's no way to unregister.

  Is a process responsible for guaranteeing that it umn_unregister()s
  everything before exiting, or will all pending registrations be
  cleaned up/unregistered/whatever when a process exits?

The kernel cleans up of course to handle crashes etc.

 - R.




--
Jeff Squyres
Cisco Systems



Re: [ofa-general] Memory registration redux

2009-05-27 Thread Roland Dreier
Sigh... real version that returns EINVAL for an attempt to reuse a user
cookie (since otherwise unregister would get confused).  Previous
posting was the old patch, sorry.

===

ummunot: Userspace support for MMU notifications

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
kernel (cf struct ummunot_register_ioctl in linux/ummunot.h).

 2. read() to retrieve events generated when a mapping in a watched
address range is invalidated (cf struct ummunot_event in
linux/ummunot.h).  select()/poll()/epoll() and SIGIO are handled
for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
generation counter that is incremented each time an event is
generated.  This allows userspace to have a fast path that checks
that no events have occurred without a system call.

Signed-off-by: Roland Dreier rola...@cisco.com
---
 drivers/char/Kconfig|   12 ++
 drivers/char/Makefile   |1 +
 drivers/char/ummunot.c  |  469 +++
 include/linux/ummunot.h |   85 +
 4 files changed, 567 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunot.c
 create mode 100644 include/linux/ummunot.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..91fe068 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,18 @@ config DEVPORT
depends on ISA || PCI
default y
 
+config UMMUNOT
+   tristate "Userspace MMU notifications"
+   select MMU_NOTIFIER
+   help
+ The ummunot (userspace MMU notification) driver creates a
+ character device that can be used by userspace libraries to
+ get notifications when an application's memory mapping
+ changes.  This is used, for example, by RDMA libraries to
+ improve the reliability of memory registration caching, since
+ the kernel's MMU notifications can be used to know precisely
+ when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..dcbcd7c 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO) += cs5535_gpio.o
 obj-$(CONFIG_GPIO_VR41XX)  += vr41xx_giu.o
 obj-$(CONFIG_GPIO_TB0219)  += tb0219.o
 obj-$(CONFIG_TELCLOCK) += tlclk.o
+obj-$(CONFIG_UMMUNOT)  += ummunot.o
 
 obj-$(CONFIG_MWAVE)+= mwave/
 obj-$(CONFIG_AGP)  += agp/
diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c
new file mode 100644
index 000..ebfd038
--- /dev/null
+++ b/drivers/char/ummunot.c
@@ -0,0 +1,469 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include 

Re: [ofa-general] Memory registration redux

2009-05-26 Thread Roland Dreier

 Or, ignore the overlapping problem, and use your original technique,
 slightly modified:
  - Userspace registers a counter with the kernel. Kernel pins the
page, sets up mmu notifiers and increments the counter when
invalidates intersect with registrations
  - Kernel maintains a linked list of registrations that have been
invalidated via mmu notifiers using the registration structure
and a dirty bit
  - Userspace checks the counter at every cache hit, if different it
calls into the kernel:
MR_Cookie *mrs[100];
int rc = ibv_get_invalid_mrs(mrs,100);
invalidate_cache(mrs,rc);
// Repeat until drained
 
get_invalid_mrs traverses the linked list and returns an
identifying value to userspace, which looks it up in the cache,
calls unregister and removes it from the cache.
   
   What's the advantage of this?  I have to do the get_invalid_mrs() call a
   bunch of times, rather than just reading which ones are invalid from the
   cache directly?
  
  This is a trade off, the above is a more normal kernel API and lets
  the app get an list of changes it can scan. Having the kernel update
  flags means if the app wants a list of changes it has to scan all
  registrations.

The more I thought about this, the more I liked the idea, until I liked
it so much that I actually went ahead and prototyped this.  A
preliminary version is below -- *very* lightly tested, and no doubt
there are obvious bugs that any real use or review will uncover.  But I
thought I'd throw it out and hope for comments and/or testing.  I'm
actually pretty happy with how small and simple this ended up being.

I'll reply to this message with a simple test program I've used to
sanity check this.

===

[PATCH] ummunot: Userspace support for MMU notifications

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
kernel (cf struct ummunot_register_ioctl in linux/ummunot.h).

 2. read() to retrieve events generated when a mapping in a watched
address range is invalidated (cf struct ummunot_event in
linux/ummunot.h).  select()/poll()/epoll() and SIGIO are handled
for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
generation counter that is incremented each time an event is
generated.  This allows userspace to have a fast path that checks
that no events have occurred without a system call.

NOT-Signed-off-by: Roland Dreier rola...@cisco.com
---
 drivers/char/Kconfig|   12 ++
 drivers/char/Makefile   |1 +
 drivers/char/ummunot.c  |  457 +++
 include/linux/ummunot.h |   85 +
 4 files changed, 555 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunot.c
 create mode 100644 include/linux/ummunot.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..91fe068 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,18 @@ config DEVPORT
depends on ISA || PCI
default y
 
+config UMMUNOT
+   tristate "Userspace MMU notifications"
+   select MMU_NOTIFIER
+   help
+ The ummunot (userspace MMU notification) driver creates a
+ character device that can be used by userspace libraries to
+ get notifications when an application's memory mapping
+ changes.  This is used, for example, by RDMA libraries to
+ improve the reliability of memory registration caching, since
+ the kernel's MMU notifications can be used to know precisely
+ when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..dcbcd7c 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO) += cs5535_gpio.o
 obj-$(CONFIG_GPIO_VR41XX)  += vr41xx_giu.o
 obj-$(CONFIG_GPIO_TB0219)  += tb0219.o
 obj-$(CONFIG_TELCLOCK) += tlclk.o
+obj-$(CONFIG_UMMUNOT)  += ummunot.o
 
 obj-$(CONFIG_MWAVE)+= mwave/
 obj-$(CONFIG_AGP)  += agp/
diff --git a/drivers/char/ummunot.c 

Re: [ofa-general] Memory registration redux

2009-05-26 Thread Roland Dreier
Here's the test program:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/types.h>
#include <linux/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

#define UMMUNOT_INTF_VERSION 1

enum {
UMMUNOT_EVENT_TYPE_INVAL= 0,
UMMUNOT_EVENT_TYPE_LAST = 1,
};

enum {
UMMUNOT_EVENT_FLAG_HINT = 1 << 0,
};

/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)
 *
 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.
 */
struct ummunot_event {
__u32   type;
__u32   flags;
__u64   hint_start;
__u64   hint_end;
__u64   user_cookie_counter;
};

struct ummunot_register_ioctl {
__u32   intf_version;   /* in */
__u32   reserved1;
__u64   start;  /* in */
__u64   end;/* in */
__u64   user_cookie;/* in */
};

#define UMMUNOT_MAGIC   'U'

#define UMMUNOT_REGISTER_REGION _IOWR(UMMUNOT_MAGIC, 1, \
  struct ummunot_register_ioctl)
#define UMMUNOT_UNREGISTER_REGION   _IOW(UMMUNOT_MAGIC, 2, __u64)

static int umn_fd;
static volatile unsigned long long *umn_counter;

static int umn_init(void)
{
umn_fd = open("/dev/ummunot", O_RDONLY);
if (umn_fd < 0) {
perror("open");
return 1;
}

umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ,
   MAP_SHARED, umn_fd, 0);
if (umn_counter == MAP_FAILED) {
perror("mmap");
return 1;
}

return 0;
}

static int umn_register(void *buf, size_t size, __u64 cookie)
{
struct ummunot_register_ioctl r = {
.intf_version   = UMMUNOT_INTF_VERSION,
.start  = (unsigned long) buf,
.end= (unsigned long) buf + size,
.user_cookie= cookie,
};

if (ioctl(umn_fd, UMMUNOT_REGISTER_REGION, &r)) {
perror("ioctl");
return 1;
}

return 0;
}

static int umn_unregister(__u64 cookie)
{
if (ioctl(umn_fd, UMMUNOT_UNREGISTER_REGION, &cookie)) {
perror("ioctl");
return 1;
}

return 0;
}

int main(int argc, char *argv[])
{
int page_size = sysconf(_SC_PAGESIZE);
void *t;

if (umn_init())
return 1;

if (*umn_counter != 0) {
fprintf(stderr, "counter = %lld (expected 0)\n", *umn_counter);
return 1;
}

t = mmap(NULL, 3 * page_size, PROT_READ,
 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

if (umn_register(t, 3 * page_size, 123))
return 1;

munmap(t + page_size, page_size);

printf("ummunot events: %lld\n", *umn_counter);

if (*umn_counter > 0) {
struct ummunot_event ev[2];
int len;
int i;

len = read(umn_fd, ev, sizeof ev);
printf("read %d events (%d tot)\n", len / sizeof ev[0], len);

for (i = 0; i < len / sizeof ev[0]; ++i) {
switch (ev[i].type) {
case UMMUNOT_EVENT_TYPE_INVAL:
printf("[%3d]: inval cookie %lld\n",
   i, ev[i].user_cookie_counter);
if (ev[i].flags & UMMUNOT_EVENT_FLAG_HINT)
printf("  hint %llx...%llx\n",
   ev[i].hint_start, ev[i].hint_end);
break;
case UMMUNOT_EVENT_TYPE_LAST:
printf("[%3d]: empty up to %lld\n",
   i, ev[i].user_cookie_counter);
break;
default:
printf("[%3d]: unknown event type %d\n",
   i, ev[i].type);
break;
}
}
}

umn_unregister(123);
munmap(t, page_size);

printf("ummunot events: %lld\n", *umn_counter);

return 0;
}

Re: [ofa-general] Memory registration redux

2009-05-26 Thread Jason Gunthorpe
On Tue, May 26, 2009 at 04:13:08PM -0700, Roland Dreier wrote:

  Or, ignore the overlapping problem, and use your original technique,
  slightly modified:
   - Userspace registers a counter with the kernel. Kernel pins the
 page, sets up mmu notifiers and increments the counter when
 invalidates intersect with registrations
   - Kernel maintains a linked list of registrations that have been
 invalidated via mmu notifiers using the registration structure
 and a dirty bit
   - Userspace checks the counter at every cache hit, if different it
 calls into the kernel:
 MR_Cookie *mrs[100];
 int rc = ibv_get_invalid_mrs(mrs,100);
 invalidate_cache(mrs,rc);
 // Repeat until drained
  
 get_invalid_mrs traverses the linked list and returns an
 identifying value to userspace, which looks it up in the cache,
 calls unregister and removes it from the cache.

What's the advantage of this?  I have to do the get_invalid_mrs() call a
bunch of times, rather than just reading which ones are invalid from the
cache directly?
   
   This is a trade off, the above is a more normal kernel API and lets
   the app get an list of changes it can scan. Having the kernel update
   flags means if the app wants a list of changes it has to scan all
   registrations.
 
 The more I thought about this, the more I liked the idea, until I liked
 it so much that I actually went ahead and prototyped this.  A
 preliminary version is below -- *very* lightly tested, and no doubt
 there are obvious bugs that any real use or review will uncover.  But I
 thought I'd throw it out and hope for comments and/or testing.  I'm
 actually pretty happy with how small and simple this ended up being.

Seems reasonable to me. This doesn't catch all mmap cases, ie this
kind of stuff:

 t = mmap(NULL, 3 * page_size, PROT_READ,
 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
 if (umn_register(t, 3 * page_size, 123))
return 1;

 t = mmap(t,page_size,PROT_READ,MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,-1,0);
 // Event? Probably

 munmap(t,page_size);
 // Event? No, no MAP_POPULATE

 t = mmap(t,page_size,PROT_READ,MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,-1,0);
 // Event? No

And I guess the use of MAP_POPULATE is deliberate as that's how mmu
notifier works..

So the use model for a MPI would be to call ibv_register/umn_register
and watch for events. Any event at all means the entire region is
toast and must be re-registered the next time someone calls with that
address. ibv_register does the same as MAP_POPULATE internally..

The MPI library uses the result of this to build a list of invalidated
regions. From time to time the MPI library should unregister those
regions.
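That user-side drain could be sketched as below, reusing the event layout from the posted test program. The drain_events() helper and the u32_t/u64_t typedefs are illustrative stand-ins (not proposed API); in real code the caller would pass the buffer filled in by read() on /dev/ummunot.

```c
/* Stand-ins for the kernel's __u32/__u64 so this sketch is
 * self-contained outside of linux/types.h. */
typedef unsigned int       u32_t;
typedef unsigned long long u64_t;

enum {
	UMMUNOT_EVENT_TYPE_INVAL = 0,
	UMMUNOT_EVENT_TYPE_LAST  = 1,
};

struct ummunot_event {
	u32_t type;
	u32_t flags;
	u64_t hint_start;
	u64_t hint_end;
	u64_t user_cookie_counter;
};

/* Walk one read() buffer's worth of events: collect the cookie of every
 * INVAL event into out[] (the caller then unregisters those cached
 * regions), and if the LAST event is present return the generation
 * counter it carries.  Returns -1 when LAST was not seen, i.e. the
 * kernel's list is not yet drained and another read() is needed.
 * (Toy code: no bounds check on out[], and -1 is assumed not to
 * collide with a real counter value.) */
static long long drain_events(const struct ummunot_event *ev, int n,
			      u64_t *out, int *nout)
{
	int i;

	*nout = 0;
	for (i = 0; i < n; ++i) {
		if (ev[i].type == UMMUNOT_EVENT_TYPE_INVAL)
			out[(*nout)++] = ev[i].user_cookie_counter;
		else if (ev[i].type == UMMUNOT_EVENT_TYPE_LAST)
			return (long long) ev[i].user_cookie_counter;
	}
	return -1;
}
```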

If that is the use then the kernel side should probably also be a
one-shot type of interface..

I'm also trying to think of a use case outside of RDMA and failing - if
the kernel hasn't pinned the pages being watched through some other
means it seems useless as a general feature??

Jason


Re: [ofa-general] Memory registration redux

2009-05-19 Thread Jeff Squyres

On May 18, 2009, at 5:15 PM, Roland Dreier (rdreier) wrote:


So you want the registration cache to be reference counted per-page?
Seems like potentially a lot of overhead -- if someone registers a
million pages, then to check for a cache hit, you have to potentially
check millions of reference counts.



Our caches are hash tables of balanced red-black trees.  So in  
practice, we won't be trolling through anywhere near a million  
reference counts to find a hit.



Hang on.  The whole point of MR caching is exactly that you don't
unregister a memory region, even after you're done using the memory it
covers, in the hope that you'll want to reuse that registration.  And
the whole point of this thread is that an application can then free()
some of the memory that is still registered in the cache.



Sorry -- the implication that I took from Caitlin's text was that the
memory was *used* after it was freed.  That is clearly erroneous.


What OMPI does (and apparently other MPI's do) is simply invalidate  
any registration for free'd memory.  Additionally, we won't unregister  
memory while there is at least one use of it outstanding (that MPI  
knows about, such as a pending non-blocking communication).  We lazily  
unregister just for exactly the case you're talking about (might want  
to use it for verbs communication again later).


  Per my prior mail, Open MPI registers chunks at a time.  Each chunk is
  potentially a multiple of pages.  So yes, you could end up having a
  single registration that spans the buffers used in multiple, distinct
  MPI sends.  We reference count by page to ensure that deregistrations
  do not occur prematurely.

Hmm, I'm worried that the exact semantics of the memory cache seem to be
tied into how the MPI implementation is registering memory.  Open MPI
happens to work in small chunks (I guess) and so your cache is tailored
for that use case.  I know the original proposal was an attempt to come
up with something that all the MPIs can agree on, but it didn't cover
the full semantics, at least not for cases like the overlapping
sub-registrations that we're discussing here.  Is there still one set of
semantics everyone can agree on?




So just to be clear -- let's separate the two issues that are evolving  
from this thread:


1. fix the hole where memory returned to the OS cannot be guaranteed
to be caught by userspace (and therefore may still stay registered
and/or invalidate userspace registration cache entries)


2. have libibverbs include some form of memory registration caching
(potentially using the solution to #1 to know when to invalidate reg.
cache entries)


Personally, I would prioritize the issues in this order.  Did a
solution for #1 get agreed upon?  I admit that I got lost in the
kernel discussion of issues between you, Jason, etc.


Agreeing on registration caching semantics may take a little more
discussion (although, as someone pointed out earlier, if libibverbs'
reg caching is optional, then the verbs-based app can choose to use it
or their own scheme).


--
Jeff Squyres
Cisco Systems



Re: [ofa-general] Memory registration redux

2009-05-18 Thread Jeff Squyres

On May 7, 2009, at 5:58 PM, Roland Dreier (rdreier) wrote:


  Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
  releasing 0x2000-0x2fff.

If everyone is doing this, how do you handle the case that Jason pointed
out, namely:

 * you register 0x1000 ... 0x3fff
 * you want to register 0x2000 ... 0x2fff and have a cache hit
 * you finish up with 0x1000 ... 0x3fff
 * app does something (which is valid since you finished up with the
   bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free()
   that leads to munmap() or whatever), and your hooks tell you so.
 * app reallocates a mapping in 0x3000 ... 0x3fff
 * you want to re-register 0x1000 ... 0x3fff -- but it has to be marked
   both invalid and in-use in the cache at this point !?




Sorry; this mail slipped by me and I just saw it now.

If this can actually happen -- that the mapping of 0x1000 ... 0x3fff
can change even though it is still registered, then we're screwed --
we have no way of knowing that this is now invalid (Open MPI, at least
-- can't speak for others).


Is there a way to detect this condition in userspace?

--
Jeff Squyres
Cisco Systems



Re: [ofa-general] Memory registration redux

2009-05-18 Thread Caitlin Bestler
On Mon, May 18, 2009 at 9:24 AM, Jeff Squyres jsquy...@cisco.com wrote:
 On May 7, 2009, at 5:58 PM, Roland Dreier (rdreier) wrote:

   Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
   releasing 0x2000-0x2fff.

 If everyone is doing this, how do you handle the case that Jason pointed
 out, namely:

  * you register 0x1000 ... 0x3fff
  * you want to register 0x2000 ... 0x2fff and have a cache hit
  * you finish up with 0x1000 ... 0x3fff
  * app does something (which is valid since you finished up with the
   bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free()
   that leads to munmap() or whatever), and your hooks tell you so.
  * app reallocates a mapping in 0x3000 ... 0x3fff
  * you want to re-register 0x1000 ... 0x3fff -- but it has to be marked
   both invalid and in-use in the cache at this point !?



 Sorry; this mail slipped by me and I just saw it now.

 If this can actually happen -- that the mapping of 0x1000 ... 0x3fff can
 change even though it is still registered, then we're screwed -- we have no
 way of knowing that this is now invalid (Open MPI, at least -- can't speak
 for others).

 Is there a way to detect condition this in userspace?

How does 0x1000 to 0x3fff get registered as a single Memory Region?
If it is legitimate to free() 0x3000..0x3fff then how can there ever be a
legitimate reference to 0x1000..0x3fff? If there is no such single reference,
I don't see how a Memory Region is ever created covering that range.

If the user creates the Memory Region, then they are responsible for not
free()ing a portion of it.

Would the MPI library ever create a single large memory region based on
two distinct Sends?


Re: [ofa-general] Memory registration redux

2009-05-18 Thread Jeff Squyres

On May 18, 2009, at 2:02 PM, Caitlin Bestler wrote:

   Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
   releasing 0x2000-0x2fff.

 If everyone is doing this, how do you handle the case that Jason pointed
 out, namely:

  * you register 0x1000 ... 0x3fff
  * you want to register 0x2000 ... 0x2fff and have a cache hit
  * you finish up with 0x1000 ... 0x3fff
  * app does something (which is valid since you finished up with the
    bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free()
    that leads to munmap() or whatever), and your hooks tell you so.
  * app reallocates a mapping in 0x3000 ... 0x3fff
  * you want to re-register 0x1000 ... 0x3fff -- but it has to be marked
    both invalid and in-use in the cache at this point !?



I think I mis-parsed the above scenario in my previous response.

When our memory hooks tell us that memory is about to be removed from  
the process, we unregister all pages in the relevant region and remove  
those entries from the cache.  So the next time you look in the cache  
for 0x3000-0x3fff, it won't be there -- it'll be treated as cache-cold.



How does 0x1000 to 0x3fff get registered as a single Memory Region?
If it is legitimate to free() 0x3000..0x3fff then how can there ever
be a legitimate reference to 0x1000..0x3fff? If there is no such single
reference, I don't see how a Memory Region is ever created covering
that range.

If the user creates the Memory Region, then they are responsible for
not free()ing a portion of it.



Agreed.  If an application does that, it deserves what it gets.

Would the MPI library ever create a single large memory region based on
two distinct Sends?




Per my prior mail, Open MPI registers chunks at a time.  Each chunk is
potentially a multiple of pages.  So yes, you could end up having a
single registration that spans the buffers used in multiple, distinct
MPI sends.  We reference count by page to ensure that deregistrations
do not occur prematurely.


For example, say page X contains the end of one large buffer and the
beginning of another, both of which are being used in ongoing
non-blocking MPI communications.  Then page X's entry in our cache will
have a refcount == 2.  OMPI won't allow the registration containing
that page to become eligible for deregistering until the cache entry's
refcount goes down to 0.
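Jeff's per-page reference-counting scheme can be sketched as a toy model (this is not Open MPI's actual code; the flat page-table array, 4 KB page size, and function names are all invented for illustration):

```c
#define PAGE_SHIFT 12            /* assume 4 KB pages */
#define MAX_PAGES  1024          /* toy limit; a real cache would hash */

/* refcount per page frame of registered virtual memory */
static int page_refcount[MAX_PAGES];

/* Register [start, end): bump each page's refcount.  A page going
 * 0 -> 1 is the only case needing a real HCA registration. */
static int cache_register(unsigned long start, unsigned long end)
{
    int newly_pinned = 0;
    for (unsigned long p = start >> PAGE_SHIFT;
         p <= (end - 1) >> PAGE_SHIFT; p++)
        if (page_refcount[p]++ == 0)
            newly_pinned++;       /* would call ibv_reg_mr() here */
    return newly_pinned;
}

/* Deregister [start, end): drop refcounts.  Only pages reaching 0
 * make their registration eligible for actual deregistration. */
static int cache_deregister(unsigned long start, unsigned long end)
{
    int now_free = 0;
    for (unsigned long p = start >> PAGE_SHIFT;
         p <= (end - 1) >> PAGE_SHIFT; p++)
        if (--page_refcount[p] == 0)
            now_free++;           /* MR blocked until all pages hit 0 */
    return now_free;
}
```

In the 0x1000-0x3fff / 0x2000-0x2fff scenario, deregistering the large range leaves the page at 0x2000 with refcount 1, so the underlying MR stays pinned: exactly the "blocked on also releasing" behavior described above.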


See my prior mail for a more complex example of our cache's behavior.

--
Jeff Squyres
Cisco Systems



Re: [ofa-general] Memory registration redux

2009-05-18 Thread Roland Dreier

  When our memory hooks tell us that memory is about to be removed from
  the process, we unregister all pages in the relevant region and remove
  those entries from the cache.  So the next time you look in the cache
  for 0x3000-0x3fff, it won't be there -- it'll be treated as
  cache-cold.

So you want the registration cache to be reference counted per-page?
Seems like potentially a lot of overhead -- if someone registers a
million pages, then to check for a cache hit, you have to potentially
check millions of reference counts.

   How does 0x1000 to 0x3fff get registered as a single Memory Region?
   If it is legitimate to free() 0x3000..0x3fff then how can there ever
   be a
   legitimate reference to 0x1000..0x3fff? If there is no such single
   reference,
   I don't see how a Memory Region is ever created covering that range.
  
   If the user creates the Memory Region, then they are responsible for
   not
   free()ing a portion of it.
  
  
  Agreed.  If an application does that, it deserves what it gets.

Hang on.  The whole point of MR caching is exactly that you don't
unregister a memory region, even after you're done using the memory it
covers, in the hope that you'll want to reuse that registration.  And
the whole point of this thread is that an application can then free()
some of the memory that is still registered in the cache.

  Per my prior mail, Open MPI registers chunks at a time.  Each chunk is
  potentially a multiple of pages.  So yes, you could end up having a
  single registration that spans the buffers used in multiple, distinct
  MPI sends.  We reference count by page to ensure that deregistrations
  do not occur prematurely.

Hmm, I'm worried that the exact semantics of the memory cache seem to be
tied into how the MPI implementation is registering memory.  Open MPI
happens to work in small chunks (I guess) and so your cache is tailored
for that use case.  I know the original proposal was an attempt to come
up with something that all the MPIs can agree on, but it didn't cover
the full semantics, at least not for cases like the overlapping
sub-registrations that we're discussing here.  Is there still one set of
semantics everyone can agree on?

 - R.


Re: [ofa-general] Memory registration redux

2009-05-11 Thread Jonathan Perkins
On Tue, May 05, 2009 at 04:57:09PM -0400, Jeff Squyres wrote:
 Roland and I chatted on the phone today; I think I now understand  
 Roland's counter-proposal (I clearly didn't before).  Let me try to  
 summarize:

 1. Add a new verb for "set this userspace flag to 1 if mr X ever becomes
 invalid"
 2. Add a new verb for "no longer tell me if mr X ever becomes invalid"
 (i.e., remove the effects of #1)
 3. Add run-time query indicating whether #1 works
 4. Add [optional] memory registration caching to libibverbs

 Prior to talking to Roland, I had envisioned *one* flag in userspace  
 that indicated whether any memory registrations had become invalid.   
 Roland's idea is that there is one flag *per registration* -- you can  
 instantly tell whether a specific registration is valid.

 Given this, let's keep the discussion going here in email -- perhaps the 
 teleconference next Monday may become moot.

It looks like there has been more discussion on how to implement this
idea.  Are we still planning on having this teleconference today?

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo



Re: [ofa-general] Memory registration redux

2009-05-11 Thread Caitlin Bestler
On Thu, May 7, 2009 at 3:48 PM, Jason Gunthorpe
jguntho...@obsidianresearch.com wrote:

 Right, I was only thinking of a new driver call that was along the
 lines of update_mr_pages() that just updates the HCA's mapping with
 new page table entries atomically. It really would be device
 specific. If there is no call available then unregister/register +
 printk log is a fair generic implementation.

 To be clear, what I'm thinking is that this would only be invoked if

Both the IBTA and RDMAC verbs were defined so that the meaning of
L-Key/R-Key/STag + Address could not instantly change from X to Y,
only from X to NULL and then NULL to Y.

There are a lot of good reasons for this, especially for R-Keys or
remotely accessible STags. It ensures that all operations that started
when the translation was X are completed before any that will use the
Y translation can commence. That is not something we want to
accidentally undermine.

There really isn't a reason why this rule needed to apply to entire
Memory Regions. So I don't see a problem with allowing an
update_mr_pages() verb that changes a portion of an MR map, perhaps by
optimal machine-specific hooks when available, without requiring the
entire MR be specified. But it must preserve the guarantee that all
operations initiated with translation X are completed before any
operations for translation Y can be initiated.

Preserving this guarantee should not be a problem for the free() then
reallocate scenarios that have been discussed.


Re: [ofa-general] Memory registration redux

2009-05-11 Thread Jason Gunthorpe
On Mon, May 11, 2009 at 02:23:58PM -0700, Caitlin Bestler wrote:
 On Thu, May 7, 2009 at 3:48 PM, Jason Gunthorpe
 jguntho...@obsidianresearch.com wrote:
 
  Right, I was only thinking of a new driver call that was along the
  lines of update_mr_pages() that just updates the HCA's mapping with
  new page table entries atomically. It really would be device
  specific. If there is no call available then unregister/register +
  printk log is a fair generic implementation.
 
  To be clear, what I'm thinking is that this would only be invoked if
 
 Both the IBTA and RDMAC verbs were defined so that the meaning of
 L-Key/R-Key/STag + Address could not instantly change from X to
 Y, only from X to NULL and then NULL to Y.

Well, this is sort of a grey area; in one sense the meaning isn't
changing, just the underlying physical memory is being moved around by
the OS.

The notion that the verbs refer to some sort of invisible underlying
VM object is nice for an implementation but pretty useless for
MPI..

 There are a lot of good reasons for this, especially for R-Keys or
 remotely accessible STags. It ensures that all operations that
 started when the translation was X are completed before any that
 will use the Y translation can commence. That is not something we
 want to accidentally undermine.

I'm not sure I see how this helps; synchronizing all this is the
responsibility of the application. If it wants to change the mapping
then it should be able to, and if it does so with poor timing then it
will have races and lose data *shrug*. As it stands today there are
already races where apps can lose data transferred after an munmap() or
transfer the wrong data after an mmap(), so the current model is already
broken from that perspective.

Of course an update verb has to operate with similar ordering
guarantees to register/unregister relative to the local work request
queue - that is to say if the verb is done out-of-line with the WR
queue then it must wait for the queue to flush before issuing the
update to the HCA - just like unregister - and then wait for the verb
to complete before returning to the app - just like register.

And we all wish for userspace FRMRs...

Jason


Re: [ofa-general] Memory registration redux

2009-05-07 Thread Jeff Squyres

On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:


By the way, what's the desired behavior of the cache if a process
registers, say, address range 0x1000 ... 0x3fff, and then the same
process registers address range 0x2000 ... 0x2fff (with all the same
permissions, etc)?

The initial registration creates an MR that is still valid for the
smaller virtual address range, so the second registration is much
cheaper if we used the cached registration; but if we use the cache for
the second registration, and then deregister the first one, we're stuck
with a too-big range pinned in the cache because of the second
registration.




I don't know what the other MPI's do in this scenario, but here's what
OMPI will do:


1. lookup 0x1000-0x3fff in the cache; not find any of it, and
therefore register
   - add each page to our cache with a refcount of 1
2. lookup 0x2000-0x2fff in the cache, find that all the pages are
already registered
   - refcount++ on each page in the cache
3. when we go to dereg 0x1000-0x3fff
   - refcount-- on each page in the cache
   - since some pages in the range still have refcount > 0, don't do
anything further

Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
releasing 0x2000-0x2fff.


Note that OMPI will only register a max of X bytes at a time (where X
defaults to 2MB).  So even if a user calls MPI_SEND(...) with an
enormous buffer, we'll register it X/page_size pages at a time, not
the entire buffer at once.  Hence, the "buffer A is blocked from
dereg'ing by buffer B" scenario is *somewhat* mitigated -- it's less
wasteful than if we had registered/cached the entire huge buffer at
once.


Finally, note that if 0x2000-0x2fff had not been registered, the
0x1000-0x3fff pages are not actually deregistered when all the pages'
refcounts go to 0 -- they are just moved to the "able to be dereg'ed"
list.  We don't actually dereg it until we later try to reg new
memory and fail due to lack of resources.  Then we take entries off
the "able to be dereg'ed" list and dereg them, then try reg'ing the
new memory again.
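The lazy-deregistration policy described here (park zero-refcount registrations, reclaim only when a later registration fails for lack of resources) can be sketched as follows; the structure and function names are invented, and a real cache of course tracks far more state:

```c
#include <stddef.h>

struct cached_mr {
    struct cached_mr *next;
    unsigned long start, len;
};

/* Registrations whose refcount hit 0: still pinned on the HCA, but
 * eligible for eviction if registration resources run out. */
static struct cached_mr *evictable;

static void mark_evictable(struct cached_mr *mr)
{
    mr->next = evictable;
    evictable = mr;
}

/* Called when ibv_reg_mr() fails for lack of resources: really
 * deregister one parked MR and report whether a retry makes sense. */
static int reclaim_one(void)
{
    struct cached_mr *mr = evictable;
    if (mr == NULL)
        return 0;                 /* nothing left to evict */
    evictable = mr->next;
    /* would call ibv_dereg_mr() on mr's registration here */
    return 1;
}
```

The caller's registration path loops: try to register, and on failure call reclaim_one() and retry until it succeeds or nothing is left to evict.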


MVAPICH: do you guys do similar things?

(I don't know if HP/Scali/Intel will comment on their registration  
cache schemes)


--
Jeff Squyres
Cisco Systems



RE: [ofa-general] Memory registration redux

2009-05-07 Thread Tang, Changqing

HP-MPI is doing pretty much the same thing.  --CQ
 

 -Original Message-
 From: general-boun...@lists.openfabrics.org 
 [mailto:general-boun...@lists.openfabrics.org] On Behalf Of 
 Jeff Squyres
 Sent: Thursday, May 07, 2009 8:54 AM
 To: Roland Dreier
 Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny 
 Verkhovsky; HÃ¥kon Bugge; Donald Kerr; OpenFabrics General; 
 Alexander Supalov
 Subject: Re: [ofa-general] Memory registration redux
 
 On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:
 
  By the way, what's the desired behavior of the cache if a process 
  registers, say, address range 0x1000 ... 0x3fff, and then the same 
  process registers address range 0x2000 ... 0x2fff (with all 
 the same 
  permissions, etc)?
 
  The initial registration creates an MR that is still valid for the 
  smaller virtual address range, so the second registration is much 
  cheaper if we used the cached registration; but if we use the cache 
  for the second registration, and then deregister the first 
 one, we're 
  stuck with a too-big range pinned in the cache because of 
 the second 
  registration.
 
 
 
 I don't know what the other MPI's do in this scenario, but 
 here's what OMPI will do:
 
 1. lookup 0x1000-0x3fff in the cache; not find any of it, 
 and therefore register
 - add each page to our cache with a refcount of 1 2. 
 lookup 0x2000-0x2fff in the cache, find that all the pages 
 are already registered
 - refcount++ on each page in the cache 3. when we go to 
 dereg 0x1000-0x3fff
 - refcount-- on each page in the cache
 - since some pages in the range still have refcount > 0, 
 don't do anything further
 
 Specifically: the actual dereg of 0x1000-0x3fff is blocked on 
 also releasing 0x2000-0x2fff.
 
 Note that OMPI will only register a max of X bytes at a time 
 (where X defaults to 2MB).  So even if a user calls 
 MPI_SEND(...) with an enormous buffer, we'll register it 
 X/page_size pages at a time, not the entire buffer at once.  
 Hence, the buffer A is blocked from dereg'ing by buffer B 
 scenario is *somewhat* mitigated -- it's less wasteful than 
 if we can registered/cached the entire huge buffer at once.
 
 Finally, note that if 0x2000-0x2fff had not been registered, 
 the 0x1000-0x3fff pages are not actually deregistered when 
 all the pages'  
 refcounts go to 0 -- they are just moved to the able to be 
 dereg'ed list.  We don't actually dereg it until we later 
 try to reg new memory and fail due to lack of resources.  
 Then we take entries off the able to be dereg'ed list and 
 dereg them, then try reg'ing the new memory again.
 
 MVAPICH: do you guys do similar things?
 
 (I don't know if HP/Scali/Intel will comment on their 
 registration cache schemes)
 
 --
 Jeff Squyres
 Cisco Systems
 


RE: [ofa-general] Memory registration redux

2009-05-07 Thread Matthew Koop

MVAPICH is doing pretty much the same thing as well.

Matt

On Thu, 7 May 2009, Tang, Changqing wrote:


 HP-MPI is pretty much doing the similar thing.  --CQ


  -Original Message-
  From: general-boun...@lists.openfabrics.org
  [mailto:general-boun...@lists.openfabrics.org] On Behalf Of
  Jeff Squyres
  Sent: Thursday, May 07, 2009 8:54 AM
  To: Roland Dreier
  Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny
  Verkhovsky; H?kon Bugge; Donald Kerr; OpenFabrics General;
  Alexander Supalov
  Subject: Re: [ofa-general] Memory registration redux
 
  On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:
 
   By the way, what's the desired behavior of the cache if a process
   registers, say, address range 0x1000 ... 0x3fff, and then the same
   process registers address range 0x2000 ... 0x2fff (with all
  the same
   permissions, etc)?
  
   The initial registration creates an MR that is still valid for the
   smaller virtual address range, so the second registration is much
   cheaper if we used the cached registration; but if we use the cache
   for the second registration, and then deregister the first
  one, we're
   stuck with a too-big range pinned in the cache because of
  the second
   registration.
  
 
 
  I don't know what the other MPI's do in this scenario, but
  here's what OMPI will do:
 
  1. lookup 0x1000-0x3fff in the cache; not find any of it,
  and therefore register
  - add each page to our cache with a refcount of 1 2.
  lookup 0x2000-0x2fff in the cache, find that all the pages
  are already registered
  - refcount++ on each page in the cache 3. when we go to
  dereg 0x1000-0x3fff
  - refcount-- on each page in the cache
  - since some pages in the range still have refcount > 0,
  don't do anything further
 
  Specifically: the actual dereg of 0x1000-0x3fff is blocked on
  also releasing 0x2000-0x2fff.
 
  Note that OMPI will only register a max of X bytes at a time
  (where X defaults to 2MB).  So even if a user calls
  MPI_SEND(...) with an enormous buffer, we'll register it
  X/page_size pages at a time, not the entire buffer at once.
  Hence, the buffer A is blocked from dereg'ing by buffer B
  scenario is *somewhat* mitigated -- it's less wasteful than
  if we can registered/cached the entire huge buffer at once.
 
  Finally, note that if 0x2000-0x2fff had not been registered,
  the 0x1000-0x3fff pages are not actually deregistered when
  all the pages'
  refcounts go to 0 -- they are just moved to the able to be
  dereg'ed list.  We don't actually dereg it until we later
  try to reg new memory and fail due to lack of resources.
  Then we take entries off the able to be dereg'ed list and
  dereg them, then try reg'ing the new memory again.
 
  MVAPICH: do you guys do similar things?
 
  (I don't know if HP/Scali/Intel will comment on their
  registration cache schemes)
 
  --
  Jeff Squyres
  Cisco Systems
 


Re: [ofa-general] Memory registration redux

2009-05-07 Thread Roland Dreier
   No... every HCA just needs to support register and unregister.  It
   doesn't have to support changing the mapping without full unregister and
   reregister.
  
  Well, I would imagine this entire process to be a HCA specific
  operation, so HW that supports a better method can use it, otherwise
  it has to register/unregister. Is this a concern today with existing
  HCAs?
  
  Using register/unregister exposes a race for the original case you
  brought up - but that race is completely unfixable without hardware
  support. At least it now becomes a hw specific race that can be
  printk'd and someday fixed in new HW rather than an unsolvable API
  problem..

We definitely don't want to duplicate all this logic in every hardware
device driver, so most of it needs to be generic.  If we're adding new
low-level driver methods to handle this, that definitely raises the cost
of implementing all this.  But I guess if we start with a generic
register/unregister fallback that drivers can override for better
performance, then I think we're in good shape.

   Also this requires potentially walking the page tables of the entire
   process, checking to see if any mappings have changed.  We really want
   to keep the information that the MMU notifiers give us, namely which
   virtual address range is changing.
  
  Walking the page tables of every registration in the process, not the
  entire process.

Yes... but there are bugs in the bugzilla about mthca being limited to
only 8 GB of registration by default or something like that, and having
that break Intel MPI in some cases.  So some MPI jobs want to have 10s
of GBs of registered memory -- walking millions of page table entries
for every resync operation seems like a big problem to me.

Which means that the MMU notifier has to walk the list of memory
registrations and mark any affected ones as dirty (possibly with a hint
about which pages were invalidated) as you suggest below.  Falling back
to the check every registration ultra-slow-path I think should never
ever happen.

  I was thinking more along the lines of having the mmu notifiers put
  affected registrations on a per-process (or PD?) dirty linked list,
  with the link pointers as part of the registration structure. Set a
  dirty flag in the registration too. An extra pointer per registration
  and a minor incremental cost to the existing work the mmu notifier
  would have to do.

Yes, makes sense.

 Only part I don't immediately see is how to trap creation of new VM
 (ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..
   
   Why do we care?  The initial faulting in of mappings occurs when an MR
   is created.
  
  Well, exactly, that's the problem. If you can't trap mmap you cannot
  do the initial faulting and mapping for a new object that is being
  mapped into an existing MR.
  
  Consider:
  
void *a = mmap(0,PAGE_SIZE..);
ibv_register();
// [..]
mmunmap(a);  /* sic: munmap(a); */
ibv_synchronize();
  
// At this point we want the HCA mapping to point to oblivion
  
mmap(a,PAGE_SIZE,MAP_FIXED);
ibv_synchronize();
  
// And now we want it to point to the new allocation
  
  I use MAP_FIXED to illustrate the point, but Jeff has said the same
  address re-use happens randomly in real apps.

This can be handled I think, although at some cost.  Just have the
kernel keep track of which MMU sequence number actually invalidated each
MR, and return (via ibv_synchronize()) the MMU change sequence number
that userspace is in sync with.  So in the example above, the first
synchronize after munmap() will fail to fix up the first registration,
since it is pointing to an unmapped virtual address, and hence it will
leave that MR on the dirty list, and return that sequence number as not
being synced up yet.  And then the second synchronize will see that MR
still on the dirty list, and try again to find the pages.

Passing the sequence number back to userspace makes it possible for
userspace to know that it still has to call ibv_synchronize() again.
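The sequence-number handshake Roland sketches might look like this from the caller's side. Everything below is a userspace model of the *proposed* behavior, not a real verb: ibv_synchronize() is assumed to return the MMU-change sequence number the kernel has fixed registrations up through, which can lag the current counter while some MR still points at unmapped memory:

```c
/* Model of the proposed kernel state: kernel_seq is the MMU-notifier
 * generation counter; synced_seq is how far fix-ups have progressed. */
static unsigned long kernel_seq, synced_seq;

static unsigned long seqno_current(void) { return kernel_seq; }

/* Model of the proposed ibv_synchronize(): each call makes one
 * fix-up pass over the dirty MR list and returns the sequence number
 * userspace is now in sync with; an MR pointing at unmapped VM stays
 * dirty, so the return value can lag seqno_current(). */
static unsigned long ibv_synchronize_model(void)
{
    if (synced_seq < kernel_seq)
        synced_seq++;
    return synced_seq;
}

/* Userspace loop: keep calling synchronize until caught up. */
static unsigned long sync_until_current(void)
{
    unsigned long s;
    do {
        s = ibv_synchronize_model();
    } while (s < seqno_current());
    return s;
}
```

In the munmap()/mmap() example above, the first synchronize would return a stale number (the unmapped MR stayed dirty) and the loop would call it again after the re-map.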

There is the possibility that a 1GB MR will have its last page unmapped,
and end up having 100s of thousands of pages walked again and again in
every synchronize operation.

  This method avoids the problem you noticed, but there is extra work to
  fixup a registration that may never be used again. I strongly suspect
  that in the majority of cases this extra work should be about on the
  same order as userspace calling unregister on the MR.

Yes, also it doesn't match the current MPI way of lazily unregistering
things, and only garbage collecting the refcnt 0 cache entries when a
registration fails.  With this method, if userspace unregisters
something, it really is gone, and if it doesn't unregister it, then it
really uses up space until userspace explicitly unregisters it.  Not
sure how MPI implementers feel about that.

  Or, ignore the overlapping problem, and use your original technique,
  slightly modified:
   - Userspace 

Re: [ofa-general] Memory registration redux

2009-05-07 Thread Roland Dreier
  I don't know what the other MPI's do in this scenario, but here's what
  OMPI will do:
  
  1. lookup 0x1000-0x3fff in the cache; not find any of it, and
  therefore register
 - add each page to our cache with a refcount of 1
  2. lookup 0x2000-0x2fff in the cache, find that all the pages are
  already registered
 - refcount++ on each page in the cache
  3. when we go to dereg 0x1000-0x3fff
 - refcount-- on each page in the cache
 - since some pages in the range still have refcount > 0, don't do
  anything further
  
  Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
  releasing 0x2000-0x2fff.

If everyone is doing this, how do you handle the case that Jason pointed
out, namely:

 * you register 0x1000 ... 0x3fff
 * you want to register 0x2000 ... 0x2fff and have a cache hit
 * you finish up with 0x1000 ... 0x3fff
 * app does something (which is valid since you finished up with the
   bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free()
   that leads to munmap() or whatever), and your hooks tell you so.
 * app reallocates a mapping in 0x3000 ... 0x3fff
 * you want to re-register 0x1000 ... 0x3fff -- but it has to be marked
   both invalid and in-use in the cache at this point !?

 - R.


Re: [ofa-general] Memory registration redux

2009-05-07 Thread Jason Gunthorpe
On Thu, May 07, 2009 at 02:46:55PM -0700, Roland Dreier wrote:

   Using register/unregister exposes a race for the original case you
   brought up - but that race is completely unfixable without hardware
   support. At least it now becomes a hw specific race that can be
   printk'd and someday fixed in new HW rather than an unsolvable API
   problem..
 
 We definitely don't want to duplicate all this logic in every hardware
 device driver, so most of it needs to be generic.  If we're adding new
 low-level driver methods to handle this, that definitely raises the cost
 of implementing all this.  But I guess if we start with a generic
 register/unregister fallback that drivers can override for better
 performance, then I think we're in good shape.

Right, I was only thinking of a new driver call that was along the
lines of update_mr_pages() that just updates the HCA's mapping with
new page table entires atomically. It really would be device
specific. If there is no call available then unregister/register +
printk log is a fair generic implementation.

To be clear, what I'm thinking is that this would only be invoked if
the VM is being *replaced*. Simply unmapping VM should do nothing.

 Which means that the MMU notifier has to walk the list of memory
 registrations and mark any affected ones as dirty (possibly with a hint
 about which pages were invalidated) as you suggest below.  Falling back
 to the check every registration ultra-slow-path I think should never
 ever happen.

Yikes, yes, that makes sense. And hearing that at least openmpi caps
the registration size makes me think per-page granularity is probably
unnecessary to track.

   Well, exactly, that's the problem. If you can't trap mmap you cannot
   do the initial faulting and mapping for a new object that is being
   mapped into an existing MR.
   
   Consider:
   
 void *a = mmap(0,PAGE_SIZE..);
 ibv_register();
 // [..]
  mmunmap(a);  /* sic: munmap(a); */
 ibv_synchronize();
   
 // At this point we want the HCA mapping to point to oblivion
   
 mmap(a,PAGE_SIZE,MAP_FIXED);
 ibv_synchronize();
   
 // And now we want it to point to the new allocation
   
   I use MAP_FIXED to illustrate the point, but Jeff has said the same
   address re-use happens randomly in real apps.
 
 This can be handled I think, although at some cost.  Just have the
 kernel keep track of which MMU sequence number actually invalidated each
 MR, and return (via ibv_synchronize()) the MMU change sequence number
 that userspace is in sync with.  So in the example above, the first
 synchronize after munmap() will fail to fix up the first registration,
 since it is pointing to an unmapped virtual address, and hence it will
 leave that MR on the dirty list, and return that sequence number as not
 being synced up yet.  And then the second synchronize will see that MR
 still on the dirty list, and try again to find the pages.

I agree some kind of kernel/userspace exchange of the sequence number
is necessary to make all the locking and race conditions work out.

But the problem I'm seeing is how does the sequence number get
incremented by the kernel after the mmap() call in the above sequence?
Which mmu_notifier/etc callback do you hook for that?

The *very best* hook would be one that is called when a mm has new
virtual address space allocated and the verbs layer would then take
the allocated address range and intersect it with the registration
list. Any registrations that have pages in the allocated region are
marked invalid.

Imagine every call to ibv_synchronize was prefixed with a check that
the sequence number is changed.

   This method avoids the problem you noticed, but there is extra work to
   fixup a registration that may never be used again. I strongly suspect
   that in the majority of cases this extra work should be about on the
   same order as userspace calling unregister on the MR.
 
 Yes, also it doesn't match the current MPI way of lazily unregistering
 things, and only garbage collecting the refcnt 0 cache entries when a
 registration fails.  With this method, if userspace unregisters
 something, it really is gone, and if it doesn't unregister it, then it
 really uses up space until userspace explicitly unregisters it.  Not
 sure how MPI implementers feel about that.

Well, mixing the lazy unregister in is not a significant change, just
don't increment the sequence number on munmap and have the kernel do
nothing until pages are mapped into an existing registration. With a
flag both behaviors are possible.

All of this work is mainly to close the hole where mapping new memory
over already registered VM results in RDMA to the wrong pages. Fixing
this hole removes the need to trap memory management syscalls and
solves that data corruption problem.

From there various optimizations can be done, like lazy garbage
collecting registrations that no longer point to mapped memory.

   Or, ignore the overlapping problem, and use your original technique,
  

Re: [ofa-general] Memory registration redux

2009-05-06 Thread Tziporet Koren

Jeff Squyres wrote:
Roland and I chatted on the phone today; I think I now understand 
Roland's counter-proposal (I clearly didn't before).  Let me try to 
summarize:


1. Add a new verb for "set this userspace flag to 1 if mr X ever
becomes invalid"
2. Add a new verb for "no longer tell me if mr X ever becomes invalid"
(i.e., remove the effects of #1)

3. Add run-time query indicating whether #1 works
4. Add [optional] memory registration caching to libibverbs

Prior to talking to Roland, I had envisioned *one* flag in userspace 
that indicated whether any memory registrations had become invalid.  
Roland's idea is that there is one flag *per registration* -- you can 
instantly tell whether a specific registration is valid.


Given this, let's keep the discussion going here in email -- perhaps 
the teleconference next Monday may become moot.

I think the new proposal is good (but I am not an MPI expert).
If we implement it soon we will be able to enable it in OFED 1.5 too.
I think the cache in libibverbs can be delayed since it can be added 
after the kernel API is available.


Tziporet




Re: [ofa-general] Memory registration redux

2009-05-06 Thread Jeff Squyres

On May 6, 2009, at 10:09 AM, Tziporet Koren wrote:


I think the new proposal is good (but I am not MPI expert)
If we implement it soon we will be able to enable it in OFED 1.5 too



That sounds good, as long as we don't diverge from upstream (like what
happened with XRC).



I think the cache in libibverbs can be delayed since it can be added
after the kernel API is available




Fair enough.

--
Jeff Squyres
Cisco Systems



Re: [ofa-general] Memory registration redux

2009-05-06 Thread Roland Dreier
  Roland and I chatted on the phone today; I think I now understand
  Roland's counter-proposal (I clearly didn't before).  Let me try to
  summarize:
  
  1. Add a new verb for "set this userspace flag to 1 if mr X ever
  becomes invalid"
  2. Add a new verb for "no longer tell me if mr X ever becomes invalid"
  (i.e., remove the effects of #1)
  3. Add run-time query indicating whether #1 works
  4. Add [optional] memory registration caching to libibverbs

Looking closer at how to actually implement this, I see that the MMU
notifiers (cf linux/mmu_notifier.h) may be called with locks held, so
the kernel can't do a put_user() or the equivalent from the notifier.
Therefore I think the interface we would expose to userspace would be
something more like mmap() on some special file to get some kernel
memory mapped into userspace, and then ioctl() to register/unregister a
"set this flag if address range X...Y is affected" request.
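As a rough sketch of what such an interface might look like from the userspace side -- the device semantics, struct layout, and ioctl numbers below are all invented for illustration, not a proposed ABI:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Hypothetical ABI: userspace mmap()s a page of flag words from a
 * special file, then asks the kernel to set word 'flag_index' whenever
 * any part of [start, end) is invalidated. */
struct flagpage_register {
    uint64_t start;       /* first byte of the range to watch */
    uint64_t end;         /* one past the last byte */
    uint64_t flag_index;  /* which word in the mmap()ed page to set */
};

/* Invented ioctl numbers, in the style of the ummunot test program. */
#define FLAGPAGE_IOC_REGISTER   _IOW('F', 1, struct flagpage_register)
#define FLAGPAGE_IOC_UNREGISTER _IOW('F', 2, uint64_t)

/* Ask the kernel to flag invalidations of [start, end) in slot idx. */
static int watch_range(int fd, uint64_t start, uint64_t end, uint64_t idx)
{
    struct flagpage_register r = { start, end, idx };
    return ioctl(fd, FLAGPAGE_IOC_REGISTER, &r);
}
```

Note that the size of the mmap()ed flag area is exactly what bounds the number of watchable ranges, which is the static-limit problem mentioned below.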

To be honest I don't really love this idea -- the kernel still needs a
fairly complicated data structure to track the registered address
ranges efficiently, the size of the mmap() caps the number of ranges
that can be tracked at a static limit set at initialization time (or
handling multiple maps gets still more complex), and there is some
careful thinking required to make sure there are no memory ordering or
cache aliasing issues.

So then I thought some about how to implement the full MR cache in the
kernel.  And that fairly quickly gets into some complex stuff as well --
for example, since we can't take sleeping locks from MMU notifiers, but
we can't hold non-sleeping locks across MR register operations, we need
to drop our MR cache lock while registering things, which forces us to
deal with rolling back registrations if we miss the cache initially but
then find that another thread has already added a registration to the
cache while we were trying to register the same memory.  Keeping the
actual MR caching in userspace does seem to make things simpler because
the locking is much easier without having to worry about sleeping
vs. non-sleeping locks.

Also doing the cache in userspace with my flag idea above has the nice
property that the fast path of hitting the cache on memory registration
has no system call and in fact testing the flag may even be a CPU cache
hit if memory registration is a hot enough path.  Doing it in the kernel
means even the best case has a system call -- which is very cheap with
current CPUs but still a non-zero cost.

So I'm really not sure what the right way to go is yet.  Further
opinions would be helpful.

 - R.


Re: [ofa-general] Memory registration redux

2009-05-06 Thread Roland Dreier
By the way, what's the desired behavior of the cache if a process
registers, say, address range 0x1000 ... 0x3fff, and then the same
process registers address range 0x2000 ... 0x2fff (with all the same
permissions, etc)?

The initial registration creates an MR that is still valid for the
smaller virtual address range, so the second registration is much
cheaper if we used the cached registration; but if we use the cache for
the second registration, and then deregister the first one, we're stuck
with a too-big range pinned in the cache because of the second
registration.

 - R.


Re: [ofa-general] Memory registration redux

2009-05-06 Thread Jason Gunthorpe
On Wed, May 06, 2009 at 01:10:47PM -0700, Roland Dreier wrote:
 By the way, what's the desired behavior of the cache if a process
 registers, say, address range 0x1000 ... 0x3fff, and then the same
 process registers address range 0x2000 ... 0x2fff (with all the same
 permissions, etc)?
 
 The initial registration creates an MR that is still valid for the
 smaller virtual address range, so the second registration is much
 cheaper if we used the cached registration; but if we use the cache for
 the second registration, and then deregister the first one, we're stuck
 with a too-big range pinned in the cache because of the second
 registration.

Yuk, doesn't this problem pretty much doom this method entirely? You
can't tear down the entire registration of 0x1000 ... 0x3fff if the app
does something to change 0x2000 .. 0x2fff because it may have active
RDMAs going on in 0x1000 ... 0x1fff.

The above could happen through strange use of brk.

What about a slightly different twist.. Instead of trying to make
everything synchronous in the mmu_notifier, just have a counter mapped
to user space. Increment the counter whenever the mms change from the
notifier. Pin the user page that contains the single counter upon
starting the process so access is lockless.

In user space, check the counter before every cache lookup and if it
has changed call back into the kernel to resynchronize the MR tables in
the HCA to the current VM.
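Jason's counter scheme can be simulated in plain C to show the fast/slow path split; here resync() merely stands in for the hypothetical "resynchronize the MR tables" kernel call, and all names are invented:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: the kernel would bump *gen from its MMU-notifier path;
 * userspace snapshots the value it last synchronized against and only
 * enters the slow path when the two differ. */
struct mr_cache {
    const volatile uint64_t *gen;  /* shared, kernel-incremented counter */
    uint64_t last_gen;             /* value at our last resynchronize */
    int resyncs;                   /* how many slow paths we took */
};

static void resync(struct mr_cache *c)
{
    c->resyncs++;                  /* would call into the kernel here */
    c->last_gen = *c->gen;
}

/* Called before every cache lookup: lockless, one shared-memory read. */
static void maybe_resync(struct mr_cache *c)
{
    if (*c->gen != c->last_gen)
        resync(c);
}
```

The fast path is a single comparison; all the difficulty Roland raises below lives inside resync().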

Avoids the locking and racing problems, keeps the fast path in the
user space and avoids the above question about how to deal with
arbitrary actions?

Jason


Re: [ofa-general] Memory registration redux

2009-05-06 Thread Roland Dreier
  Yuk, doesn't this problem pretty much doom this method entirely? You
  can't tear down the entire registration of 0x1000 ... 0x3fff if the app
  does something to change 0x2000 .. 0x2fff because it may have active
  RDMAs going on in 0x1000 ... 0x1fff.

Yes, I guess if we try to reuse registrations like this then we run into
trouble.  I think your example points to a problem if an app registers
0x1000...0x3fff and then we reuse that registration for 0x2000...0x2fff
and also for 0x1000...0x1fff, and then the app unregisters 0x1000...0x3fff.

But we can get around this just by not ever reusing registrations that
way -- only treat something as a cache hit if it matches the start and
length exactly.
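The exact-match policy is easy to state in code. This sketch (struct and names invented) shows why a request for the contained range 0x2000...0x2fff misses even though a registration for 0x1000...0x3fff covers it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry in a hypothetical userspace registration cache. */
struct reg_entry {
    uintptr_t start;
    size_t    length;
    uint32_t  lkey;
};

/* Hit only on an exact (start, length) match, so deregistering one
 * range can never strand a larger registration that other cached
 * entries secretly depend on. */
static const struct reg_entry *
cache_lookup(const struct reg_entry *tab, size_t n,
             uintptr_t start, size_t length)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].start == start && tab[i].length == length)
            return &tab[i];
    return NULL;   /* contained or overlapping ranges register afresh */
}
```

The cost of this policy is the one Jason points out next: repeated small registrations inside one big allocation never share an MR.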

  What about a slightly different twist.. Instead of trying to make
  everything synchronous in the mmu_notifier, just have a counter mapped
  to user space. Increment the counter whenever the mms change from the
  notifier. Pin the user page that contains the single counter upon
  starting the process so access is lockless.
  
  In user space, check the counter before every cache lookup and if it
  has changed call back into the kernel to resynchronize the MR tables in
  the HCA to the current VM.
  
  Avoids the locking and racing problems, keeps the fast path in the
  user space and avoids the above question about how to deal with
  arbitrary actions?

I like the simplicity of the fast path.  But it seems the slow path
would be hard -- how exactly did you envision resynchronizing the MR
tables?  (Considering that RDMAs might be in flight for MRs that weren't
changed by the MM operations)

 - R.


Re: [ofa-general] Memory registration redux

2009-05-06 Thread Jason Gunthorpe
On Wed, May 06, 2009 at 02:56:25PM -0700, Roland Dreier wrote:
   Yuk, doesn't this problem pretty much doom this method entirely? You
   can't tear down the entire registration of 0x1000 ... 0x3fff if the app
   does something to change 0x2000 .. 0x2fff because it may have active
   RDMAs going on in 0x1000 ... 0x1fff.
 
 Yes, I guess if we try to reuse registrations like this then we run into
 trouble.  I think your example points to a problem if an app registers
 0x1000...0x3fff and then we reuse that registration for 0x2000...0x2fff
 and also for 0x1000...0x1fff, and then the app unregisters 0x1000...0x3fff.
 
 But we can get around this just by not ever reusing registrations that
 way -- only treat something as a cache hit if it matches the start and
 length exactly.

I can't comment on that, but it feels to me like a reasonable MPI use
model would be to do small IOs randomly from the same allocation, and
pre-hint to the library it wants that whole area cached in one shot.

   What about a slightly different twist.. Instead of trying to make
   everything synchronous in the mmu_notifier, just have a counter mapped
   to user space. Increment the counter whenever the mms change from the
   notifier. Pin the user page that contains the single counter upon
   starting the process so access is lockless.
   
   In user space, check the counter before every cache lookup and if it
   has changed call back into the kernel to resynchronize the MR tables in
   the HCA to the current VM.
   
   Avoids the locking and racing problems, keeps the fast path in the
   user space and avoids the above question about how to deal with
   arbitrary actions?
 
 I like the simplicity of the fast path.  But it seems the slow path
 would be hard -- how exactly did you envision resynchronizing the MR
 tables?  (Considering that RDMAs might be in flight for MRs that weren't
 changed by the MM operations)

Well, this conceptually doesn't seem hard. Go through all the pages in
the MR, if any have changed then pin the new page and replace the
page's physical address in the HCA's page table. Once done, synchronize
with the hardware, then run through again and un-pin and release all
the replaced pages.
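A userspace simulation of that walk, with plain arrays standing in for the HCA's page table and the current VM state; in the real thing get_current_pages would be a get_user_pages()-style lookup and the hardware sync would sit between install and un-pin (all names here are invented):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare each page the HCA maps against the page currently backing
 * that virtual address.  Install the new page first and only collect
 * the old one for later un-pinning, so the HCA never sees an unmapped
 * slot.  Returns the number of pages replaced. */
static size_t resync_mr_pages(uint64_t *hca_pages,           /* HCA's view */
                              const uint64_t *current_pages, /* VM's view  */
                              size_t npages,
                              uint64_t *replaced, size_t *nreplaced)
{
    size_t changed = 0;
    for (size_t i = 0; i < npages; i++) {
        if (hca_pages[i] != current_pages[i]) {
            replaced[(*nreplaced)++] = hca_pages[i]; /* un-pin later */
            hca_pages[i] = current_pages[i];         /* pin + install */
            changed++;
        }
    }
    /* ...synchronize with the hardware here, then un-pin 'replaced'... */
    return changed;
}
```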

Every HCA must have the necessary primitives for this to support
register and unregister...

An RDMA that is in progress to any page that is replaced is a
'use after free' type programming error. (And this means certain wacky
uses, like using MAP_FIXED on memory that is under active RDMA,
would be unsupported without an additional call)

Doing this on a page by page basis rather than on a registration by
registration basis is granular enough to avoid the problem you
noticed.

The mmu notifiers can optionally make note of the affected pages
during the callback to reduce the workload of the syscall.

Only part I don't immediately see is how to trap creation of new VM
(ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..

Jason


Re: [ofa-general] Memory registration redux

2009-05-06 Thread Roland Dreier
  Well, this conceptually doesn't seem hard. Go through all the pages in
  the MR, if any have changed then pin the new page and replace the
  pages physical address in the HCA's page table. Once done, synchronize
  with the hardware, then run through again and un-pin and release all
  the replaced pages.
  
  Every HCA must have the necessary primitives for this to support
  register and unregister...

No... every HCA just needs to support register and unregister.  It
doesn't have to support changing the mapping without full unregister and
reregister.

Also this requires potentially walking the page tables of the entire
process, checking to see if any mappings have changed.  We really want
to keep the information that the MMU notifiers give us, namely which
virtual address range is changing.

  The mmu notifiers can optionally make note of the affected pages
  during the callback to reduce the workload of the syscall.

This requires an unbounded amount of events to be queued up in the
kernel, naively.  (If we lose some events then we have to go back to the
full page table scan, which I don't think is feasible)

  Only part I don't immediately see is how to trap creation of new VM
  (ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..

Why do we care?  The initial faulting in of mappings occurs when an MR
is created.

 - R.


Re: [ofa-general] Memory registration redux

2009-05-06 Thread Jason Gunthorpe
On Wed, May 06, 2009 at 03:39:54PM -0700, Roland Dreier wrote:
   Well, this conceptually doesn't seem hard. Go through all the pages in
   the MR, if any have changed then pin the new page and replace the
   pages physical address in the HCA's page table. Once done, synchronize
   with the hardware, then run through again and un-pin and release all
   the replaced pages.
   
   Every HCA must have the necessary primitives for this to support
   register and unregister...
 
 No... every HCA just needs to support register and unregister.  It
 doesn't have to support changing the mapping without full unregister and
 reregister.

Well, I would imagine this entire process to be a HCA specific
operation, so HW that supports a better method can use it, otherwise
it has to register/unregister. Is this a concern today with existing
HCAs?

Using register/unregister exposes a race for the original case you
brought up - but that race is completely unfixable without hardware
support. At least it now becomes a hw specific race that can be
printk'd and someday fixed in new HW rather than an unsolvable API
problem..

 Also this requires potentially walking the page tables of the entire
 process, checking to see if any mappings have changed.  We really want
 to keep the information that the MMU notifiers give us, namely which
 virtual address range is changing.

Walking the page tables of every registration in the process, not the
entire process.

   The mmu notifiers can optionally make note of the affected pages
   during the callback to reduce the workload of the syscall.
 
 This requires an unbounded amount of events to be queued up in the
 kernel, naively.  (If we lose some events then we have to go back to the
 full page table scan, which I don't think is feasible)

I was thinking more along the lines of having the mmu notifiers put
affected registrations on a per-process (or PD?) dirty linked list,
with the link pointers as part of the registration structure. Set a
dirty flag in the registration too. An extra pointer per registration
and a minor incremental cost to the existing work the mmu notifier
would have to do.
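The intrusive dirty list might look like this; struct and names are illustrative only. Embedding the link pointer in the registration lets the MMU notifier queue an entry in O(1) with no allocation, and marking an already-dirty MR a second time is free:

```c
#include <assert.h>
#include <stddef.h>

/* A registration with its own intrusive list link plus a dirty flag. */
struct mr_reg {
    struct mr_reg *dirty_next;  /* link, only meaningful when dirty */
    int dirty;
};

struct mr_dirty_list {
    struct mr_reg *head;        /* per-process (or per-PD) dirty list */
};

/* What the MMU notifier would do for each affected registration. */
static void mark_dirty(struct mr_dirty_list *l, struct mr_reg *mr)
{
    if (mr->dirty)
        return;                 /* already queued, nothing more to do */
    mr->dirty = 1;
    mr->dirty_next = l->head;
    l->head = mr;
}
```

A later ibv_get_invalid_mrs()-style call would then just walk and drain this list.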

   Only part I don't immediately see is how to trap creation of new VM
   (ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..
 
 Why do we care?  The initial faulting in of mappings occurs when an MR
 is created.

Well, exactly, that's the problem. If you can't trap mmap you cannot
do the initial faulting and mapping for a new object that is being
mapped into an existing MR.

Consider:

  void *a = mmap(0,PAGE_SIZE..);
  ibv_register();
  // [..]
  munmap(a);
  ibv_synchronize();

  // At this point we want the HCA mapping to point to oblivion

  mmap(a,PAGE_SIZE,MAP_FIXED);
  ibv_synchronize();

  // And now we want it to point to the new allocation

I use MAP_FIXED to illustrate the point, but Jeff has said the same
address re-use happens randomly in real apps.

This is the main deviation from your original idea, instead of having
a granular notification to userspace to unregister a region, the
kernel just goes and fixes it up so the existing registration still
works.

This method avoids the problem you noticed, but there is extra work to
fixup a registration that may never be used again. I strongly suspect
that in the majority of cases this extra work should be about on the
same order as userspace calling unregister on the MR.

Or, ignore the overlapping problem, and use your original technique,
slightly modified:
 - Userspace registers a counter with the kernel. Kernel pins the
   page, sets up mmu notifiers and increments the counter when
   invalidates intersect with registrations
 - Kernel maintains a linked list of registrations that have been
   invalidated via mmu notifiers using the registration structure
   and a dirty bit
 - Userspace checks the counter at every cache lookup; if it has
   changed, it calls into the kernel:

   MR_Cookie *mrs[100];
   int rc;
   do {
       rc = ibv_get_invalid_mrs(mrs, 100);
       invalidate_cache(mrs, rc);
   } while (rc == 100);   /* repeat until drained */

   get_invalid_mrs traverses the linked list and returns an
   identifying value to userspace, which looks it up in the cache,
   calls unregister and removes it from the cache.

Jason