[ofa-general] [PATCH] parse_node_map: print parse errors
Hello,

could you please add the patch below, without it I probably never would have realized why my node name map was not accepted.

Btw, I'm a bit surprised there don't seem to be any default wrappers, for fopen(), fclose(), malloc(), fprintf(), etc.

diff -rup opensm-3.2.1.old/complib/cl_nodenamemap.c opensm-3.2.1/complib/cl_nodenamemap.c
--- opensm-3.2.1.old/complib/cl_nodenamemap.c 2008-04-03 13:17:35.0 +0200
+++ opensm-3.2.1/complib/cl_nodenamemap.c 2008-04-04 11:09:42.0 +0200
@@ -55,8 +55,11 @@ static int map_name(void *cxt, uint64_t
 		return 0;
 
 	item = malloc(sizeof(*item));
-	if (!item)
+	if (!item) {
+		fprintf(stderr, "Malloc failed, sizeof(*item) = %d.\n", sizeof(*item));
 		return -1;
+	}
+
 	item->guid = guid;
 	item->name = strdup(p);
 	cl_qmap_insert(map, item->guid, (cl_map_item_t *)item);
@@ -169,6 +172,8 @@ int parse_node_map(const char *file_name
 		guid = strtoull(p, &e, 0);
 		if (e == p || (!isspace(*e) && *e != '#' && *e != '\0')) {
 			fclose(f);
+			fprintf (stderr, "%s: Parse error in line: %s\n",
+				__func__, line);
 			return -1;
 		}
 
Thanks,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
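The parse check touched by the second hunk can be tried in isolation. The following is a minimal sketch, not the opensm code: `parse_guid_prefix` is a hypothetical helper name, and it only mirrors the `strtoull` validity test from the patch while reporting failures on stderr.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of opensm): parse the leading GUID of one
 * node-name-map line. Returns 0 and stores the GUID on success; prints a
 * diagnostic and returns -1 on a malformed line.
 */
static int parse_guid_prefix(const char *line, uint64_t *guid)
{
	char *e;
	unsigned long long v;

	v = strtoull(line, &e, 0);
	/* Same test as the patch: some digits must be consumed, and they
	 * must be followed by whitespace, a comment, or end of string. */
	if (e == line ||
	    (!isspace((unsigned char)*e) && *e != '#' && *e != '\0')) {
		fprintf(stderr, "%s: Parse error in line: %s\n", __func__, line);
		return -1;
	}
	*guid = (uint64_t)v;
	return 0;
}
```

With a diagnostic like this, a map line that starts with the node name instead of the GUID is reported instead of silently rejected.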
[ofa-general] ERR 0108: Unknown remote side
Hello,

opensm-3.2.1 logs some error messages like this:

Apr 04 00:00:08 325114 [4580A960] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0x000b8c002ba2(SW_pfs1_leaf4) port 13. Adding to light sweep sampling list
Apr 04 00:00:08 325126 [4580A960] 0x01 -> Directed Path Dump of 3 hop path:
	Path = 0,1,14,13

From ibnetdiscover output I see port13 of this switch is a switch-interconnect (sorry, I don't know what the correct name/identifier for switches within switches):

[13]S-000b8c002bfa[13]# SW_pfs1_inter7 lid 263 4xSDR

Apr 04 00:00:08 325219 [4580A960] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0x000b8c002bf9(SW_pfs1_inter6) port 9. Adding to light sweep sampling list
Apr 04 00:00:08 325234 [4580A960] 0x01 -> Directed Path Dump of 2 hop path:
	Path = 0,1,18

This is again an interconnection:

[9] S-000b8c002b9e[15]# SW_pfs1_leaf1 lid 177 4xDDR

Apr 04 00:00:08 325288 [4580A960] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0x000b8c002bfa(SW_pfs1_inter7) port 13. Adding to light sweep sampling list
Apr 04 00:00:08 325301 [4580A960] 0x01 -> Directed Path Dump of 2 hop path:
	Path = 0,1,14

And again an interconnection:

[13]S-000b8c002ba2[13]# SW_pfs1_leaf4 lid 182 4xDDR

All the other interconnections seem to be fine.

Thanks,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
[ofa-general] Re: EMM: disable other notifiers before register and unregister
On Thu, Apr 03, 2008 at 12:20:41PM -0700, Christoph Lameter wrote:
> On Thu, 3 Apr 2008, Andrea Arcangeli wrote:
>
> > My attempt to fix this once and for all is to walk all vmas of the mm inside mmu_notifier_register and take all anon_vma locks and i_mmap_locks in virtual address order in a row. It's ok to take those inside the mmap_sem. Supposedly if anybody will ever take a double lock it'll do in order too. Then I can dump all the other locking and
>
> What about concurrent mmu_notifier registrations from two mm_structs that have shared mappings? Isn't there a potential deadlock situation?

No, the ordering of the lock avoids that. Here is a snippet.

/*
 * This operation locks against the VM for all pte/vma/mm related
 * operations that could ever happen on a certain mm. This includes
 * vmtruncate, try_to_unmap, and all page faults. The holder
 * must not hold any mm related lock. A single task can't take more
 * than one mm lock in a row or it would deadlock.
 */

So you can't do:

	mm_lock(mm1);
	mm_lock(mm2);

But if two different tasks run the mm_lock everything is ok. Each task in the system can lock at most 1 mm at a time.

> Well good luck. Hopefully we will get to something that works.

Looks good so far but I didn't finish it yet.
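The ordering argument can be illustrated outside the kernel. This is a user-space sketch with pthread mutexes, not the mm_lock code under discussion; `lock_all_in_order` and `unlock_all` are hypothetical names. Sorting the locks by address before taking them means that any two tasks needing an overlapping set acquire the shared locks in the same order, so neither can hold a later lock while waiting for an earlier one.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort callback: compare two mutex pointers by address. */
static int cmp_lock_addr(const void *a, const void *b)
{
	uintptr_t x = (uintptr_t)*(pthread_mutex_t *const *)a;
	uintptr_t y = (uintptr_t)*(pthread_mutex_t *const *)b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: sort the array in place by address, then lock
 * every mutex in that order. The array must not contain duplicates.
 * Returns 0 on success, or the pthread error code after releasing
 * whatever was already taken.
 */
static int lock_all_in_order(pthread_mutex_t **locks, size_t n)
{
	size_t i;
	int err;

	qsort(locks, n, sizeof(*locks), cmp_lock_addr);
	for (i = 0; i < n; i++) {
		err = pthread_mutex_lock(locks[i]);
		if (err) {
			while (i--)
				pthread_mutex_unlock(locks[i]);
			return err;
		}
	}
	return 0;
}

/* Release in reverse order of acquisition. */
static void unlock_all(pthread_mutex_t **locks, size_t n)
{
	while (n--)
		pthread_mutex_unlock(locks[n]);
}
```

The sketch only covers the ordering; it says nothing about how the kernel patch collects the anon_vma locks and i_mmap_locks in the first place.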
[ofa-general] can not join due to rate:2.5Gbps group:10Gbps?
I'm trying to get a few nodes here connected with IPoIB. On the first node I have tried with, after ifconfig'ing the interface into the network with other IPoIB nodes I cannot seem to ping any other nodes.

I ran ibdiagnet and got a /tmp/ibdiagnet.pkey file with the following contents:

sata14:/ # cat /tmp/ibdiagnet.pkey
GROUP PKey:0x7fff Hosts:4
Full sata15/P2 lid=0x0004 guid=0x00066a01a363 dev=23108
Full sata14/P2 lid=0x0006 guid=0x00066a01a2bf dev=23108
Full sata23/P2 lid=0x0008 guid=0x00066a01a2fe dev=23108
Full sata16/P2 lid=0x0007 guid=0x00066a01a2c1 dev=23108

When I run an ibdiagpath -l 0x0004 I get the following:

-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
-I- Using port 2 as the local port.
-I---
-I- Traversing the path from local to destination
-I---
-I- From: lid=0x0006 guid=0x00066a01a2bf dev=23108 sata14/P2
-I- To: lid=0x0001 guid=0x00066a00c8000180 dev=5 Port=1
-I- From: lid=0x0001 guid=0x00066a00c8000180 dev=5 Port=2
-I- To: lid=0x0004 guid=0x00066a01a363 dev=23108 sata15/P2
-I---
-I- PM Counters Info
-I---
-I- No illegal PM counters values were found
-I---
-I- Path Partitions Report
-I---
-I- Source sata14/P2 lid=0x0006 guid=0x00066a01a2bf dev=23108 Port 2 PKeys:0x
-I- Destination sata15 lid=0x0004 guid=0x00066a01a363 dev=23108 PKeys:0x
-I- Path shared PKeys: 0x
-I---
-I- IPoIB Path Check
-I---
-I- Subnet: IPv4 PKey:0x7fff QKey:0x MTU:2048Byte rate:10Gbps SL:0x00
-W- Port sata14/P2 lid=0x0006 guid=0x00066a01a2bf dev=23108 can not join due to rate:2.5Gbps group:10Gbps
-W- Port sata15/P2 lid=0x0004 guid=0x00066a01a363 dev=23108 can not join due to rate:2.5Gbps group:10Gbps
-E- No IPoIB Subnets found on Path! Nodes can not communicate via IPoIB!
-I---
-I- QoS on Path Check
-I---
-W- Blocked VLs:4 5 at node:sata14 lid=0x0006 guid=0x00066a01a2bf dev=23108 port:2
-W- Blocked VLs:4 5 at node: lid=0x0001 guid=0x00066a00c8000180 dev=5 port:2
-I- The following SLs can be used:0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-I- Done. Run time was 0 seconds.

That IPoIB Path Check looks a bit alarming. Anyone have any suggestions?

b.
[ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?
Or Gerlitz wrote: On Thu, Apr 3, 2008 at 6:17 PM, Steve Wise [EMAIL PROTECTED] wrote: I think RDS might be getting confused because the 10GbE rnic shows up as a dumb NIC hooked into the native TCP stack -and- an rdma device. Jon Mason will be working to enable RDS soon on the chelsio device. He'll feed back the changes needed, if any, to RDS. Stay tuned. Steve, I understand that a similar work has been done at least to some extent with open MPI, and I will be very happy to hear the lessons learned. Did you manage to have the same (say point to point) open mpi transport design/code work over rdma-cm over both IB and iWARP? Definitely. We're running over rdma-cm over mthca and cxgb3 on 2 nodes today. 8 nodes over cxgb3. We're working out the details now. Can someone from OGC or Chelsio drive a BOF on that in Sonoma? If not, can some notes be sent to the list? I say lets learn from what you did so far... We won't be in Sonoma, but perhaps Jon can email some info to the list on what we've done to-date for open mpi. Steve. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?
On Fri, 2008-04-04 at 10:36 -0400, Brian J. Murrell wrote: I'm trying to get a few nodes here connected with IPoIB. On the first node I have tried with, after ifconfig'ing the interface into the network with other IPoIB nodes I cannot seem to ping any other nodes. I ran ibdiagnet and got a /tmp/ibdiagnet.pkey file with the following contents: sata14:/ # cat /tmp/ibdiagnet.pkey GROUP PKey:0x7fff Hosts:4 Full sata15/P2 lid=0x0004 guid=0x00066a01a363 dev=23108 Full sata14/P2 lid=0x0006 guid=0x00066a01a2bf dev=23108 Full sata23/P2 lid=0x0008 guid=0x00066a01a2fe dev=23108 Full sata16/P2 lid=0x0007 guid=0x00066a01a2c1 dev=23108 When I run an ibdiagpath -l 0x0004 I get the following: -W- Topology file is not specified. Reports regarding cluster links will use direct routes. -I- Using port 2 as the local port. -I--- -I- Traversing the path from local to destination -I--- -I- From: lid=0x0006 guid=0x00066a01a2bf dev=23108 sata14/P2 -I- To: lid=0x0001 guid=0x00066a00c8000180 dev=5 Port=1 -I- From: lid=0x0001 guid=0x00066a00c8000180 dev=5 Port=2 -I- To: lid=0x0004 guid=0x00066a01a363 dev=23108 sata15/P2 -I--- -I- PM Counters Info -I--- -I- No illegal PM counters values were found -I--- -I- Path Partitions Report -I--- -I- Source sata14/P2 lid=0x0006 guid=0x00066a01a2bf dev=23108 Port 2 PKeys:0x -I- Destination sata15 lid=0x0004 guid=0x00066a01a363 dev=23108 PKeys:0x -I- Path shared PKeys: 0x -I--- -I- IPoIB Path Check -I--- -I- Subnet: IPv4 PKey:0x7fff QKey:0x MTU:2048Byte rate:10Gbps SL:0x00 -W- Port sata14/P2 lid=0x0006 guid=0x00066a01a2bf dev=23108 can not join due to rate:2.5Gbps group:10Gbps -W- Port sata15/P2 lid=0x0004 guid=0x00066a01a363 dev=23108 can not join due to rate:2.5Gbps group:10Gbps -E- No IPoIB Subnets found on Path! Nodes can not communicate via IPoIB! 
-I--- -I- QoS on Path Check -I--- -W- Blocked VLs:4 5 at node:sata14 lid=0x0006 guid=0x00066a01a2bf dev=23108 port:2 -W- Blocked VLs:4 5 at node: lid=0x0001 guid=0x00066a00c8000180 dev=5 port:2 -I- The following SLs can be used:0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 -I- Done. Run time was 0 seconds. That IPoIB Path Check looks a bit alarming. Anyone have any suggestions? Looks like you have a mixed rate set of ports so you need to configure the group to 2.5 Gbps. What SM are you using ? -- Hal b. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?
On Fri, 2008-04-04 at 07:55 -0700, Hal Rosenstock wrote: Looks like you have a mixed rate set of ports so you need to configure the group to 2.5 Gbps. I'm a bit green with I/B, so please bear with me if you can. I do understand that there can be mixed rates depending on hardware. But the hardware guys assure me the cards in these machines should be able to do 10Gbps. Maybe they are wrong. The card is listing as: 06:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) What SM are you using ? That's a good question. I suspect it's running on the switch. I don't know any details on the switch (yet) though. I will need to engage the hardware folks to determine this. I did get an error when when ran ibdiagnet about more than 1 master SM running when I started opensmd on one of the nodes and none of the other nodes are running an SM so that only leaves the switch. In my limited exposure to IB, running the SM on the switch has always yielded bad results. I will see if I can get them to disable it. b. signature.asc Description: This is a digitally signed message part ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: [ewg] OFED March 24 meeting summary on OFED 1.4 plans
What I mean by "claim to support" is to have more people test with this config.

--CQ

-----Original Message-----
From: Or Gerlitz [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 03, 2008 11:18 PM
To: Tang, Changqing
Cc: general@lists.openfabrics.org; [EMAIL PROTECTED]
Subject: Re: [ofa-general] Re: [ewg] OFED March 24 meeting summary on OFED 1.4 plans

On Thu, Apr 3, 2008 at 5:40 PM, Tang, Changqing [EMAIL PROTECTED] wrote:
> The problem is, from MPI side, (and by default), we don't know which port is on which fabric, since the subnet prefix is the same. We rely on system admin to config two different subnet prefixes for HP-MPI to work. No vendor has claimed to support this.

CQ, not supporting a different subnet prefix per IB subnet is against IB nature, I don't think there should be any problem to configure a different prefix at each open SM instance and the Linux host stack would work perfectly under this config. If you are aware of any problem in the opensm and/or the host stack please let the community know and the maintainers will fix it.

Or.
RE: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?
From: Hal Rosenstock Sent: Friday, April 04, 2008 11:08 AM To: Brian J. Murrell Cc: general@lists.openfabrics.org Subject: Re: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps? On Fri, 2008-04-04 at 11:05 -0400, Brian J. Murrell wrote: On Fri, 2008-04-04 at 07:55 -0700, Hal Rosenstock wrote: Looks like you have a mixed rate set of ports so you need to configure the group to 2.5 Gbps. I'm a bit green with I/B, so please bear with me if you can. I do understand that there can be mixed rates depending on hardware. But the hardware guys assure me the cards in these machines should be able to do 10Gbps. Maybe they are wrong. The card is listing as: 06:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) I would not recommend reconfiguring your SM for this situation. Instead, you most likely have a bad cable or possibly a bad HCA or switch port. All IB products shipped within the last 6 years support 10g, so the fact your system has negotiated to 2.5g indicates a problem with the link. Bad or poorly connected cables are the typical cause. Todd Rimmer Chief Architect QLogic System Interconnect Group Voice: 610-233-4852 Fax: 610-233-4777 [EMAIL PROTECTED] www.QLogic.com ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?
On Fri, 2008-04-04 at 10:14 -0500, Todd Rimmer wrote: From: Hal Rosenstock Sent: Friday, April 04, 2008 11:08 AM To: Brian J. Murrell Cc: general@lists.openfabrics.org Subject: Re: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps? On Fri, 2008-04-04 at 11:05 -0400, Brian J. Murrell wrote: On Fri, 2008-04-04 at 07:55 -0700, Hal Rosenstock wrote: Looks like you have a mixed rate set of ports so you need to configure the group to 2.5 Gbps. I'm a bit green with I/B, so please bear with me if you can. I do understand that there can be mixed rates depending on hardware. But the hardware guys assure me the cards in these machines should be able to do 10Gbps. Maybe they are wrong. The card is listing as: 06:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) I would not recommend reconfiguring your SM for this situation. Instead, you most likely have a bad cable or possibly a bad HCA or switch port. All IB products shipped within the last 6 years support 10g, so the fact your system has negotiated to 2.5g indicates a problem with the link. Bad or poorly connected cables are the typical cause. Yes, this seems right; I misread this as the DDR/SDR issue. I would doubt he has any 1x hardware. -- Hal Todd Rimmer Chief Architect QLogic System Interconnect Group Voice: 610-233-4852 Fax: 610-233-4777 [EMAIL PROTECTED] www.QLogic.com ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
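The two figures in the warning follow from link width times per-lane signaling rate: a 4X SDR link is 4 x 2.5 = 10 Gbps, while a link that has trained down to 1X SDR gives 2.5 Gbps. A small sketch of that arithmetic, with a hypothetical function name and enum that are not taken from any IB library:

```c
#include <assert.h>

/* Per-lane signaling rate in units of 0.5 Gbps, so the math stays integral. */
enum lane_speed { LANE_SDR = 5, LANE_DDR = 10, LANE_QDR = 20 };

/* Hypothetical helper: link rate in Mbps for a given width (1, 4, 8 or 12). */
static unsigned link_rate_mbps(unsigned width, enum lane_speed speed)
{
	return width * (unsigned)speed * 500u;
}
```

Under these assumptions `link_rate_mbps(1, LANE_SDR)` is 2500 and `link_rate_mbps(4, LANE_SDR)` is 10000, which matches the rate:2.5Gbps versus group:10Gbps mismatch in the ibdiagpath output.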
RE: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?
On Fri, 2008-04-04 at 10:14 -0500, Todd Rimmer wrote: I would not recommend reconfiguring your SM for this situation. Indeed, if what you say below pans out, I'd rather not. Instead, you most likely have a bad cable or possibly a bad HCA or switch port. All IB products shipped within the last 6 years support 10g, so the fact your system has negotiated to 2.5g indicates a problem with the link. OK. I will investigate this. Is there any more direct method of determining what rate an HCA has negotiated than using the ibdiagpath -l $nid mechanism that I have been using? It seems like a kind of round-about method of getting that information. Bad or poorly connected cables are the typical cause. I will have the hardware guys take another look at that. Thanx for all the pointers! b. signature.asc Description: This is a digitally signed message part ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: [ewg] OFED March 24 meeting summary on OFED 1.4 plans
> > for example, in MPI, process A know the HCA guid on another node. After running for some time, the switch is restarted for some reason, and the whole fabric is re-configured.
>
> CQ, If by the whole fabric is re-configured you refer to a case where a subnet prefix changes while a job runs and a process is detached/reattached to the job so now you want to adopt your design to handle it, is over engineering, why you want to do that?

I am concerned about the port LID change. It is always best if a process can figure out the info it needs by itself; an SA query is the right way and is in the IB spec.

While it is possible to let processes exchange information (port LID) again, there are difficulties: in the middle of a long job run, it is hard to let two processes coordinate such an information exchange, and it requires a second channel to do so. If the second channel is IPoIB, it is broken as well, and we need to re-establish it.

I just ask for the SA functionalities. If it is not possible, we have to use a very complicated way to let HP-MPI survive a network failure.

--CQ

> Or.
RE: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?
On Fri, 2008-04-04 at 11:25 -0400, Brian J. Murrell wrote: On Fri, 2008-04-04 at 10:14 -0500, Todd Rimmer wrote: I would not recommend reconfiguring your SM for this situation. Indeed, if what you say below pans out, I'd rather not. Instead, you most likely have a bad cable or possibly a bad HCA or switch port. All IB products shipped within the last 6 years support 10g, so the fact your system has negotiated to 2.5g indicates a problem with the link. OK. I will investigate this. Is there any more direct method of determining what rate an HCA has negotiated than using the ibdiagpath -l $nid mechanism that I have been using? It seems like a kind of round-about method of getting that information. Try ibcheckwidth for this particular problem Bad or poorly connected cables are the typical cause. I will have the hardware guys take another look at that. Thanx for all the pointers! b. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] linux-next: infiniband build failure
> drivers/infiniband/hw/ehca/ehca_reqs.c: In function 'ehca_write_swqe':
> drivers/infiniband/hw/ehca/ehca_reqs.c:191: error: 'const struct ib_send_wr' has no member named 'imm_data'

Oops, thanks, I forgot to run my cross-compile (and ehca is ppc only).

Anyway, your fix is correct and I rolled it into my patch.

Thanks!
Re: [ofa-general] [PATCH/RFC 2/2] RDMA/amso1100: Add support for send with invalidate work requests
At 08:52 PM 4/3/2008, Roland Dreier wrote:
> But does this code start working if we add the two patches I posted? I don't understand how you could do anything useful with the current state of things plus send w/inval for amso1100.

Does send w/inv actually work end-to-end on the Ammasso? Who's testing it? Just wondering.

Tom.
Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?
> If not, can some notes be sent to the list? I say lets learn from what you did so far...

In my experience, getting code to work over both IB and iWARP isn't that hard. The main points are:

 - Use the RDMA CM for connection establishment (duh)
 - Memory regions used to receive RDMA read responses must have remote write permission (since in the iWARP protocol, RDMA read responses are basically the same as incoming RDMA write requests)
 - Active side of the connection must do the first operation
 - Don't use IB-specific features (atomics, immediate data)

 - R.
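The memory-region point can be sketched as a small flag-selection helper. This is an illustration only: the enums below are stand-ins defined for the sketch, and `read_sink_access` is a hypothetical name. Real code would use `IBV_ACCESS_LOCAL_WRITE`, `IBV_ACCESS_REMOTE_WRITE` and the device's `transport_type` from `<infiniband/verbs.h>`.

```c
#include <assert.h>

/*
 * Stand-in constants for this sketch only; they are not the libibverbs
 * definitions.
 */
enum sketch_transport { SKETCH_TRANSPORT_IB, SKETCH_TRANSPORT_IWARP };
enum sketch_access {
	SKETCH_ACCESS_LOCAL_WRITE  = 1 << 0,
	SKETCH_ACCESS_REMOTE_WRITE = 1 << 1,
};

/*
 * Hypothetical helper: access flags for a buffer that will be the sink
 * of an RDMA read issued by this side.
 */
static int read_sink_access(enum sketch_transport t)
{
	int flags = SKETCH_ACCESS_LOCAL_WRITE;

	/* On iWARP the read response arrives like an incoming RDMA write. */
	if (t == SKETCH_TRANSPORT_IWARP)
		flags |= SKETCH_ACCESS_REMOTE_WRITE;
	return flags;
}
```

Code that wants a single path for both transports could presumably request remote write on every read sink, at the cost of a wider permission than IB strictly needs.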
[ofa-general] Re: linux-next: infiniband build failure
Roland wanted the ib patch to go through my tree, and I figure we will work out these issues during the 2 week merge window. Actually I said I was fine with whatever you wanted to do :) Given that the new device support for ipath seems to cause problems for ib-convert-struct-class_device-to-struct-device.patch, it seems it might be simpler for me to carry that in my tree. If someone sends me the latest patch I'll be happy to merge it in (and do the fixups for the ipath changes). Then the final struct class_device removal just needs to be merged late -- I'll send my tree to Linus to pull in the first day or two of the merge window so I shouldn't be a problem. Stephen, Greg, I really have the simplest job here managing my tree, compared to you two guys, so as before just let me know how you want to handle this ;) - R. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?
On Fri, 2008-04-04 at 08:29 -0700, Hal Rosenstock wrote:
> Try ibcheckwidth for this particular problem

Well, seems I solved the problem after finding the ibstatus command. Seems the hardware guys plugged port 2 into the switch because port 1 of one of the HCAs in one of the machines is broken.

Thanx for all of the help!

b.
Re: [ofa-general] ERR 0108: Unknown remote side
On Fri, 2008-04-04 at 11:47 +0200, Bernd Schubert wrote: Hello, opensm-3.2.1 logs some error messages like this: Apr 04 00:00:08 325114 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0 x000b8c002ba2(SW_pfs1_leaf4) port 13. Adding to light sweep sampling list Apr 04 00:00:08 325126 [4580A960] 0x01 - Directed Path Dump of 3 hop path: Path = 0,1,14,13 From ibnetdiscover output I see port13 of this switch is a switch-interconnect (sorry, I don't know what the correct name/identifier for switches within switches): [13]S-000b8c002bfa[13]# SW_pfs1_inter7 lid 263 4xSDR Apr 04 00:00:08 325219 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0 x000b8c002bf9(SW_pfs1_inter6) port 9. Adding to light sweep sampling list Apr 04 00:00:08 325234 [4580A960] 0x01 - Directed Path Dump of 2 hop path: Path = 0,1,18 This is again an interconnection: [9] S-000b8c002b9e[15]# SW_pfs1_leaf1 lid 177 4xDDR Apr 04 00:00:08 325288 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0 x000b8c002bfa(SW_pfs1_inter7) port 13. Adding to light sweep sampling list Apr 04 00:00:08 325301 [4580A960] 0x01 - Directed Path Dump of 2 hop path: Path = 0,1,14 And again an interconnection: [13]S-000b8c002ba2[13]# SW_pfs1_leaf4 lid 182 4xDDR All the other interconnections seem to be fine. Any idea if OpenSM 3.1.10 has the same issue as 3.2.1 ? Is this some large Flextronics switch ? -- Hal Thanks, Bernd ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr
AMSO1100: Add check for NULL reply_msg in c2_intr

This is a checker-found bug posted to bugzilla.kernel.org (7478). Upon inspection I also found a place where we could attempt to kmem_cache_free a null pointer.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---

Roland, I don't think anyone has ever hit this bug, so it is a low priority in my view. I also noticed that if we refactored vq_wait_for_reply that we could combine a common

	if (!reply) {
		err = -ENOMEM;
		goto bail;
	}

construct by guaranteeing that reply is non-null if vq_wait_for_reply returns without an error. This patch, however, is much smaller. What do you think?

 drivers/infiniband/hw/amso1100/c2_cq.c   |    4 ++--
 drivers/infiniband/hw/amso1100/c2_intr.c |    6 +++++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index d2b3366..bb17cce 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -422,8 +422,8 @@ void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq)
 		goto bail1;
 
 	reply = (struct c2wr_cq_destroy_rep *) (unsigned long) (vq_req->reply_msg);
-
-	vq_repbuf_free(c2dev, reply);
+	if (reply)
+		vq_repbuf_free(c2dev, reply);
 bail1:
 	vq_req_free(c2dev, vq_req);
 bail0:
diff --git a/drivers/infiniband/hw/amso1100/c2_intr.c b/drivers/infiniband/hw/amso1100/c2_intr.c
index 0d0bc33..3b50954 100644
--- a/drivers/infiniband/hw/amso1100/c2_intr.c
+++ b/drivers/infiniband/hw/amso1100/c2_intr.c
@@ -174,7 +174,11 @@ static void handle_vq(struct c2_dev *c2dev, u32 mq_index)
 		return;
 	}
 
-	err = c2_errno(reply_msg);
+	if (reply_msg)
+		err = c2_errno(reply_msg);
+	else
+		err = -ENOMEM;
+
 	if (!err) switch (req->event) {
 	case IW_CM_EVENT_ESTABLISHED:
 		c2_set_qp_state(req->qp,
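The refactoring mentioned in the note to Roland could look roughly like the following. It is a user-space sketch with hypothetical stand-in names (`struct vq_req_sketch`, `wait_for_reply_sketch`), not the amso1100 driver code: the wait helper itself maps a missing reply to -ENOMEM, so callers would no longer need their own `if (!reply)` check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's request object. */
struct vq_req_sketch {
	int   wait_err;   /* result of the (simulated) wait */
	void *reply_msg;  /* reply buffer, may be NULL */
};

/*
 * Hypothetical wrapper: on success *reply is guaranteed non-NULL;
 * a missing reply is reported as -ENOMEM in exactly one place.
 */
static int wait_for_reply_sketch(struct vq_req_sketch *req, void **reply)
{
	*reply = NULL;
	if (req->wait_err)
		return req->wait_err;
	if (!req->reply_msg)
		return -ENOMEM;
	*reply = req->reply_msg;
	return 0;
}
```

The trade-off is the one already noted in the message: this touches every caller, whereas the posted patch only changes two sites.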
Re: [ofa-general] error with ibv_poll_cq() call
OK, I committed my change to libmlx4 and the equivalent thing for libmthca.

 - R.
[ofa-general] Re: [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr
I don't think anyone has ever hit this bug, so it is a low priority in my view. I also noticed that if we refactored vq_wait_for_reply that we could combine a common if (!reply) { err = -ENOMEM; goto bail; } construct by guaranteeing that reply is non-null if vq_wait_for_reply returns without an error. This patch, however, is much smaller. What do you think? Well, now is a good time to merge either version of the fix. Would be nice to kill off one of the Coverity issues so I'm happy to take this. It's up to you how much effort you want to spend on this... the refactoring sounds nice but I think we're OK without it. - R. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] where to report bugs?
I'm wondering what the official mechanism is to report bugs? Just about anything I'm going to find is likely to be limited to build and installation bugs, like this one... In infiniband-diags-1.3.6/Makefile.am we have the line: INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband This is assuming that other OFED packages have been installed in the general system $PREFIX, usually /usr as $includedir should be /usr/include. But in particular, I have installed the opensm{,-devel} in an alternate location (i.e. PREFIX) and the infiniband-diags build fails with: if gcc -DHAVE_CONFIG_H -I. -I. -I. -I./include -I/usr/include -I/usr/include/infiniband -I/home/brian/ofed_1.3_integration/tree/usr/include -Wall -I/home/brian/ofed_1.3_integration/tree/usr/include -O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2 -MT src_ibnetdiscover-ibnetdiscover.o -MD -MP -MF .deps/src_ibnetdiscover-ibnetdiscover.Tpo -c -o src_ibnetdiscover-ibnetdiscover.o `test -f 'src/ibnetdiscover.c' || echo './'`src/ibnetdiscover.c; \ then mv -f .deps/src_ibnetdiscover-ibnetdiscover.Tpo .deps/src_ibnetdiscover-ibnetdiscover.Po; else rm -f .deps/src_ibnetdiscover-ibnetdiscover.Tpo; exit 1; fi In file included from src/ibnetdiscover.c:53: /home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:39:29: error: complib/cl_qmap.h: No such file or directory In file included from src/ibnetdiscover.c:53: /home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:45: error: expected specifier-qualifier-list before ‘cl_map_item_t’ /home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:51: error: expected specifier-qualifier-list before ‘cl_qmap_t’ make[1]: *** [src_ibnetdiscover-ibnetdiscover.o] Error 1 make[1]: Leaving directory `/home/brian/rpm/BUILD/infiniband-diags-1.3.6' On my system, with opensm-devel (and all other OFED RPMs) installed in an alternate PREFIX, the above list of include paths should be 
s#/usr/include/infiniband#PREFIX/include/infiniband#. It seems infiniband-diags probably needs the same --with-osm switch that ibutils has. b. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] InfiniBand/iWARP/RDMA merge plans for 2.6.26 (what's in infiniband.git)
> We want to add send with invalidate and masked compare and swap. Eli will be able to send the patches next week and since they are small I think they can be in for 2.6.26

We are very interested in these new operations and are moving in the direction of tightly integrating RDMA along with atomics (if available) into Oracle. We plan on testing some early prototypes of these in the next few months.

Send with invalidate is an exact match for our current RDS V3 rdma driver - and should be more efficient than the current background syncing of the tpt to ensure keys are invalidated.

We intend to expose the atomics via the RDS driver along with simple low level rdma operations to Oracle's internal clients. If Oracle is running over a transport which exports atomics and rdma - Oracle will see a dramatic performance boost for several database operations.

Roland Dreier wrote:
>> We want to add send with invalidate and masked compare and swap. Eli will be able to send the patches next week and since they are small I think they can be in for 2.6.26
>
> Send with invalidate should be OK. Let's see about the masked atomics stuff -- we have a ton of new verbs and I think we might want to slow down and make sure it all makes sense.
>
>> What about the split CQ for UD mode? It's improved the IPoIB performance for small messages significantly.
>
> Oh yeah... I'll try to get that in too.
>
>> mlx4- we plan to send patches for the low level driver only to enable mlx4_en. These only affect our low level driver.
>
> No problem in principle, let's see the actual patches.
>
>> I think we should try to push for XRC in 2.6.26 since there are already MPI implementations that use it and this ties them to use OFED only. Also this feature is stable and now being defined in IBTA. Not taking it causes changes between OFED and the kernel and your libibverbs, and we wish to avoid such gaps. Is there anything we can do to help and make it into 2.6.26?
>
> I don't have a good feeling that the user-kernel interface is well thought out, so I want to consider XRC + ehca LL stuff + new iWARP verbs and make sure we have something that makes sense for the future.
>
> - R.
[ofa-general] Re: [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr
On Fri, 2008-04-04 at 12:22 -0700, Roland Dreier wrote:
>> I don't think anyone has ever hit this bug, so it is a low priority in my view. I also noticed that if we refactored vq_wait_for_reply that we could combine a common
>>
>> if (!reply) { err = -ENOMEM; goto bail; }
>>
>> construct by guaranteeing that reply is non-null if vq_wait_for_reply returns without an error. This patch, however, is much smaller. What do you think?
>
> Well, now is a good time to merge either version of the fix. Would be nice to kill off one of the Coverity issues so I'm happy to take this. It's up to you how much effort you want to spend on this... the refactoring sounds nice but I think we're OK without it.

I'm up to my eyeballs right now. If it's ok with you I'd say defer the refactoring.

> - R.
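The refactoring idea mentioned above (have the wait helper guarantee a non-NULL reply whenever it returns success, so callers need only one error check) can be sketched in plain user-space C. This is only an illustration under assumptions: struct reply, wait_for_reply() and do_request() are hypothetical stand-ins and do not reflect the real vq_wait_for_reply() signature or the amso1100 data structures.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical reply object; the real driver uses its own message types. */
struct reply {
	int status;
};

/*
 * Hypothetical stand-in for the wait helper. Contract assumed here:
 * on success it returns 0 and *out is guaranteed non-NULL; on failure
 * it returns a negative errno value and *out is NULL.
 */
static int wait_for_reply(struct reply *pending, struct reply **out)
{
	*out = NULL;
	if (!pending)
		return -ENOMEM;	/* the missing reply is reported here, once */
	*out = pending;
	return 0;
}

/* Caller: a single error path, no separate "if (!reply)" check needed. */
static int do_request(struct reply *pending)
{
	struct reply *reply;
	int err;

	err = wait_for_reply(pending, &reply);
	if (err)
		return err;

	return reply->status;	/* safe: non-NULL by the helper's contract */
}
```

With a contract like this, every caller loses its private NULL check, which is presumably the duplication the refactoring would remove; the smaller patch discussed in the thread keeps the per-caller check instead.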
Re: [ofa-general] InfiniBand/iWARP/RDMA merge plans for 2.6.26 (what's in infiniband.git)
> We are very interested in these new operations and are moving in the direction of tightly integrating RDMA along with atomics (if available) into Oracle. We plan on testing some early prototypes of these in the next few months.

And you need the ConnectX-only masked atomics? Or do the standard IB atomic operations work for you? Of course using atomics at all means that things don't work on iWARP.

> Send with invalidate is an exact match for our current RDS V3 rdma driver - and should be more efficient than the current background syncing of the tpt to ensure keys are invalidated.

How does send with invalidate interact with the current IB FMR stuff? Seems that you would run into trouble keeping the state of the FMR straight if the remote side is invalidating them. Also I would think that send-with-invalidate would be much more expensive than the current FMR method of batching up the invalidates, since you don't get to amortize the cost of syncing up all the internal HCA state.

 - R.
[ofa-general] Re: [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr
> I'm up to my eyeballs right now. If it's ok with you I'd say defer the refactoring.

No problem, I'll queue this up and if you ever get time to work on amso1100 you can send the refactoring. But are you working on a pmtu fix?

 - R.
[ofa-general] Re: [PATCH 7/10] IB/ipoib: Add ethtool support
thanks, applied
Re: [ofa-general] where to report bugs?
On Fri, 2008-04-04 at 15:24 -0400, Brian J. Murrell wrote:
> I'm wondering what the official mechanism is to report bugs?

http://www.openfabrics.org/bugzilla, but that's usually used when email is insufficient and some issue needs tracking. It's up to you.

-- Hal
[ofa-general] Re: [PATCH 10/10] IB/mlx4: add support for modifying CQ parameters
thanks, I applied 8/10 and 9/10, and changed this one around a bit before applying it... it seemed cleaner to me not to expose the CQ context to the mlx4_ib driver. For CQ resize we can just add a new mlx4_cq_resize() function in mlx4_core, since the context parameters that matter there are completely different. (And there's no need for mlx4_ib to worry about either the modify moderation or resize cases)

From a1f375e52ce0b39bebaa27adc6d3724816f7e963 Mon Sep 17 00:00:00 2001
From: Eli Cohen [EMAIL PROTECTED]
Date: Mon, 17 Mar 2008 17:24:25 +0200
Subject: [PATCH] IB/mlx4: Add support for modifying CQ moderation parameters

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
 drivers/infiniband/hw/mlx4/cq.c      |    8 ++++++++
 drivers/infiniband/hw/mlx4/main.c    |    1 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h |    1 +
 drivers/net/mlx4/cq.c                |   31 +++++++++++++++++++++++++++++++
 include/linux/mlx4/cmd.h             |    2 +-
 include/linux/mlx4/cq.h              |    3 +++
 6 files changed, 45 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 7d70af7..e4fb64b 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -85,6 +85,14 @@ static struct mlx4_cqe *next_cqe_sw(struct mlx4_ib_cq *cq)
 	return get_sw_cqe(cq, cq->mcq.cons_index);
 }
 
+int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period)
+{
+	struct mlx4_ib_cq *mcq = to_mcq(cq);
+	struct mlx4_ib_dev *dev = to_mdev(cq->device);
+
+	return mlx4_cq_modify(dev->dev, &mcq->mcq, cq_count, cq_period);
+}
+
 struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector,
 				struct ib_ucontext *context,
 				struct ib_udata *udata)
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index e9330a0..76dd45c 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -609,6 +609,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 	ibdev->ib_dev.post_send		= mlx4_ib_post_send;
 	ibdev->ib_dev.post_recv		= mlx4_ib_post_recv;
 	ibdev->ib_dev.create_cq		= mlx4_ib_create_cq;
+	ibdev->ib_dev.modify_cq		= mlx4_ib_modify_cq;
 	ibdev->ib_dev.destroy_cq	= mlx4_ib_destroy_cq;
 	ibdev->ib_dev.poll_cq		= mlx4_ib_poll_cq;
 	ibdev->ib_dev.req_notify_cq	= mlx4_ib_arm_cq;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 3f8bd0a..ef8ad96 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -254,6 +254,7 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 				  struct ib_udata *udata);
 int mlx4_ib_dereg_mr(struct ib_mr *mr);
 
+int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);
 struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector,
 				struct ib_ucontext *context,
 				struct ib_udata *udata);
diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c
index d4441fe..00a270b 100644
--- a/drivers/net/mlx4/cq.c
+++ b/drivers/net/mlx4/cq.c
@@ -121,6 +121,13 @@ static int mlx4_SW2HW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox,
 			MLX4_CMD_TIME_CLASS_A);
 }
 
+static int mlx4_MODIFY_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox,
+			 int cq_num, u32 opmod)
+{
+	return mlx4_cmd(dev, mailbox->dma, cq_num, opmod, MLX4_CMD_MODIFY_CQ,
+			MLX4_CMD_TIME_CLASS_A);
+}
+
 static int mlx4_HW2SW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox,
 			 int cq_num)
 {
@@ -129,6 +136,30 @@ static int mlx4_HW2SW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox,
 			MLX4_CMD_TIME_CLASS_A);
 }
 
+int mlx4_cq_modify(struct mlx4_dev *dev, struct mlx4_cq *cq,
+		   u16 count, u16 period)
+{
+	struct mlx4_cmd_mailbox *mailbox;
+	struct mlx4_cq_context *cq_context;
+	int err;
+
+	mailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(mailbox))
+		return PTR_ERR(mailbox);
+
+	cq_context = mailbox->buf;
+	memset(cq_context, 0, sizeof *cq_context);
+
+	cq_context->cq_max_count = cpu_to_be16(count);
+	cq_context->cq_period    = cpu_to_be16(period);
+
+	err = mlx4_MODIFY_CQ(dev, mailbox, cq->cqn, 1);
+
+	mlx4_free_cmd_mailbox(dev, mailbox);
+	return err;
+}
+EXPORT_SYMBOL_GPL(mlx4_cq_modify);
+
 int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt,
 		  struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq)
 {
diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h
index 7d1eaa9..77323a7
[ofa-general] MVAPICH2 crashes on mixed fabric
Hey, all,

I'm not sure if this is a known bug or some sort of limitation I'm unaware of, but I've been building and testing with the OFED 1.3 GA release on a small fabric that has a mix of Arbel-based and newer Connect-X HCAs.

What I've discovered is that mvapich and openmpi work fine across the entire fabric, but mvapich2 crashes when I use a mix of Arbels and Connect-X. The errors vary depending on the test program but here's an example:

[EMAIL PROTECTED] IMB-3.0]$ mpirun -n 5 ./IMB-MPI1
.
.
.
(output snipped)
.
.
.
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
# ( 3 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
  #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
       0         1000         3.51         3.51         3.51         0.00
       1         1000         3.63         3.63         3.63         0.52
       2         1000         3.67         3.67         3.67         1.04
       4         1000         3.64         3.64         3.64         2.09
       8         1000         3.67         3.67         3.67         4.16
      16         1000         3.67         3.67         3.67         8.31
      32         1000         3.74         3.74         3.74        16.32
      64         1000         3.90         3.90         3.90        31.28
     128         1000         4.75         4.75         4.75        51.39
     256         1000         5.21         5.21         5.21        93.79
     512         1000         5.96         5.96         5.96       163.77
    1024         1000         7.88         7.89         7.89       247.54
    2048         1000        11.42        11.42        11.42       342.00
    4096         1000        15.33        15.33        15.33       509.49
    8192         1000        22.19        22.20        22.20       703.83
   16384         1000        34.57        34.57        34.57       903.88
   32768         1000        51.32        51.32        51.32      1217.94
   65536          640        85.80        85.81        85.80      1456.74
  131072          320       155.23       155.24       155.24      1610.40
  262144          160       301.84       301.86       301.85      1656.39
  524288           80       598.62       598.69       598.66      1670.31
 1048576           40      1175.22      1175.30      1175.26      1701.69
 2097152           20      2309.05      2309.05      2309.05      1732.32
 4194304           10      4548.72      4548.98      4548.85      1758.64
[0] Abort: Got FATAL event 3 at line 796 in file ibv_channel_manager.c
rank 0 in job 1 compute-0-0.local_36049 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

If, however, I define my mpdring to contain only Connect-X systems OR only Arbel systems, IMB-MPI1 runs to completion.

Can anyone suggest a workaround, or is this a real bug with mvapich2?
-- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania
[ofa-general] [PATCH] mmu notifier #v11
This should guarantee that nobody can register when any of the mmu notifiers is running avoiding all the races including guaranteeing range_start not to be missed. I'll adapt the other patches to provide the sleeping-feature on top of this (only needed by XPMEM) soon. KVM seems to run fine on top of this one.

Andrew can you apply this to -mm?

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1050,6 +1050,9 @@
 		unsigned long addr, unsigned long len,
 		unsigned long flags, struct page **pages);
 
+extern void mm_lock(struct mm_struct *mm);
+extern void mm_unlock(struct mm_struct *mm);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -225,6 +225,9 @@
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 	struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,175 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_ops {
+	/*
+	 * Called when nobody can register any more notifier in the mm
+	 * and after the mn notifier has been disarmed already.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * clear_flush_young is called after the VM is
+	 * test-and-clearing the young/accessed bitflag in the
+	 * pte. This way the VM will provide proper aging to the
+	 * accesses to the page through the secondary MMUs and not
+	 * only to the ones through the Linux pte.
+	 */
+	int (*clear_flush_young)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long address);
+
+	/*
+	 * Before this is invoked any secondary MMU is still ok to
+	 * read/write to the page previously pointed by the Linux pte
+	 * because the old page hasn't been freed yet. If required
+	 * set_page_dirty has to be called internally to this method.
+	 */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_start() and invalidate_range_end() must be
+	 * paired. Multiple invalidate_range_start/ends may be nested
+	 * or called concurrently.
+	 */
+	void (*invalidate_range_start)(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start, unsigned long end);
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start, unsigned long end);
+};
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+static inline int mm_has_notifiers(struct mm_struct *mm)
+{
+	return unlikely(!hlist_empty(&mm->mmu_notifier_list));
+}
+
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+extern void __mmu_notifier_release(struct mm_struct *mm);
+extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					  unsigned long address);
+extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					  unsigned long address);
+extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
+
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_release(mm);
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					  unsigned long address)
+{
+	if (mm_has_notifiers(mm))
+		return
Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?
On Fri, Apr 4, 2008 at 7:06 PM, Roland Dreier [EMAIL PROTECTED] wrote: - Don't use IB-specific features (atomics, immediate data) and don't use RNRs as a means for HW based flow control mechanism. The current RDS implementation does not have a SW based flow control but rather does some sort of back pressure through SW based congestion management. I think that to some extent it relies on RNRs which don't exist under iWARP. Or.
[ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?
On Fri, Apr 4, 2008 at 5:41 PM, Steve Wise [EMAIL PROTECTED] wrote:
> We won't be in Sonoma, but perhaps Jon can email some info to the list on what we've done to-date for open mpi.

This would be very helpful, best if done before Monday so we can discuss the RDS port there with the maintainer. Jon - any chance you will be able to send something (even a raw sketch)?

Or.
Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?
Hmmm - so what happens with an iWARP NIC when no buffer is posted on the recv q and a message arrives?

Or Gerlitz wrote:
> On Fri, Apr 4, 2008 at 7:06 PM, Roland Dreier [EMAIL PROTECTED] wrote: - Don't use IB-specific features (atomics, immediate data) and don't use RNRs as a means for HW based flow control mechanism. The current RDS implementation does not have a SW based flow control but rather does some sort of back pressure through SW based congestion management. I think that to some extent it relies on RNRs which don't exist under iWARP.
>
> Or.
Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?
How about a pointer to an IWARP spec - so we can sort out all the details.../ implications...to RDS.

Or Gerlitz wrote:
> On Fri, Apr 4, 2008 at 5:41 PM, Steve Wise [EMAIL PROTECTED] wrote: We won't be in Sonoma, but perhaps Jon can email some info to the list on what we've done to-date for open mpi. This would be very much helpful, best if done before Monday so we can discuss there the RDS port with the maintainer. Jon - any chance you will be able to send something (even raw, sketch)?
>
> Or.
Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?
On Sat, Apr 5, 2008 at 12:27 AM, Richard Frank [EMAIL PROTECTED] wrote:
> Hmmm - so what happens with IWARP NIC when no buffer is posted on recv q and a message arrives ?

I am quite sure the L2 ethernet HW just drops it, but you better verify this with an iWARP HW provider.

Or.
Re: [ofa-general] where to report bugs?
On Fri, 04 Apr 2008 15:24:28 -0400 Brian J. Murrell [EMAIL PROTECTED] wrote:
> I'm wondering what the official mechanism is to report bugs? Just about anything I'm going to find is likely to be limited to build and installation bugs, like this one...
>
> In infiniband-diags-1.3.6/Makefile.am we have the line:
>
> INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband
>
> This is assuming that other OFED packages have been installed in the general system $PREFIX, usually /usr as $includedir should be /usr/include. But in particular, I have installed the opensm{,-devel} in an alternate location (i.e. PREFIX) and the infiniband-diags build fails with:

Are you specifying --prefix on the infiniband-diags configure? I think that should work.

Ira

> if gcc -DHAVE_CONFIG_H -I. -I. -I. -I./include -I/usr/include -I/usr/include/infiniband -I/home/brian/ofed_1.3_integration/tree/usr/include -Wall -I/home/brian/ofed_1.3_integration/tree/usr/include -O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2 -MT src_ibnetdiscover-ibnetdiscover.o -MD -MP -MF .deps/src_ibnetdiscover-ibnetdiscover.Tpo -c -o src_ibnetdiscover-ibnetdiscover.o `test -f 'src/ibnetdiscover.c' || echo './'`src/ibnetdiscover.c; \
> then mv -f .deps/src_ibnetdiscover-ibnetdiscover.Tpo .deps/src_ibnetdiscover-ibnetdiscover.Po; else rm -f .deps/src_ibnetdiscover-ibnetdiscover.Tpo; exit 1; fi
> In file included from src/ibnetdiscover.c:53:
> /home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:39:29: error: complib/cl_qmap.h: No such file or directory
> In file included from src/ibnetdiscover.c:53:
> /home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:45: error: expected specifier-qualifier-list before ‘cl_map_item_t’
> /home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:51: error: expected specifier-qualifier-list before ‘cl_qmap_t’
> make[1]: *** [src_ibnetdiscover-ibnetdiscover.o] Error 1
> make[1]: Leaving directory `/home/brian/rpm/BUILD/infiniband-diags-1.3.6'
>
> On my system, with opensm-devel (and all other OFED RPMs) installed in an alternate PREFIX, the above list of include paths should be s#/usr/include/infiniband#PREFIX/include/infiniband#.
>
> It seems probably infiniband-diags needs to have the same --with-osm switch that ibutils has.
>
> b.
Re: [ofa-general] where to report bugs?
On Fri, 2008-04-04 at 13:31 -0700, Ira Weiny wrote:
> Are you specifying --prefix on the infiniband-diags configure?

Ahhh. That would have the undesired effect of relocating my infiniband-diags wherever I specify --prefix. This is not quite what I want. The ugly details are about to come out.

The problem is that I am not setting a --prefix when I build any of the prerequisite packages (i.e. opensm, the libraries it depends on, etc.) as I want everything to actually have a /usr prefix, however for the purposes of building this stack from the downloadable package of what's basically SRPMs, I install the prerequisites into a temporary path.

So I have a dir ./tree/ in which I use rpm2cpio $rpm | cpio -id to roll the packages into and then point the various configure scripts to using various --with-* options. This method has worked so far for:

SRPMS/libibcommon-1.0.8-1.ofed1.3
SRPMS/libibumad-1.1.7-1.ofed1.3
SRPMS/opensm-3.1.10-1.ofed1.3
SRPMS/ibutils-1.2-1.ofed1.3
SRPMS/libibmad-1.1.6-1.ofed1.3

The overall problem is that I cannot taint my pristine build environment by going along the normal process of build rpm, install it, build next rpm, install it, etc., so I have to install prerequisite RPMs into a sandbox and point subsequent users (in the build process) of it into the sandbox.

b.
[ofa-general] ofed works on kernels with 64Kbyte pages?
I know it's a long shot, but has anyone tried using OFED on a kernel with 64Kbyte pages? SGI would like to support that, but I've gotten reports that something is not working (e.g., ib_rdma_bw doesn't work on an ia64 kernel with 64Kb pages). This is with the mthca driver, fwiw. Unfortunately a conspiracy of h/w prevents me from reproducing this right now, so I don't have more details. But I'd be very curious to know if anyone can verify that OFED does/doesn't work with 64Kbyte pages. -- Arthur
Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?
> How about a pointer to an IWARP spec - so we can sort out all the details.../ implications...to RDS.

www.rdmaconsortium.org has most of it... the verbs are at:

http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf

the iWARP RDMA protocol is RFC 5040 et al:

http://www.ietf.org/rfc/rfc5040.txt

(the next few RFCs have lower-level details)
Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?
>> Hmmm - so what happens with IWARP NIC when no buffer is posted on recv q and a message arrives ?
>
> I am quite sure the L2 ethernet HW just drops it, but you better verify this with an iWARP HW provider.

Why would it be dropped at L2? What I believe will happen is that it will generate an error at the DDP layer that will probably result in the connection being closed. Section 7.1 of RFC 5041 says:

    For non-zero-length Untagged DDP Segments, the DDP Segment MUST be validated before Placement by verifying:

[untagged DDP segments are incoming send data, as opposed to tagged RDMA operations]

    2. The QN and MSN have an associated buffer that allows Placement of the payload.

    Implementers' note: DDP implementations SHOULD consider lack of an associated buffer as a system fault. DDP implementations MAY try to recover from the system fault using LLP means in a ULP-transparent way. DDP implementations SHOULD NOT permit system faults to occur repeatedly or frequently. If there is not an associated buffer, DDP implementations MAY choose to disable the stream for the reception and report an error to the ULP at the Data Sink.
Re: [ofa-general] ofed works on kernels with 64Kbyte pages?
> I know it's a long shot, but has anyone tried using OFED on a kernel with 64Kbyte pages? SGI would like to support that, but I've gotten reports that something is not working (e.g., ib_rdma_bw doesn't work on an ia64 kernel with 64Kb pages). This is with the mthca driver, fwiw. Unfortunately a conspiracy of h/w prevents me from reproducing this right now, so I don't have more details. But I'd be very curious to know if anyone can verify that OFED does/doesn't work with 64Kbyte pages.

I don't know about OFED, but I've tried various things on 64KB PAGE_SIZE systems and it seems to work. It wouldn't surprise me if there are issues, since the drivers and firmware get a lot less testing in such situations, but it should work -- I'd be happy to help debug if anyone has concrete problems.

 - R.
Re: [ofa-general] where to report bugs?
On Fri, 04 Apr 2008 16:43:07 -0400 Brian J. Murrell [EMAIL PROTECTED] wrote:
> On Fri, 2008-04-04 at 13:31 -0700, Ira Weiny wrote:
>> Are you specifying --prefix on the infiniband-diags configure?
>
> Ahhh. That would have the undesired effect of relocating my infiniband-diags wherever I specify --prefix. This is not quite what I want. The ugly details are about to come out.
>
> The problem is that I am not setting a --prefix when I build any of the prerequisite packages (i.e. opensm, the libraries it depends on, etc.) as I want everything to actually have a /usr prefix, however for the purposes of building this stack from the downloadable package of what's basically SRPMs, I install the prerequisites into a temporary path.
>
> So I have a dir ./tree/ in which I use rpm2cpio $rpm | cpio -id to roll the packages into and then point the various configure scripts to using various --with-* options. This method has worked so far for:
>
> SRPMS/libibcommon-1.0.8-1.ofed1.3
> SRPMS/libibumad-1.1.7-1.ofed1.3
> SRPMS/opensm-3.1.10-1.ofed1.3
> SRPMS/ibutils-1.2-1.ofed1.3
> SRPMS/libibmad-1.1.6-1.ofed1.3
>
> The overall problem is that I cannot taint my pristine build environment by going along the normal process of build rpm, install it, build next rpm, install it, etc., so I have to install prerequisite RPMs into a sandbox and point subsequent users (in the build process) of it into the sandbox.

So I guess you want something like:

export CPPFLAGS=-Isandbox_dir/include

Before you do the configure and build?

Ira
[ofa-general] Re: [PATCH 17/20] IB/ipath - user mode send DMA
By the way...

> +int ipath_user_sdma_pkt_sent(const struct ipath_user_sdma_queue *pq,
> +			     u32 counter)
> +{
> +	const u32 scounter = ipath_user_sdma_complete_counter(pq);
> +	const s32 dcounter = scounter - counter;
> +
> +	return dcounter >= 0;
> +}

I don't see this called anywhere... should I just delete it?
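For reference, the signed-difference comparison used in the quoted function can be tried out in user space. This is a minimal sketch, assuming the intended test is "difference >= 0" and using stdint types in place of the kernel's u32/s32; counter_reached() is a hypothetical name and not part of the ipath driver.

```c
#include <stdint.h>

/*
 * Wraparound-tolerant check: has the free-running 32-bit completion
 * counter reached (or passed) the value `counter`?
 *
 * The unsigned subtraction wraps modulo 2^32; reinterpreting the result
 * as signed gives the distance between the two counters as long as they
 * are less than 2^31 apart. (The unsigned-to-signed conversion is
 * implementation-defined before C23, but behaves this way on the usual
 * two's-complement targets.)
 */
static int counter_reached(uint32_t completed, uint32_t counter)
{
	int32_t delta = (int32_t)(completed - counter);

	return delta >= 0;
}
```

A plain `completed >= counter` would give the wrong answer once the completion counter wraps past zero, which is presumably why the difference is taken first.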
Re: [ofa-general] where to report bugs?
On Fri, 2008-04-04 at 14:06 -0700, Ira Weiny wrote:
> So I guess you want something like:
>
> export CPPFLAGS=-Isandbox_dir/include

CPPFLAGS or CFLAGS? I could see it being the former but I used the latter.

> Before you do the configure and build?

That is in effect exactly what I did to deal with this issue. I just didn't find it very elegant. But if that is how the package is meant to operate, that is fine.

If it were CFLAGS you were promoting the setting of I would be a bit more sticky because RPM wants to have the CFLAGS for its own use:

$ rpm --eval=%configure
CFLAGS=${CFLAGS:--O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2} ; export CFLAGS ;
CXXFLAGS=${CXXFLAGS:--O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2} ; export CXXFLAGS ;
FFLAGS=${FFLAGS:--O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2} ; export FFLAGS ;
./configure --host=x86_64-suse-linux --build=x86_64-suse-linux \
--target=x86_64-suse-linux \
--program-prefix= \
...

And while, yes, you can override CFLAGS and the %configure macro will use it, I'd rather defer the CFLAGS to whatever the vendor has put into the RPM macros file(s).

b.
[ofa-general] Re: [PATCH 19/20] IB/ipath - add calls to new 7220 code and enable in build
> +enum ib_rate ipath_mult_to_ib_rate(unsigned mult)
> +{
> +	switch (mult) {
> +	case 8:  return IB_RATE_2_5_GBPS;
> +	case 4:  return IB_RATE_5_GBPS;
> +	case 2:  return IB_RATE_10_GBPS;
> +	case 1:  return IB_RATE_20_GBPS;
> +	default: return IB_RATE_PORT_CURRENT;
> +	}
> +}

Looks suspiciously like a copy of the existing mult_to_ib_rate() except it handles fewer cases... is there a reason to copy this?

 - R.
[ofa-general] Re: [PATCH 17/20] IB/ipath - user mode send DMA
> +void ipath_user_sdma_set_complete_counter(struct ipath_user_sdma_queue *pq,
> +					  u32 c)
> +{
> +	pq->sent_counter = c;
> +}

This is only used in one file... OK to make it static?
[ofa-general] Re: [PATCH 1/1 v1] MLX4: Added resize_cq capability.
Thanks, I applied this with a lot of changes. Some comments:

>  	entries = roundup_pow_of_two(entries + 1);

your patch was corrupted in a very strange way... the context lines had two spaces instead of one at the beginning. I just deleted the extra space by hand.

> +		err = mlx4_alloc_cq_buf(dev, &cq->resize_buf->buf, entries);
> +		if (err) {
> +			spin_lock_irq(&cq->lock);
> +			kfree(cq->resize_buf);
> +			cq->resize_buf = NULL;
> +			spin_unlock_irq(&cq->lock);
> +			goto out;
> +		}

> +err_buf:
> +	if (cq->resize_buf) {
> +		if (!ibcq->uobject)
> +			mlx4_free_cq_buf(dev, &cq->resize_buf->buf,
> +					 cq->resize_buf->cqe);
> +
> +		spin_lock_irq(&cq->lock);
> +		kfree(cq->resize_buf);
> +		cq->resize_buf = NULL;
> +		spin_unlock_irq(&cq->lock);
> +	}

Why do we need the spinlock in these places? There's no way for this to race with mlx4_ib_poll_one() is there, since that should never see the RESIZE CQE? (If there is such a race, then we're in trouble even with the lock, since we're aborting the resize, and the poll code shouldn't swap the buffers)

Also I got rid of the duplicated code to allocate buffers and get userspace buffers, so that the allocate and resize paths use the same code. And I cleaned up some other stuff.

So please review/test my work to make sure I didn't break your patch...

---
 drivers/infiniband/hw/mlx4/cq.c      |  292 ++
 drivers/infiniband/hw/mlx4/main.c    |    2 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h |    9 +
 drivers/net/mlx4/cq.c                |   28
 include/linux/mlx4/cq.h              |    2 +
 5 files changed, 300 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index e4fb64b..3557e7e 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -93,6 +93,74 @@ int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period)
 	return mlx4_cq_modify(dev->dev, &mcq->mcq, cq_count, cq_period);
 }
 
+static int mlx4_ib_alloc_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *buf, int nent)
+{
+	int err;
+
+	err = mlx4_buf_alloc(dev->dev, nent * sizeof(struct mlx4_cqe),
+			     PAGE_SIZE * 2, &buf->buf);
+
+	if (err)
+		goto out;
+
+	err = mlx4_mtt_init(dev->dev, buf->buf.npages, buf->buf.page_shift,
+			    &buf->mtt);
+	if (err)
+		goto err_buf;
+
+	err = mlx4_buf_write_mtt(dev->dev, &buf->mtt, &buf->buf);
+	if (err)
+		goto err_mtt;
+
+	return 0;
+
+err_mtt:
+	mlx4_mtt_cleanup(dev->dev, &buf->mtt);
+
+err_buf:
+	mlx4_buf_free(dev->dev, nent * sizeof(struct mlx4_cqe),
+		      &buf->buf);
+
+out:
+	return err;
+}
+
+static void mlx4_ib_free_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *buf, int cqe)
+{
+	mlx4_buf_free(dev->dev, (cqe + 1) * sizeof(struct mlx4_cqe), &buf->buf);
+}
+
+static int mlx4_ib_get_cq_umem(struct mlx4_ib_dev *dev, struct ib_ucontext *context,
+			       struct mlx4_ib_cq_buf *buf, struct ib_umem **umem,
+			       u64 buf_addr, int cqe)
+{
+	int err;
+
+	*umem = ib_umem_get(context, buf_addr, cqe * sizeof (struct mlx4_cqe),
+			    IB_ACCESS_LOCAL_WRITE);
+	if (IS_ERR(*umem))
+		return PTR_ERR(*umem);
+
+	err = mlx4_mtt_init(dev->dev, ib_umem_page_count(*umem),
+			    ilog2((*umem)->page_size), &buf->mtt);
+	if (err)
+		goto err_buf;
+
+	err = mlx4_ib_umem_write_mtt(dev, &buf->mtt, *umem);
+	if (err)
+		goto err_mtt;
+
+	return 0;
+
+err_mtt:
+	mlx4_mtt_cleanup(dev->dev, &buf->mtt);
+
+err_buf:
+	ib_umem_release(*umem);
+
+	return err;
+}
+
 struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector,
 				struct ib_ucontext *context,
 				struct ib_udata *udata)
@@ -100,7 +168,6 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
 	struct mlx4_ib_dev *dev = to_mdev(ibdev);
 	struct mlx4_ib_cq *cq;
 	struct mlx4_uar *uar;
-	int buf_size;
 	int err;
 
 	if (entries < 1 || entries > dev->dev->caps.max_cqes)
@@ -112,8 +179,10 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
 
 	entries      = roundup_pow_of_two(entries + 1);
 	cq->ibcq.cqe = entries - 1;
-	buf_size     = entries * sizeof (struct mlx4_cqe);
+	mutex_init(&cq->resize_mutex);
 	spin_lock_init(&cq->lock);
+	cq->resize_buf = NULL;
+	cq->resize_umem = NULL;
 
 	if (context) {
 		struct mlx4_ib_create_cq ucmd;
@@ -123,21 +192,10 @@ struct
Re: [ofa-general] InfiniBand/iWARP/RDMA merge plans for 2.6.26 (what's in infiniband.git)
Roland Dreier wrote:
> > We are very interested in these new operations and are moving in the direction of tightly integrating RDMA along with atomics (if available) into Oracle. We plan on testing some early prototypes of these in the next few months.
>
> And you need the ConnectX-only masked atomics? Or do the standard IB atomic operations work for you? Of course using atomics at all means that things don't work on iWARP.

We specifically asked for the masked operations. Yes, this means Oracle will not get the performance boost of atomics on iWARP - but we still get RDMA - and that's a real win / benefit for Oracle today - and more so over the next few months.

> > Send with invalidate is an exact match for our current RDS V3 rdma driver - and should be more efficient than the current background syncing of the tpt to ensure keys are invalidated.
>
> How does send with invalidate interact with the current IB FMR stuff? Seems that you would run into trouble keeping the state of the FMR straight if the remote side is invalidating them.

The model we implement is based on use-once keys - we issue the key to the rdma server and want to toss it as soon as the rdma is complete. Today, we explicitly free the key after the rdma completes and we get a message from the rdma server saying the rdma is complete. If the key is auto-invalidated by the receiving HCA then we do not need to do it in the driver... which also means we do not need to issue the sync tpts to force the HCA to update its cache. At least this is how I think it works - Olaf is the divine source here.

> Also I would think that send-with-invalidate would be much more expensive than the current FMR method of batching up the invalidates, since you don't get to amortize the cost of syncing up all the internal HCA state.

This is the one piece we do not know - our plans are to test this and see where the trade-offs are. We will keep the current design / implementation to run over NICs that do not support send-with-invalidate.

> - R.
[ofa-general] Re: [PATCH 17/20] IB/ipath - user mode send DMA
On Fri, 2008-04-04 at 14:12 -0700, Roland Dreier wrote:
> By the way...
>
> > +int ipath_user_sdma_pkt_sent(const struct ipath_user_sdma_queue *pq,
> > +			     u32 counter)
> > +{
> > +	const u32 scounter = ipath_user_sdma_complete_counter(pq);
> > +	const s32 dcounter = scounter - counter;
> > +
> > +	return dcounter >= 0;
> > +}
>
> I don't see this called anywhere... should I just delete it?

Yes. You can remove it.
[ofa-general] Re: [PATCH 19/20] IB/ipath - add calls to new 7220 code and enable in build
On Fri, 2008-04-04 at 14:15 -0700, Roland Dreier wrote:
> > +enum ib_rate ipath_mult_to_ib_rate(unsigned mult)
> > +{
> > +	switch (mult) {
> > +	case 8:  return IB_RATE_2_5_GBPS;
> > +	case 4:  return IB_RATE_5_GBPS;
> > +	case 2:  return IB_RATE_10_GBPS;
> > +	case 1:  return IB_RATE_20_GBPS;
> > +	default: return IB_RATE_PORT_CURRENT;
> > +	}
> > +}
>
> Looks suspiciously like a copy of the existing mult_to_ib_rate() except it handles fewer cases... is there a reason to copy this?
>
> - R.

It looks similar but the values are reversed. This is converting the ib_rate enum to a multiplier of the DDR clock rate which is used as a counter to delay packets. So IB_RATE_2_5_GBPS is 8 times slower than IB_RATE_20_GBPS. The standard functions map the enum to a multiplier of the slowest rate so IB_RATE_2_5_GBPS is one. If I used the standard functions, I would still need a lookup table to map 8->1, 1->8, etc.
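To make the two conventions in this reply concrete, here is a small stand-alone sketch. It assumes only what the message states: the ipath multiplier counts how many times slower a rate is than 20 Gb/s, while the standard helpers count how many times faster a rate is than 2.5 Gb/s. The `enum rate` below is a stand-in with illustrative values, not the kernel's `enum ib_rate`, and all function names are hypothetical.

```c
/* Stand-in for the kernel's enum ib_rate; the values are illustrative only. */
enum rate {
	RATE_PORT_CURRENT,
	RATE_2_5_GBPS,
	RATE_5_GBPS,
	RATE_10_GBPS,
	RATE_20_GBPS
};

/* Delay multiplier (relative to the fastest rate) -> rate, as in the patch. */
static enum rate delay_mult_to_rate(unsigned mult)
{
	switch (mult) {
	case 8:  return RATE_2_5_GBPS;
	case 4:  return RATE_5_GBPS;
	case 2:  return RATE_10_GBPS;
	case 1:  return RATE_20_GBPS;
	default: return RATE_PORT_CURRENT;
	}
}

/* Speed multiplier relative to the slowest rate, so 2.5 Gb/s maps to 1. */
static unsigned rate_to_speed_mult(enum rate r)
{
	switch (r) {
	case RATE_2_5_GBPS: return 1;
	case RATE_5_GBPS:   return 2;
	case RATE_10_GBPS:  return 4;
	case RATE_20_GBPS:  return 8;
	default:            return 0;	/* unknown or "current port rate" */
	}
}

/* For these four rates the two multipliers always multiply to 8. */
static unsigned rate_to_delay_mult(enum rate r)
{
	const unsigned speed = rate_to_speed_mult(r);

	return speed ? 8 / speed : 0;
}
```

The last function is the 8->1, 1->8 lookup the reply mentions, written as a division instead of a table.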
[ofa-general] Re: [PATCH 17/20] IB/ipath - user mode send DMA
On Fri, 2008-04-04 at 14:16 -0700, Roland Dreier wrote:
> > +void ipath_user_sdma_set_complete_counter(struct ipath_user_sdma_queue *pq,
> > +					  u32 c)
> > +{
> > +	pq->sent_counter = c;
> > +}
>
> This is only used in one file... OK to make it static?

Yes, thanks.
Re: [ofa-general] ERR 0108: Unknown remote side
On Fri, Apr 04, 2008 at 10:55:21AM -0700, Hal Rosenstock wrote:
> On Fri, 2008-04-04 at 11:47 +0200, Bernd Schubert wrote:
> > Hello,
> >
> > opensm-3.2.1 logs some error messages like this:
> >
> > Apr 04 00:00:08 325114 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0 x000b8c002ba2(SW_pfs1_leaf4) port 13. Adding to light sweep sampling list
> > Apr 04 00:00:08 325126 [4580A960] 0x01 - Directed Path Dump of 3 hop path: Path = 0,1,14,13
> >
> > From ibnetdiscover output I see port13 of this switch is a switch-interconnect (sorry, I don't know what the correct name/identifier for switches within switches):
> >
> > [13]S-000b8c002bfa[13]# SW_pfs1_inter7 lid 263 4xSDR
> >
> > Apr 04 00:00:08 325219 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0 x000b8c002bf9(SW_pfs1_inter6) port 9. Adding to light sweep sampling list
> > Apr 04 00:00:08 325234 [4580A960] 0x01 - Directed Path Dump of 2 hop path: Path = 0,1,18
> >
> > This is again an interconnection:
> >
> > [9] S-000b8c002b9e[15]# SW_pfs1_leaf1 lid 177 4xDDR
> >
> > Apr 04 00:00:08 325288 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0 x000b8c002bfa(SW_pfs1_inter7) port 13. Adding to light sweep sampling list
> > Apr 04 00:00:08 325301 [4580A960] 0x01 - Directed Path Dump of 2 hop path: Path = 0,1,14
> >
> > And again an interconnection:
> >
> > [13]S-000b8c002ba2[13]# SW_pfs1_leaf4 lid 182 4xDDR
> >
> > All the other interconnections seem to be fine.
>
> Any idea if OpenSM 3.1.10 has the same issue as 3.2.1 ?

Yes, from the log file I see these messages also did happen with opensm-3.1.10.

> Is this some large Flextronics switch ?

Again you are right, this is a Flextronics F-X430075, presently with 144 ports.

Thanks,
Bernd
[ofa-general] Re: [PATCH 19/20] IB/ipath - add calls to new 7220 code and enable in build
> It looks similar but the values are reversed. This is converting the ib_rate enum to a multiplier of the DDR clock rate which is used as a counter to delay packets. So IB_RATE_2_5_GBPS is 8 times slower than IB_RATE_20_GBPS. The standard functions map the enum to a multiplier of the slowest rate so IB_RATE_2_5_GBPS is one. If I used the standard functions, I would still need a lookup table to map 8->1, 1->8, etc.

OK, got it thanks
Re: [ofa-general] linux-next: infiniband build failure
Hi Roland,

On Fri, 04 Apr 2008 08:47:29 -0700 Roland Dreier [EMAIL PROTECTED] wrote:
> > drivers/infiniband/hw/ehca/ehca_reqs.c: In function 'ehca_write_swqe':
> > drivers/infiniband/hw/ehca/ehca_reqs.c:191: error: 'const struct ib_send_wr' has no member named 'imm_data'
>
> Oops, thanks, I forgot to run my cross-compile (and ehca is ppc only).
>
> Anyway, your fix is correct and I rolled it into my patch.

Thanks.

-- 
Cheers,
Stephen Rothwell [EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/
[ofa-general] Re: [PATCH] mthca: update QP state after query QP
thanks, applied
[ofa-general] Re: [PATCH] mlx4: update QP state after query QP
thanks, applied
[ofa-general] Re: [PATCH] mmu notifier #v11
I am always the guy doing the cleanup after Andrea it seems. Sigh.

Here is the mm_lock/mm_unlock logic separated out for easier review. Adds some comments. Still objectionable is the multiple ways of invalidating pages in #v11. Callout now has similar locking to emm.

From: Christoph Lameter [EMAIL PROTECTED]
Subject: mm_lock: Lock a process against reclaim

Provide a way to lock an mm_struct against reclaim (try_to_unmap etc). This is necessary for the invalidate notifier approaches so that they can reliably add and remove a notifier.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm.h |   10
 mm/mmap.c          |   66
 2 files changed, 76 insertions(+)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-04-02 11:41:47.741678873 -0700
+++ linux-2.6/include/linux/mm.h	2008-04-04 15:02:17.660504756 -0700
@@ -1050,6 +1050,16 @@ extern int install_special_mapping(struc
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags, struct page **pages);
 
+/*
+ * Locking and unlocking an mm against reclaim.
+ *
+ * mm_lock will take mmap_sem writably (to prevent additional vmas from being
+ * added) and then take all mapping locks of the existing vmas. With that
+ * reclaim is effectively stopped.
+ */
+extern void mm_lock(struct mm_struct *mm);
+extern void mm_unlock(struct mm_struct *mm);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-04-04 14:55:03.477593980 -0700
+++ linux-2.6/mm/mmap.c	2008-04-04 14:59:05.505395402 -0700
@@ -2242,3 +2242,69 @@ int install_special_mapping(struct mm_st
 
 	return 0;
 }
+
+static void mm_lock_unlock(struct mm_struct *mm, int lock)
+{
+	struct vm_area_struct *vma;
+	spinlock_t *i_mmap_lock_last, *anon_vma_lock_last;
+
+	i_mmap_lock_last = NULL;
+	for (;;) {
+		spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->vm_file && vma->vm_file->f_mapping &&
+			    (unsigned long) i_mmap_lock >
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock &&
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock >
+			    (unsigned long) i_mmap_lock_last)
+				i_mmap_lock =
+					&vma->vm_file->f_mapping->i_mmap_lock;
+		if (i_mmap_lock == (spinlock_t *) -1UL)
+			break;
+		i_mmap_lock_last = i_mmap_lock;
+		if (lock)
+			spin_lock(i_mmap_lock);
+		else
+			spin_unlock(i_mmap_lock);
+	}
+
+	anon_vma_lock_last = NULL;
+	for (;;) {
+		spinlock_t *anon_vma_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->anon_vma &&
+			    (unsigned long) anon_vma_lock >
+			    (unsigned long) &vma->anon_vma->lock &&
+			    (unsigned long) &vma->anon_vma->lock >
+			    (unsigned long) anon_vma_lock_last)
+				anon_vma_lock = &vma->anon_vma->lock;
+		if (anon_vma_lock == (spinlock_t *) -1UL)
+			break;
+		anon_vma_lock_last = anon_vma_lock;
+		if (lock)
+			spin_lock(anon_vma_lock);
+		else
+			spin_unlock(anon_vma_lock);
+	}
+}
+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults. The holder
+ * must not hold any mm related lock. A single task can't take more
+ * than one mm lock in a row or it would deadlock.
+ */
+void mm_lock(struct mm_struct * mm)
+{
+	down_write(&mm->mmap_sem);
+	mm_lock_unlock(mm, 1);
+}
+
+void mm_unlock(struct mm_struct *mm)
+{
+	mm_lock_unlock(mm, 0);
+	up_write(&mm->mmap_sem);
+}
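For anyone reviewing the loop structure in mm_lock_unlock(): as I read it, each pass scans all vmas for the lowest lock address that is still above the one handled last, so every distinct lock is visited exactly once and always in ascending address order. Below is a user-space sketch of that scan under those assumptions. `struct fake_lock` and the function name are made up for illustration, no real locking is done, and the cost is one full scan per distinct lock.

```c
#include <stddef.h>
#include <stdint.h>

/* Made-up stand-in for a spinlock; it only records what the scan did. */
struct fake_lock {
	int held;	/* 1 after a "lock" pass, 0 after an "unlock" pass */
	unsigned seq;	/* position in which the scan reached this lock */
};

/*
 * Visit every distinct lock referenced from refs[0..n) in ascending
 * address order.  Entries may be NULL or repeated, just as several
 * vmas may share one lock or have none.  Returns the number of
 * distinct locks visited.
 */
static size_t visit_locks_in_address_order(struct fake_lock **refs,
					   size_t n, int lock)
{
	uintptr_t last = 0;	/* address handled by the previous pass */
	size_t visited = 0;

	for (;;) {
		struct fake_lock *next = NULL;
		size_t i;

		/* Find the smallest address strictly greater than `last`. */
		for (i = 0; i < n; i++) {
			const uintptr_t addr = (uintptr_t)refs[i];

			if (refs[i] && addr > last &&
			    (!next || addr < (uintptr_t)next))
				next = refs[i];
		}
		if (!next)
			break;	/* nothing above `last` is left */

		last = (uintptr_t)next;
		next->held = lock;
		next->seq = (unsigned)++visited;
	}
	return visited;
}
```

Taking locks in one global order (here, by address) is the usual way to keep two tasks that need overlapping lock sets from deadlocking against each other.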
Re: [ofa-general] [PATCH 2 of 2] mlx4: update module version and release date (for 2.6.25)
thanks, applied both this and mthca equivalent
[ofa-general] XmtDiscards
Hello,

after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten much better there: at least no further RcvSwRelayErrors, even when the cluster is in idle state, and so far also no SymbolErrors, which we have also seen before.

However, after I just started a lustre stress test on 50 clients (to a lustre storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about 9000 XmtDiscards within 30 minutes.

Searching for this error I find "This is a symptom of congestion and may require tweaking either HOQ or switch lifetime values". Well, I have to admit I neither know what HOQ is, nor do I know how to tweak it. I also have no idea how to set switch lifetime values. I guess this isn't related to the opensm timeout option, is it?

Hmm, I just found a Cisco PDF describing how to set the lifetime on these switches, but is this also possible on Flextronics switches?

Thanks for any help,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH
[ofa-general] Re: [PATCH] mlx4: make firmware diagnostic counters available via sysfs
> +int mlx4_query_diag_counters(struct mlx4_dev *dev, int array_length,
> +			     int in_modifier, unsigned int in_offset[],
> +			     u32 counter_out[])
> +{
> +	struct mlx4_cmd_mailbox *mailbox;
> +	u32 *outbox;
> +	u32 op_modifer = (u32)in_modifier;

This coding style looks strange to me... you have an int parameter in_modifier that is not used for anything except to assign it to a u32 op_modifer [sic] variable with a (u32) cast that doesn't do anything. Why not just have op_modifier be the parameter in the first place?

Also the array_length stuff looks kind of funny since you only ever pass in a value of 1... why not just pass in int offset and u32 *counter?

> +	/* clear counters file, can't read it */
> +	if(offset < 0)
> +		return sprintf(buf,"This file is write only\n");

Why not just set the permissions on the file so it can't be opened for reading? This just looks like a recipe for making userspace code go crazy on unexpected input. Also watch out for the space in "if (".

And if I'm understanding correctly, you use a magic offset of -1 for the clear_diag attribute that makes mlx4_query_diag_counters() read before the beginning of the output mailbox.

> +err_diag:
> +	ib_unregister_device(&ibdev->ib_dev);
> +
>  err_reg:
>  	ib_unregister_device(&ibdev->ib_dev);

This doesn't look like a good idea.

 - R.
RE: [ofa-general] XmtDiscards
Hi Bernd,

You can configure the HOQ (Head-Of-Queue-Lifetime) value programmed in any switch in the fabric managed by OpenSM following these simple steps:

1. Stop the SM
   /etc/init.d/opensmd stop
2. Run the SM manually with the -c option (to dump its default configuration to a file)
   opensm -c
3. Kill the SM with ^C
4. The configuration is saved in /var/cache/opensm/opensm.opts. Open the file and look for head_of_queue_lifetime. Change the value and save the file.
5. Restart the SM
   /etc/init.d/opensmd start

P.S. You might find 'opensm -h' and 'man opensm' useful.

Hope this helps,

Boris Shpolyansky
Sr. Member of Technical Staff
Applications
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Bernd Schubert
Sent: Friday, April 04, 2008 3:13 PM
To: OpenIB
Subject: [ofa-general] XmtDiscards

Hello,

after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten much better there, at least no further RcvSwRelayErrors, even when the cluster is in idle state and so far also no SymbolErrors, which we have also seen before.

However, after I just started a lustre stress test on 50 clients (to a lustre storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about 9000 XmtDiscards within 30 minutes.

Searching for this error I find "This is a symptom of congestion and may require tweaking either HOQ or switch lifetime values". Well, I have to admit I neither know what HOQ is, nor do I know how to tweak it. I also do not have an idea how to set switch lifetime values. I guess this isn't related to the opensm timeout option, is it?

Hmm, I just found a Cisco PDF describing how to set the lifetime on these switches, but is this also possible on Flextronics switches?

Thanks for any help,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH
Re: [ofa-general] XmtDiscards
On Sat, 5 Apr 2008 00:12:39 +0200 Bernd Schubert [EMAIL PROTECTED] wrote:

> Hello,
>
> after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten much better there, at least no further RcvSwRelayErrors, even when the cluster is in idle state and so far also no SymbolErrors, which we have also seen before.
>
> However, after I just started a lustre stress test on 50 clients (to a lustre storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about 9000 XmtDiscards within 30 minutes.

Yea, those are bad.

> Searching for this error I find "This is a symptom of congestion and may require tweaking either HOQ or switch lifetime values". Well, I have to admit I neither know what HOQ is, nor do I know how to tweak it. I also do not have an idea how to set switch lifetime values. I guess this isn't related to the opensm timeout option, is it?

Yes you should adjust these values.

> Hmm, I just found a Cisco PDF describing how to set the lifetime on these switches, but is this also possible on Flextronics switches?

I don't know about the Vendor SMs but in opensm look for the following options in the opensm.opts file (Default path is: /var/cache/opensm):

# The code of maximal time a packet can wait at the head of
# transmission queue.
# The actual time is 4.096usec * 2^head_of_queue_lifetime
# The value 0x14 disables this mechanism
head_of_queue_lifetime 0x12

# The maximal time a packet can wait at the head of queue on
# switch port connected to a CA or router port
leaf_head_of_queue_lifetime 0x0c

Ira
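As a worked example of the formula in the quoted opensm.opts comment (4.096 usec * 2^head_of_queue_lifetime), here is a small C sketch. The function name `hoq_lifetime_ns` is made up for illustration, and treating every code at or above 0x14 as "disabled" is an assumption: the comment only mentions 0x14 itself. With the defaults shown, 0x12 works out to 2^30 ns (about 1.07 s) and 0x0c to 2^24 ns (about 16.8 ms).

```c
#include <stdint.h>

/*
 * Convert a head_of_queue_lifetime code into nanoseconds using the
 * formula from the opensm.opts comment: 4.096 usec * 2^code.
 * 4.096 usec is exactly 4096 ns, so the result is 4096 << code.
 *
 * Returns 0 when the mechanism is disabled.  Assumption: codes above
 * 0x14 are treated like 0x14, which the quoted comment does not say.
 */
static uint64_t hoq_lifetime_ns(unsigned code)
{
	if (code >= 0x14)
		return 0;	/* no head-of-queue limit */

	return (uint64_t)4096 << code;
}
```

Each step of the code doubles the time, so moving a leaf port from 0x0c to 0x0d, for instance, takes it from roughly 16.8 ms to roughly 33.6 ms.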
[ofa-general] [patch 00/10] [RFC] EMM Notifier V3
V2->V3:
- Fix rcu issues
- Fix emm_referenced handling
- Use Andrea's mm_lock/unlock to prevent registration races.
- Keep simple API since there does not seem to be a need to add additional callbacks (mm_lock does not require callbacks like emm_start/stop that I envisioned).
- Reduce CC list (the volume we are producing here must be annoying...).

V1->V2:
- Additional optimizations in the VM
- Convert vm spinlocks to rw sems.
- Add XPMEM driver (requires sleeping in callbacks)
- Add XPMEM example

This patch implements a simple callback for device drivers that establish their own references to pages (KVM, GRU, XPmem, RDMA/Infiniband, DMA engines etc). These references are unknown to the VM (therefore external).

With these callbacks it is possible for the device driver to release external references when the VM requests it. This enables swapping, page migration and allows support of remapping, permission changes etc etc for the externally mapped memory.

With this functionality it becomes also possible to avoid pinning or mlocking pages (commonly done to stop the VM from unmapping device mapped pages).

A device driver must subscribe to a process using

	emm_register_notifier(struct emm_notifier *, struct mm_struct *)

The VM will then perform callbacks for operations that unmap or change permissions of pages in that address space.

When the process terminates the callback function is called with emm_release.

Callbacks are performed before and after the unmapping action of the VM.

	emm_invalidate_start	before
	emm_invalidate_end	after

The device driver must hold off establishing new references to pages in the range specified between a callback with emm_invalidate_start and the subsequent call with emm_invalidate_end set. This allows the VM to ensure that no concurrent driver actions are performed on an address range while performing remapping or unmapping operations.

This patchset contains additional modifications needed to ensure that the callbacks can sleep. For that purpose two key locks in the vm need to be converted to rw_sems. These patches are brand new, invasive and need extensive discussion and evaluation.

The first patch alone may be applied if callbacks in atomic context are sufficient for a device driver (likely the case for KVM and GRU and simple DMA drivers).

Following the VM modifications is the XPMEM device driver that allows sharing of memory between processes running on different instances of Linux. This is also a prototype. It is known to run trivial sample programs included as the last patch.

-- 
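To illustrate the hold-off rule between emm_invalidate_start and emm_invalidate_end, here is a deliberately simplified, single-threaded user-space model. Everything in it is hypothetical: the structure, the function names and the callback shapes are not taken from the EMM patches, it ignores the address range and treats the whole address space as one region, and a real driver would need its own locking around this state.

```c
#include <stdbool.h>

/* Hypothetical per-driver bookkeeping; not part of the EMM patch set. */
struct ext_mapper {
	int invalidations_in_flight;	/* starts not yet matched by an end */
	int references;			/* external page references held */
};

/* Model of the "before" callback: drop references, then block new ones. */
static void on_invalidate_start(struct ext_mapper *m)
{
	m->references = 0;
	m->invalidations_in_flight++;
}

/* Model of the "after" callback: new references may be set up again. */
static void on_invalidate_end(struct ext_mapper *m)
{
	m->invalidations_in_flight--;
}

/*
 * Try to establish one new external reference.  Returns false while an
 * invalidation is in progress; the caller is expected to retry later.
 */
static bool try_establish_reference(struct ext_mapper *m)
{
	if (m->invalidations_in_flight > 0)
		return false;

	m->references++;
	return true;
}
```

A counter rather than a flag is used so that overlapping invalidations only re-enable mapping once the last one has ended.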
[ofa-general] [patch 01/10] emm: mm_lock: Lock a process against reclaim
Provide a way to lock an mm_struct against reclaim (try_to_unmap etc). This is necessary for the invalidate notifier approaches so that they can reliably add and remove a notifier.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm.h |   10
 mm/mmap.c          |   66
 2 files changed, 76 insertions(+)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-04-02 11:41:47.741678873 -0700
+++ linux-2.6/include/linux/mm.h	2008-04-04 15:02:17.660504756 -0700
@@ -1050,6 +1050,16 @@ extern int install_special_mapping(struc
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags, struct page **pages);
 
+/*
+ * Locking and unlocking an mm against reclaim.
+ *
+ * mm_lock will take mmap_sem writably (to prevent additional vmas from being
+ * added) and then take all mapping locks of the existing vmas. With that
+ * reclaim is effectively stopped.
+ */
+extern void mm_lock(struct mm_struct *mm);
+extern void mm_unlock(struct mm_struct *mm);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-04-04 14:55:03.477593980 -0700
+++ linux-2.6/mm/mmap.c	2008-04-04 14:59:05.505395402 -0700
@@ -2242,3 +2242,69 @@ int install_special_mapping(struct mm_st
 
 	return 0;
 }
+
+static void mm_lock_unlock(struct mm_struct *mm, int lock)
+{
+	struct vm_area_struct *vma;
+	spinlock_t *i_mmap_lock_last, *anon_vma_lock_last;
+
+	i_mmap_lock_last = NULL;
+	for (;;) {
+		spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->vm_file && vma->vm_file->f_mapping &&
+			    (unsigned long) i_mmap_lock >
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock &&
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock >
+			    (unsigned long) i_mmap_lock_last)
+				i_mmap_lock =
+					&vma->vm_file->f_mapping->i_mmap_lock;
+		if (i_mmap_lock == (spinlock_t *) -1UL)
+			break;
+		i_mmap_lock_last = i_mmap_lock;
+		if (lock)
+			spin_lock(i_mmap_lock);
+		else
+			spin_unlock(i_mmap_lock);
+	}
+
+	anon_vma_lock_last = NULL;
+	for (;;) {
+		spinlock_t *anon_vma_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->anon_vma &&
+			    (unsigned long) anon_vma_lock >
+			    (unsigned long) &vma->anon_vma->lock &&
+			    (unsigned long) &vma->anon_vma->lock >
+			    (unsigned long) anon_vma_lock_last)
+				anon_vma_lock = &vma->anon_vma->lock;
+		if (anon_vma_lock == (spinlock_t *) -1UL)
+			break;
+		anon_vma_lock_last = anon_vma_lock;
+		if (lock)
+			spin_lock(anon_vma_lock);
+		else
+			spin_unlock(anon_vma_lock);
+	}
+}
+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults. The holder
+ * must not hold any mm related lock. A single task can't take more
+ * than one mm lock in a row or it would deadlock.
+ */
+void mm_lock(struct mm_struct * mm)
+{
+	down_write(&mm->mmap_sem);
+	mm_lock_unlock(mm, 1);
+}
+
+void mm_unlock(struct mm_struct *mm)
+{
+	mm_lock_unlock(mm, 0);
+	up_write(&mm->mmap_sem);
+}

-- 
[ofa-general] [patch 06/10] emm: Convert anon_vma lock to rw_sem and refcount
Convert the anon_vma spinlock to a rw semaphore. This allows concurrent traversal of reverse maps for try_to_unmap and page_mkclean. It also allows the calling of sleeping functions from reverse map traversal. An additional complication is that rcu is used in some context to guarantee the presence of the anon_vma while we acquire the lock. We cannot take a semaphore within an rcu critical section. Add a refcount to the anon_vma structure which allow us to give an existence guarantee for the anon_vma structure independent of the spinlock or the list contents. The refcount can then be taken within the RCU section. If it has been taken successfully then the refcount guarantees the existence of the anon_vma. The refcount in anon_vma also allows us to fix a nasty issue in page migration where we fudged by using rcu for a long code path to guarantee the existence of the anon_vma. The refcount in general allows a shortening of RCU critical sections since we can do a rcu_unlock after taking the refcount. This is particularly relevant if the anon_vma chains contain hundreds of entries. Issues: - Atomic overhead increases in situations where a new reference to the anon_vma has to be established or removed. Overhead also increases when a speculative reference is used (try_to_unmap, page_mkclean, page migration). There is also the more frequent processor change due to up_xxx letting waiting tasks run first. This results in f.e. the Aim9 brk performance test to got down by 10-15%. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] --- include/linux/rmap.h | 20 --- mm/migrate.c | 26 ++--- mm/mmap.c| 28 +- mm/rmap.c| 53 +-- 4 files changed, 73 insertions(+), 54 deletions(-) Index: linux-2.6/include/linux/rmap.h === --- linux-2.6.orig/include/linux/rmap.h 2008-04-04 15:09:45.403759876 -0700 +++ linux-2.6/include/linux/rmap.h 2008-04-04 15:16:54.318714568 -0700 @@ -25,7 +25,8 @@ * pointing to this anon_vma once its vma list is empty. 
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;/* Serialize access to vma list */
 	struct list_head head;	/* List of private related vmas */
 };
 
@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+	atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->refcount))
+		anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 }
 
 /*
Index: linux-2.6/mm/migrate.c
===
--- linux-2.6.orig/mm/migrate.c	2008-04-04 15:09:45.443760619 -0700
+++ linux-2.6/mm/migrate.c	2008-04-04 15:16:54.318714568 -0700
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
 		return;
 
 	/*
-	 * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+	 * We hold either the mmap_sem lock or a reference on the
+	 * anon_vma. So no need to call page_lock_anon_vma.
	 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	down_read(&anon_vma->sem);
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&anon_vma->lock);
+	up_read(&anon_vma->sem);
 }
 
 /*
@@ -623,7 +624,7 @@ static int unmap_and_move(new_page_t get
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
-	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
 	int charge = 0;
 
 	if (!newpage)
@@ -647,16 +648,14 @@ static int unmap_and_move(new_page_t get
 	}
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * we cannot notice that anon_vma is freed while we migrate a page.
 	 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 	 * of migration. File cache pages are no
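The get/put pair added in this patch follows a common reference-counting pattern. The sketch below is a minimal userspace illustration with C11 atomics, not the kernel code. The names demo_obj, demo_alloc, demo_tryget and demo_put are hypothetical, and the try-get loop is only an assumption about how a speculative reference could be taken safely (grab_anon_vma() is merely declared in the hunk shown, so its body is not reproduced here).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a refcounted object such as anon_vma. */
struct demo_obj {
	atomic_int refcount;	/* number of holders keeping the object alive */
	int *freed;		/* set to 1 on release, for demonstration only */
};

/* Allocate an object that starts with one reference held by the caller. */
static struct demo_obj *demo_alloc(int *freed)
{
	struct demo_obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	atomic_init(&o->refcount, 1);
	o->freed = freed;
	*freed = 0;
	return o;
}

/*
 * Speculative get: succeeds only while the count is still nonzero, so a
 * caller can never resurrect an object that is already being torn down.
 */
static bool demo_tryget(struct demo_obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; the last holder frees the object. */
static void demo_put(struct demo_obj *o)
{
	if (atomic_fetch_sub(&o->refcount, 1) == 1) {
		*o->freed = 1;
		free(o);
	}
}
```

Once demo_tryget() has succeeded, the holder no longer depends on any surrounding critical section to keep the object alive, which is the property the changelog relies on to shorten the RCU sections.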
[ofa-general] [patch 04/10] emm: Convert i_mmap_lock to i_mmap_sem
The conversion to a rwsem allows callbacks during rmap traversal for files in a non atomic context. A rw style lock also allows concurrent walking of the reverse map. This is fairly straightforward if one removes pieces of the resched checking. [Restarting unmapping is an issue to be discussed]. This slightly increases Aim9 performance results on an 8p. Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED] Signed-off-by: Christoph Lameter [EMAIL PROTECTED] --- arch/x86/mm/hugetlbpage.c |4 ++-- fs/hugetlbfs/inode.c |4 ++-- fs/inode.c|2 +- include/linux/fs.h|2 +- include/linux/mm.h|2 +- kernel/fork.c |4 ++-- mm/filemap.c |8 mm/filemap_xip.c |4 ++-- mm/fremap.c |4 ++-- mm/hugetlb.c | 10 +- mm/memory.c | 29 + mm/migrate.c |4 ++-- mm/mmap.c | 43 ++- mm/mremap.c |4 ++-- mm/rmap.c | 20 +--- 15 files changed, 66 insertions(+), 78 deletions(-) Index: linux-2.6/arch/x86/mm/hugetlbpage.c === --- linux-2.6.orig/arch/x86/mm/hugetlbpage.c2008-04-02 11:41:47.601676490 -0700 +++ linux-2.6/arch/x86/mm/hugetlbpage.c 2008-04-04 15:09:11.715211829 -0700 @@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str if (!vma_shareable(vma, addr)) return; - spin_lock(mapping-i_mmap_lock); + down_read(mapping-i_mmap_sem); vma_prio_tree_foreach(svma, iter, mapping-i_mmap, idx, idx) { if (svma == vma) continue; @@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str put_page(virt_to_page(spte)); spin_unlock(mm-page_table_lock); out: - spin_unlock(mapping-i_mmap_lock); + up_read(mapping-i_mmap_sem); } /* Index: linux-2.6/fs/hugetlbfs/inode.c === --- linux-2.6.orig/fs/hugetlbfs/inode.c 2008-04-02 11:41:47.605676583 -0700 +++ linux-2.6/fs/hugetlbfs/inode.c 2008-04-04 15:09:11.743212273 -0700 @@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino pgoff = offset PAGE_SHIFT; i_size_write(inode, offset); - spin_lock(mapping-i_mmap_lock); + down_read(mapping-i_mmap_sem); if (!prio_tree_empty(mapping-i_mmap)) hugetlb_vmtruncate_list(mapping-i_mmap, pgoff); - spin_unlock(mapping-i_mmap_lock); + 
up_read(mapping-i_mmap_sem); truncate_hugepages(inode, offset); return 0; } Index: linux-2.6/fs/inode.c === --- linux-2.6.orig/fs/inode.c 2008-04-02 11:41:47.613676625 -0700 +++ linux-2.6/fs/inode.c2008-04-04 15:09:11.755212477 -0700 @@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode INIT_LIST_HEAD(inode-i_devices); INIT_RADIX_TREE(inode-i_data.page_tree, GFP_ATOMIC); rwlock_init(inode-i_data.tree_lock); - spin_lock_init(inode-i_data.i_mmap_lock); + init_rwsem(inode-i_data.i_mmap_sem); INIT_LIST_HEAD(inode-i_data.private_list); spin_lock_init(inode-i_data.private_lock); INIT_RAW_PRIO_TREE_ROOT(inode-i_data.i_mmap); Index: linux-2.6/include/linux/fs.h === --- linux-2.6.orig/include/linux/fs.h 2008-04-02 11:41:47.621676899 -0700 +++ linux-2.6/include/linux/fs.h2008-04-04 15:09:11.755212477 -0700 @@ -503,7 +503,7 @@ struct address_space { unsigned inti_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_headi_mmap_nonlinear;/*list VM_NONLINEAR mappings */ - spinlock_t i_mmap_lock;/* protect tree, count, list */ + struct rw_semaphore i_mmap_sem; /* protect tree, count, list */ unsigned inttruncate_count; /* Cover race condition with truncate */ unsigned long nrpages;/* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h 2008-04-04 15:09:11.687211361 -0700 +++ linux-2.6/include/linux/mm.h2008-04-04 15:09:45.883767696 -0700 @@ -716,7 +716,7 @@ struct zap_details { struct address_space *check_mapping;/* Check page-mapping if set */ pgoff_t first_index;/* Lowest page-index to unmap */ pgoff_t last_index; /* Highest page-index to unmap */ - spinlock_t *i_mmap_lock;
[ofa-general] [patch 10/10] xpmem: Simple example
A simple test program (well actually a pair). They are fairly easy to use. NOTE: the xpmem.h is copied from the kernel/drivers/misc/xp/xpmem.h file. Type make. Then from one session, type ./A1. Grab the first line of output which should begin with ./A2 and paste the whole line into a second session. Paste as many times as you like. Each pass will increment the value one additional time. When you are tired, hit enter in the first window. You should see the same value printed from A1 as you most recently received from A2. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] --- xpmem_test/A1.c | 64 + xpmem_test/A2.c | 70 xpmem_test/Makefile | 14 + xpmem_test/xpmem.h | 130 4 files changed, 278 insertions(+) Index: linux-2.6/xpmem_test/A1.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6/xpmem_test/A1.c 2008-04-04 15:09:11.955215737 -0700 @@ -0,0 +1,64 @@ +/* + * Simple test program. Makes a segment then waits for an input line + * and finally prints the value of the first integer of that segment. 
+ */ + +#include errno.h +#include fcntl.h +#include stdio.h +#include stdlib.h +#include stropts.h +#include sys/mman.h +#include sys/stat.h +#include sys/types.h +#include unistd.h + +#include xpmem.h + +int xpmem_fd; + +int +main(int argc, char **argv) +{ + char input[32]; + struct xpmem_cmd_make make_info; + int *data_block; + int ret; + __s64 segid; + + xpmem_fd = open(/dev/xpmem, O_RDWR); + if (xpmem_fd == -1) { + perror(Opening /dev/xpmem); + return -1; + } + + data_block = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, 0, 0); + if (data_block == MAP_FAILED) { + perror(Creating mapping.); + return -1; + } + data_block[0] = 1; + + make_info.vaddr = (__u64) data_block; + make_info.size = getpagesize(); + make_info.permit_type = XPMEM_PERMIT_MODE; + make_info.permit_value = (__u64) 0600; + ret = ioctl(xpmem_fd, XPMEM_CMD_MAKE, make_info); + if (ret != 0) { + perror(xpmem_make); + return -1; + } + + segid = make_info.segid; + printf(./A2 %d %d %d %d\ndata_block[0] = %d\n, + (int)(segid 48 0x), (int)(segid 32 0x), + (int)(segid 16 0x), (int)(segid 0x), + data_block[0]); + printf(Waiting for input before exiting.\n); + fscanf(stdin, %s, input); + + printf(data_block[0] = %d\n, data_block[0]); + + return 0; +} Index: linux-2.6/xpmem_test/A2.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6/xpmem_test/A2.c 2008-04-04 15:09:11.955215737 -0700 @@ -0,0 +1,70 @@ +/* + * Simple test program that gets then attaches an xpmem segment identified + * on the command line then increments the first integer of that buffer by + * one and exits. 
+ */ + +#include errno.h +#include fcntl.h +#include stdio.h +#include stdlib.h +#include stropts.h +#include sys/mman.h +#include sys/stat.h +#include sys/types.h +#include unistd.h + +#include xpmem.h + +int xpmem_fd; + +int +main(int argc, char **argv) +{ + int ret; + __s64 segid; + __s64 apid; + struct xpmem_cmd_get get_info; + struct xpmem_cmd_attach attach_info; + int *attached_buffer; + + xpmem_fd = open(/dev/xpmem, O_RDWR); + if (xpmem_fd == -1) { + perror(Opening /dev/xpmem); + return -1; + } + + segid = (__s64) atoi(argv[1]) 48; + segid |= (__s64) atoi(argv[2]) 32; + segid |= (__s64) atoi(argv[3]) 16; + segid |= (__s64) atoi(argv[4]); + get_info.segid = segid; + get_info.flags = XPMEM_RDWR; + get_info.permit_type = XPMEM_PERMIT_MODE; + get_info.permit_value = (__u64) NULL; + ret = ioctl(xpmem_fd, XPMEM_CMD_GET, get_info); + if (ret != 0) { + perror(xpmem_get); + return -1; + } + apid = get_info.apid; + + attach_info.apid = get_info.apid; + attach_info.offset = 0; + attach_info.size = getpagesize(); + attach_info.vaddr = (__u64) NULL; + attach_info.fd = xpmem_fd; + attach_info.flags = 0; + + ret = ioctl(xpmem_fd, XPMEM_CMD_ATTACH, attach_info); + if (ret != 0) { + perror(xpmem_attach); + return -1; + } + + attached_buffer = (int *)attach_info.vaddr; + attached_buffer[0]++; + + printf(Just incremented the value to %d\n, attached_buffer[0]); + return 0; +} Index: linux-2.6/xpmem_test/Makefile
[ofa-general] [patch 02/10] emm: notifier logic
This patch implements a simple callback for device drivers that establish
their own references to pages (KVM, GRU, XPmem, RDMA/Infiniband, DMA engines
etc). These references are unknown to the VM (therefore external).

With these callbacks it is possible for the device driver to release external
references when the VM requests it. This enables swapping, page migration and
allows support of remapping, permission changes etc. for externally
mapped memory.

With this functionality it also becomes possible to avoid pinning or mlocking
pages (commonly done to stop the VM from unmapping device mapped pages).

A device driver must subscribe to a process using

	emm_register_notifier(struct emm_notifier *, struct mm_struct *)

The VM will then perform callbacks for operations that unmap or change
permissions of pages in that address space. When the process terminates
the callback function is called with emm_release.

Callbacks are performed before and after the unmapping action of the VM.

	emm_invalidate_start	before
	emm_invalidate_end	after

The device driver must hold off establishing new references to pages
in the range specified between a callback with emm_invalidate_start and
the subsequent call with emm_invalidate_end set. This allows the VM to
ensure that no concurrent driver actions are performed on an address
range while performing remapping or unmapping operations.

Callbacks are mostly performed in a non-atomic context. However, in
various places spinlocks are held to traverse rmaps. So this patch
is only useful for those devices that can remove mappings in an atomic
context (e.g. KVM/GRU).

If the rmap spinlocks are converted to semaphores then all callbacks will
be performed in a non-atomic context. No additional changes will be
necessary to this patchset.

V1->V2:
- page_referenced_one: Do not increment reference count if it is already
  != 0.
- Use rcu_assign_pointer and rcu_dereference instead of putting in our own
  barriers.
V2->V3:
- Fix rcu (thanks Paul)
- Fix exit code handling to come up with the right semantics for
  emm_referenced (thanks Andrea)
- Call mm_lock/mm_unlock to protect against registration races.

Acked-by: Paul E. McKenney [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm_types.h |   3 +
 include/linux/rmap.h     |  50 +++
 kernel/fork.c            |   3 +
 mm/Kconfig               |   5 ++
 mm/filemap_xip.c         |   4 +
 mm/fremap.c              |   2
 mm/hugetlb.c             |   3 +
 mm/memory.c              |  42 +++
 mm/mmap.c                |   3 +
 mm/mprotect.c            |   3 +
 mm/mremap.c              |   4 +
 mm/rmap.c                | 100 ++-
 12 files changed, 212 insertions(+), 10 deletions(-)

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h	2008-04-04 14:55:03.441593394 -0700
+++ linux-2.6/include/linux/mm_types.h	2008-04-04 15:07:38.857699751 -0700
@@ -225,6 +225,9 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 	struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_EMM_NOTIFIER
+	struct emm_notifier *emm_notifier;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/mm/Kconfig
===
--- linux-2.6.orig/mm/Kconfig	2008-04-04 14:55:03.457593678 -0700
+++ linux-2.6/mm/Kconfig	2008-04-04 15:07:38.857699751 -0700
@@ -193,3 +193,8 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config EMM_NOTIFIER
+	def_bool n
+	bool "External Mapped Memory Notifier for drivers directly mapping memory"
+
Index: linux-2.6/include/linux/rmap.h
===
--- linux-2.6.orig/include/linux/rmap.h	2008-04-04 14:55:03.449593554 -0700
+++ linux-2.6/include/linux/rmap.h	2008-04-04 15:08:51.522883171 -0700
@@ -85,6 +85,56 @@ static inline void page_dup_rmap(struct
 #endif
 
 /*
+ * Notifier for devices establishing their own references to Linux
+ * kernel pages in addition to the regular mapping via page
+ * table and rmap. The notifier allows the device to drop the mapping
+ * when the VM removes references to pages.
+ */ +enum emm_operation { + emm_release,/* Process exiting, */ + emm_invalidate_start, /* Before the VM unmaps pages */ + emm_invalidate_end, /* After the VM unmapped pages */ + emm_referenced /* Check if a range was referenced */ +}; + +struct emm_notifier { + int (*callback)(struct emm_notifier *e, struct mm_struct *mm, + enum emm_operation op, + unsigned long start, unsigned long end); + struct emm_notifier *next; +}; + +extern int __emm_notify(struct mm_struct
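To make the start/end contract from the changelog concrete, here is a rough userspace sketch of a driver-side callback. The enum and struct emm_notifier mirror the declarations in the hunk above; everything prefixed with demo_ is hypothetical, struct mm_struct is left opaque, and the sketch ignores the address range for brevity (a real driver would presumably track the invalidated ranges rather than a single counter).

```c
#include <assert.h>
#include <stddef.h>

struct mm_struct;	/* opaque here; only passed through */

/* Mirrors the enum in the patch. */
enum emm_operation {
	emm_release,		/* Process exiting */
	emm_invalidate_start,	/* Before the VM unmaps pages */
	emm_invalidate_end,	/* After the VM unmapped pages */
	emm_referenced		/* Check if a range was referenced */
};

/* Mirrors the struct in the patch. */
struct emm_notifier {
	int (*callback)(struct emm_notifier *e, struct mm_struct *mm,
			enum emm_operation op,
			unsigned long start, unsigned long end);
	struct emm_notifier *next;
};

/* Hypothetical driver state embedding the notifier as its first member. */
struct demo_driver {
	struct emm_notifier notifier;
	int invalidations_in_flight;	/* start callbacks without matching end */
	int released;			/* set once emm_release was seen */
};

static int demo_callback(struct emm_notifier *e, struct mm_struct *mm,
			 enum emm_operation op,
			 unsigned long start, unsigned long end)
{
	/* Valid because notifier is the first member of demo_driver. */
	struct demo_driver *d = (struct demo_driver *)e;

	(void)mm;
	(void)start;
	(void)end;

	switch (op) {
	case emm_invalidate_start:
		d->invalidations_in_flight++;
		break;
	case emm_invalidate_end:
		assert(d->invalidations_in_flight > 0);
		d->invalidations_in_flight--;
		break;
	case emm_release:
		d->released = 1;
		break;
	case emm_referenced:
		break;		/* this sketch never reports references */
	}
	return 0;
}

/* The driver may establish new external references only when this is true. */
static int demo_may_map(const struct demo_driver *d)
{
	return !d->released && d->invalidations_in_flight == 0;
}
```

The counter captures the rule stated in the changelog: between emm_invalidate_start and the matching emm_invalidate_end the driver holds off establishing new references.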
[ofa-general] [patch 08/10] xpmem: Locking rules for taking multiple mmap_sem locks.
This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

---
 mm/filemap.c | 3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c	2008-04-01 13:02:41.374608387 -0700
+++ linux-2.6/mm/filemap.c	2008-04-01 13:05:02.777015782 -0700
@@ -80,6 +80,9 @@ generic_file_direct_IO(int rw, struct ki
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
  *
+ *    When taking multiple mmap_sems, one should lock the lowest-addressed
+ *    one first proceeding on up to the highest-addressed one.
+ *
  *  ->i_mutex
  *    ->i_alloc_sem		(various)
  *
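The ordering rule documented by this patch can be sketched in userspace as follows. This is only an illustration under stated assumptions: struct demo_sem is a hypothetical stand-in that merely records whether it is held (it does not block), and demo_lock_pair/demo_unlock_pair are invented names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a semaphore; only tracks whether it is held. */
struct demo_sem {
	int held;
};

static void demo_down(struct demo_sem *s)
{
	assert(!s->held);
	s->held = 1;
}

static void demo_up(struct demo_sem *s)
{
	assert(s->held);
	s->held = 0;
}

/*
 * Take two semaphores, lowest address first, so that two tasks locking
 * the same pair can never wait on each other in opposite order.
 * Returns the semaphore that was taken first.
 */
static struct demo_sem *demo_lock_pair(struct demo_sem *a, struct demo_sem *b)
{
	if (a == b) {
		demo_down(a);
		return a;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		struct demo_sem *tmp = a;

		a = b;
		b = tmp;
	}
	demo_down(a);
	demo_down(b);
	return a;
}

/* Unlock order does not matter for deadlock avoidance. */
static void demo_unlock_pair(struct demo_sem *a, struct demo_sem *b)
{
	demo_up(a);
	if (b != a)
		demo_up(b);
}
```

Because every task agrees on the same global order (the address), a cycle of waiters cannot form, whichever order the caller names the two locks in.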
[ofa-general] [patch 03/10] emm: Move tlb flushing into free_pgtables
Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables(). Moving the tlb flushing into free_pgtables allows
sleeping in parts of free_pgtables().

This means that we do a tlb_finish_mmu() before freeing the page tables.
Strictly speaking there may not be the need to do another tlb flush after
freeing the tables. But it's the only way to free a series of page table
pages from the tlb list. And we do not want to call into the page allocator
for performance reasons. Aim9 numbers look okay after this patch.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm.h |  4 ++--
 mm/memory.c        | 14 ++
 mm/mmap.c          |  6 +++---
 3 files changed, 15 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h	2008-03-19 13:30:51.460856986 -0700
+++ linux-2.6/include/linux/mm.h	2008-03-19 13:31:20.809377398 -0700
@@ -751,8 +751,8 @@ int walk_page_range(const struct mm_stru
 			void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-		unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+		unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c	2008-03-19 13:29:06.007351495 -0700
+++ linux-2.6/mm/memory.c	2008-03-19 13:46:31.352774359 -0700
@@ -271,9 +271,11 @@ void free_pgd_range(struct mmu_gather **
 	} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-		unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+ unsigned long ceiling) { + struct mmu_gather *tlb; + while (vma) { struct vm_area_struct *next = vma-vm_next; unsigned long addr = vma-vm_start; @@ -285,8 +287,10 @@ void free_pgtables(struct mmu_gather **t unlink_file_vma(vma); if (is_vm_hugetlb_page(vma)) { - hugetlb_free_pgd_range(tlb, addr, vma-vm_end, + tlb = tlb_gather_mmu(vma-vm_mm, 0); + hugetlb_free_pgd_range(tlb, addr, vma-vm_end, floor, next? next-vm_start: ceiling); + tlb_finish_mmu(tlb, addr, vma-vm_end); } else { /* * Optimization: gather nearby vmas into one call down @@ -298,8 +302,10 @@ void free_pgtables(struct mmu_gather **t anon_vma_unlink(vma); unlink_file_vma(vma); } - free_pgd_range(tlb, addr, vma-vm_end, + tlb = tlb_gather_mmu(vma-vm_mm, 0); + free_pgd_range(tlb, addr, vma-vm_end, floor, next? next-vm_start: ceiling); + tlb_finish_mmu(tlb, addr, vma-vm_end); } vma = next; } Index: linux-2.6/mm/mmap.c === --- linux-2.6.orig/mm/mmap.c2008-03-19 13:29:48.659889667 -0700 +++ linux-2.6/mm/mmap.c 2008-03-19 13:30:36.296604891 -0700 @@ -1750,9 +1750,9 @@ static void unmap_region(struct mm_struc update_hiwater_rss(mm); unmap_vmas(tlb, vma, start, end, nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(tlb, vma, prev? prev-vm_end: FIRST_USER_ADDRESS, -next? next-vm_start: 0); tlb_finish_mmu(tlb, start, end); + free_pgtables(vma, prev? prev-vm_end: FIRST_USER_ADDRESS, +next? next-vm_start: 0); emm_notify(mm, emm_invalidate_end, start, end); } @@ -2049,8 +2049,8 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(tlb, vma, 0, -1, nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* * Walk the list again, actually closing and freeing it,
[ofa-general] [patch 05/10] emm: Remove tlb pointer from the parameters of unmap vmas
We no longer abort unmapping in unmap_vmas because we can reschedule while
unmapping since we are holding a semaphore. This would allow moving more
of the tlb flushing into unmap_vmas, reducing code in various places.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm.h |  3 +--
 mm/memory.c        | 43 +--
 mm/mmap.c          | 18 +++---
 3 files changed, 21 insertions(+), 43 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h	2008-04-01 13:02:41.374608387 -0700
+++ linux-2.6/include/linux/mm.h	2008-04-01 13:02:43.898651546 -0700
@@ -723,8 +723,7 @@ struct zap_details {
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-		struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
 
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c	2008-04-01 13:02:41.378608315 -0700
+++ linux-2.6/mm/memory.c	2008-04-01 13:02:43.902651345 -0700
@@ -806,7 +806,6 @@ static unsigned long unmap_page_range(st
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -818,20 +817,13 @@ static unsigned long unmap_page_range(st
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts.  This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
* * The VMA list must be sorted in ascending virtual address order. - * - * unmap_vmas() assumes that the caller will flush the whole unmapped address - * range after unmap_vmas() returns. So the only responsibility here is to - * ensure that any thus-far unmapped pages are flushed before unmap_vmas() - * drops the lock and schedules. */ -unsigned long unmap_vmas(struct mmu_gather **tlbp, - struct vm_area_struct *vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *details) { @@ -839,7 +831,15 @@ unsigned long unmap_vmas(struct mmu_gath unsigned long tlb_start = 0;/* For tlb_finish_mmu */ int tlb_start_valid = 0; unsigned long start = start_addr; - int fullmm = (*tlbp)-fullmm; + int fullmm; + struct mmu_gather *tlb; + struct mm_struct *mm = vma-vm_mm; + + emm_notify(mm, emm_invalidate_start, start_addr, end_addr); + lru_add_drain(); + tlb = tlb_gather_mmu(mm, 0); + update_hiwater_rss(mm); + fullmm = tlb-fullmm; for ( ; vma vma-vm_start end_addr; vma = vma-vm_next) { unsigned long end; @@ -866,7 +866,7 @@ unsigned long unmap_vmas(struct mmu_gath (HPAGE_SIZE / PAGE_SIZE); start = end; } else - start = unmap_page_range(*tlbp, vma, + start = unmap_page_range(tlb, vma, start, end, zap_work, details); if (zap_work 0) { @@ -874,13 +874,15 @@ unsigned long unmap_vmas(struct mmu_gath break; } - tlb_finish_mmu(*tlbp, tlb_start, start); + tlb_finish_mmu(tlb, tlb_start, start); cond_resched(); - *tlbp = tlb_gather_mmu(vma-vm_mm, fullmm); + tlb = tlb_gather_mmu(vma-vm_mm, fullmm); tlb_start_valid = 0; zap_work = ZAP_BLOCK_SIZE; } } + tlb_finish_mmu(tlb, start_addr, end_addr); + emm_notify(mm, emm_invalidate_end, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -894,21 +896,10 @@ unsigned long unmap_vmas(struct mmu_gath unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned 
long size, struct
[ofa-general] [patch 07/10] xpmem: This patch exports zap_page_range as it is needed by XPMEM.
XPMEM would have used sys_madvise() except that madvise_dontneed()
returns -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator. XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

---
 mm/memory.c | 1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c	2008-04-01 13:02:43.902651345 -0700
+++ linux-2.6/mm/memory.c	2008-04-01 13:04:43.720691616 -0700
@@ -901,6 +901,7 @@ unsigned long zap_page_range(struct vm_a
 
 	return unmap_vmas(vma, address, end, &nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.
[ofa-general] Re: [patch 01/10] emm: mm_lock: Lock a process against reclaim
Christoph Lameter wrote: Provide a way to lock an mm_struct against reclaim (try_to_unmap etc). This is necessary for the invalidate notifier approaches so that they can reliably add and remove a notifier. Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED] Signed-off-by: Christoph Lameter [EMAIL PROTECTED] --- include/linux/mm.h | 10 mm/mmap.c | 66 + 2 files changed, 76 insertions(+) Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h 2008-04-02 11:41:47.741678873 -0700 +++ linux-2.6/include/linux/mm.h2008-04-04 15:02:17.660504756 -0700 @@ -1050,6 +1050,16 @@ extern int install_special_mapping(struc unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); +/* + * Locking and unlocking am mm against reclaim. + * + * mm_lock will take mmap_sem writably (to prevent additional vmas from being + * added) and then take all mapping locks of the existing vmas. With that + * reclaim is effectively stopped. + */ +extern void mm_lock(struct mm_struct *mm); +extern void mm_unlock(struct mm_struct *mm); + extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, Index: linux-2.6/mm/mmap.c === --- linux-2.6.orig/mm/mmap.c2008-04-04 14:55:03.477593980 -0700 +++ linux-2.6/mm/mmap.c 2008-04-04 14:59:05.505395402 -0700 @@ -2242,3 +2242,69 @@ int install_special_mapping(struct mm_st return 0; } + +static void mm_lock_unlock(struct mm_struct *mm, int lock) +{ + struct vm_area_struct *vma; + spinlock_t *i_mmap_lock_last, *anon_vma_lock_last; + + i_mmap_lock_last = NULL; + for (;;) { + spinlock_t *i_mmap_lock = (spinlock_t *) -1UL; + for (vma = mm-mmap; vma; vma = vma-vm_next) + if (vma-vm_file vma-vm_file-f_mapping I think you can break this if() down a bit: if (!(vma-vm_file vma-vm_file-f_mapping)) continue; + (unsigned long) i_mmap_lock + (unsigned long) + vma-vm_file-f_mapping-i_mmap_lock + (unsigned 
long) + vma-vm_file-f_mapping-i_mmap_lock + (unsigned long) i_mmap_lock_last) + i_mmap_lock = + vma-vm_file-f_mapping-i_mmap_lock; So this is an O(n^2) algorithm to take the i_mmap_locks from low to high order? A comment would be nice. And O(n^2)? Ouch. How often is it called? And is it necessary to mush lock and unlock together? Unlock ordering doesn't matter, so you should just be able to have a much simpler loop, no? + if (i_mmap_lock == (spinlock_t *) -1UL) + break; + i_mmap_lock_last = i_mmap_lock; + if (lock) + spin_lock(i_mmap_lock); + else + spin_unlock(i_mmap_lock); + } + + anon_vma_lock_last = NULL; + for (;;) { + spinlock_t *anon_vma_lock = (spinlock_t *) -1UL; + for (vma = mm-mmap; vma; vma = vma-vm_next) + if (vma-anon_vma + (unsigned long) anon_vma_lock + (unsigned long) vma-anon_vma-lock + (unsigned long) vma-anon_vma-lock + (unsigned long) anon_vma_lock_last) + anon_vma_lock = vma-anon_vma-lock; + if (anon_vma_lock == (spinlock_t *) -1UL) + break; + anon_vma_lock_last = anon_vma_lock; + if (lock) + spin_lock(anon_vma_lock); + else + spin_unlock(anon_vma_lock); + } +} + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. The holder + * must not hold any mm related lock. A single task can't take more + * than one mm lock in a row or it would deadlock. + */ +void mm_lock(struct mm_struct * mm) +{ + down_write(mm-mmap_sem); + mm_lock_unlock(mm, 1); +} + +void mm_unlock(struct mm_struct *mm) +{ + mm_lock_unlock(mm, 0); + up_write(mm-mmap_sem); +} ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit
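On the O(n^2) question raised in the review: one conceivable alternative (not what the patch does, and not something the reviewer spelled out) is to collect the lock pointers once, sort them by address and drop duplicates, then walk the array. The sketch below shows only that sort-and-deduplicate step in userspace C. The function names are hypothetical, and whether allocating such an array is acceptable on this kernel path is an open assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: order lock pointers by their numeric address. */
static int cmp_addr(const void *pa, const void *pb)
{
	uintptr_t a = (uintptr_t)*(void *const *)pa;
	uintptr_t b = (uintptr_t)*(void *const *)pb;

	return (a > b) - (a < b);
}

/*
 * Sort locks[] by address and remove duplicates in place (several vmas
 * may share one lock). Returns the number of unique entries, which can
 * then be locked in array order, lowest address first.
 */
static size_t sort_unique_locks(void **locks, size_t n)
{
	size_t i, out = 0;

	if (n == 0)
		return 0;
	qsort(locks, n, sizeof(*locks), cmp_addr);
	for (i = 0; i < n; i++)
		if (out == 0 || locks[out - 1] != locks[i])
			locks[out++] = locks[i];
	return out;
}
```

This brings the ordering work down to O(n log n) comparisons, and since unlock order does not matter, the unlock path could simply walk the same array without any ordering logic.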
Re: [ofa-general] XmtDiscards
Hello Boris,

On Fri, Apr 04, 2008 at 03:28:46PM -0700, Boris Shpolyansky wrote:
> Hi Bernd,
>
> You can configure the HOQ (Head-Of-Queue-Lifetime) value programmed in
> any switch in the fabric managed by OpenSM following these simple steps:
>
> 1. Stop the SM
>      /etc/init.d/opensmd stop
> 2. Run the SM manually with the -c option (to dump its default
>    configuration to a file)
>      opensm -c
> 3. Kill the SM with ^C
> 4. The configuration is saved in /var/cache/opensm/opensm.opts. Open
>    the file and look for head_of_queue_lifetime. Change the value and
>    save the file.
> 5. Restart the SM
>      /etc/init.d/opensmd start

Thanks a lot for your help. This did help quite a lot.

> P.S. You might find 'opensm -h' and 'man opensm' useful.

Sorry about my dumb question. I have read the opensm man page quite often
already, but --cache-options and OSM_CACHE_DIR triggered my brain-internal
filter to skip this part of the man page entirely ;) Somehow I associated
cache with opensm performance, but not at all with options...

Thanks again,
Bernd
[ofa-general] [PATCH 2/4][v2] dapl: add support for logging errors in non-debug build.
Add debug logging (stdout, syslog) for error cases during device open, cm, async, and dto operations. Default settings are ERR for DAPL_DBG_TYPE, and stdout for DAPL_DBG_DEST. Change default configuration to build non-debug.

Signed-off by: Arlin Davis [EMAIL PROTECTED]
---
 configure.in                   |    4 +-
 dapl/common/dapl_debug.c       |    2 -
 dapl/common/dapl_evd_util.c    |    8 +-
 dapl/include/dapl_debug.h      |   10 ++-
 dapl/openib_cma/dapl_ib_cm.c   |  196 +++-
 dapl/openib_cma/dapl_ib_util.c |   87 +-
 dapl/udapl/dapl_init.c         |   16 +++-
 dapl/udapl/linux/dapl_osd.h    |    2 +-
 8 files changed, 179 insertions(+), 146 deletions(-)

diff --git a/configure.in b/configure.in
index eaf597b..d1c2664 100644
--- a/configure.in
+++ b/configure.in
@@ -42,12 +42,12 @@ AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test $ac_cv_version_script = yes)
 
 dnl Support debug mode build - if enable-debug provided the DEBUG variable is set
 AC_ARG_ENABLE(debug,
-[ --enable-debug Turn on debug mode, default=on],
+[ --enable-debug Turn on debug mode, default=off],
 [case ${enableval} in
 yes) debug=true ;;
 no) debug=false ;;
 *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;;
-esac],[debug=true])
+esac],[debug=false])
 AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
 
 dnl Support ib_extension build - if enable-ext-type == ib
diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c
index 7ddce52..cbc356c 100644
--- a/dapl/common/dapl_debug.c
+++ b/dapl/common/dapl_debug.c
@@ -32,7 +32,6 @@
 #include <stdlib.h>
 #endif /* __KDAPL__ */
 
-#ifdef DAPL_DBG
 DAPL_DBG_TYPE g_dapl_dbg_type; /* initialized in dapl_init.c */
 DAPL_DBG_DEST g_dapl_dbg_dest; /* initialized in dapl_init.c */
 
@@ -117,5 +116,4 @@ void dapl_dump_cntr( int cntr )
 }
 
 #endif /* DAPL_COUNTERS */
-#endif
 
diff --git a/dapl/common/dapl_evd_util.c b/dapl/common/dapl_evd_util.c
index a993b02..2ae1b59 100755
--- a/dapl/common/dapl_evd_util.c
+++ b/dapl/common/dapl_evd_util.c
@@ -1209,10 +1209,10 @@ dapli_evd_cqe_to_event (
             dapl_os_unlock ( &ep_ptr->header.lock );
         }
 
-        dapl_dbg_log (DAPL_DBG_TYPE_DTO_COMP_ERR,
-                      "DTO completion ERROR: %d: op %#x (ep disconnected)\n",
-                      DAPL_GET_CQE_STATUS (cqe_ptr),
-                      DAPL_GET_CQE_OPTYPE (cqe_ptr));
+        dapl_log(DAPL_DBG_TYPE_ERR,
+                 "DTO completion ERR: status %d, opcode %s \n",
+                 DAPL_GET_CQE_STATUS(cqe_ptr),
+                 DAPL_GET_CQE_OP_STR(cqe_ptr));
     }
 }
 
diff --git a/dapl/include/dapl_debug.h b/dapl/include/dapl_debug.h
index 76db8fd..f0de7c8 100644
--- a/dapl/include/dapl_debug.h
+++ b/dapl/include/dapl_debug.h
@@ -75,14 +75,16 @@ typedef enum
     DAPL_DBG_DEST_SYSLOG = 0x0002,
 } DAPL_DBG_DEST;
 
-
-#if defined(DAPL_DBG)
-
 extern DAPL_DBG_TYPE g_dapl_dbg_type;
 extern DAPL_DBG_DEST g_dapl_dbg_dest;
 
+extern void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char *fmt, ...);
+
+#define dapl_log g_dapl_dbg_type==0 ? (void) 1 : dapl_internal_dbg_log
+
+#if defined(DAPL_DBG)
+
 #define dapl_dbg_log g_dapl_dbg_type==0 ? (void) 1 : dapl_internal_dbg_log
-extern void dapl_internal_dbg_log ( DAPL_DBG_TYPE type, const char *fmt, ...);
 
 #else /* !DAPL_DBG */
 
diff --git a/dapl/openib_cma/dapl_ib_cm.c b/dapl/openib_cma/dapl_ib_cm.c
index a040ffb..33f299d 100755
--- a/dapl/openib_cma/dapl_ib_cm.c
+++ b/dapl/openib_cma/dapl_ib_cm.c
@@ -95,9 +95,9 @@ static void dapli_addr_resolve(struct dapl_cm_id *conn)
 
     ret = rdma_resolve_route(conn->cm_id, conn->route_timeout);
     if (ret) {
-        dapl_dbg_log(DAPL_DBG_TYPE_ERR,
-                     "rdma_connect failed: %s\n",strerror(errno));
-
+        dapl_log(DAPL_DBG_TYPE_ERR,
+                 "dapl_cma_connect: rdma_resolve_route ERR %d %s\n",
+                 ret, strerror(errno));
         dapl_evd_connection_callback(conn,
                                      IB_CME_LOCAL_FAILURE,
                                      NULL, conn->ep);
@@ -146,8 +146,9 @@ static void dapli_route_resolve(struct dapl_cm_id *conn)
 
     ret = rdma_connect(conn->cm_id, &conn->params);
     if (ret) {
-        dapl_dbg_log(DAPL_DBG_TYPE_ERR, "rdma_connect failed: %s\n",
-                     strerror(errno));
+        dapl_log(DAPL_DBG_TYPE_ERR,
+                 "dapl_cma_connect: rdma_connect ERR %d %s\n",
+                 ret, strerror(errno));
         goto bail;
     }
     return;
@@ -310,12 +311,15 @@ static void dapli_cm_active_cb(struct dapl_cm_id *conn,
     case RDMA_CM_EVENT_UNREACHABLE:
     case RDMA_CM_EVENT_CONNECT_ERROR:
     {
-        dapl_dbg_log(
-            DAPL_DBG_TYPE_WARN,
-            "dapli_cm_active_handler: CONN_ERR "
-
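The filtering that the patch description mentions (a message type mask defaulting to ERR, and a destination defaulting to stdout) can be pictured with a small stand-alone sketch. This is not the DAPL implementation: every `sketch_` name is hypothetical, and the bit values are illustrative except for the syslog destination, which mirrors `DAPL_DBG_DEST_SYSLOG = 0x0002` from the patch.

```c
#include <stdarg.h>
#include <stdio.h>
#include <syslog.h>

/* Hypothetical type and destination bits (assumed values). */
enum sketch_type { SKETCH_TYPE_ERR = 0x0001, SKETCH_TYPE_WARN = 0x0002 };
enum sketch_dest { SKETCH_DEST_STDOUT = 0x0001, SKETCH_DEST_SYSLOG = 0x0002 };

static unsigned sketch_type_mask = SKETCH_TYPE_ERR;    /* default: errors only */
static unsigned sketch_dest_mask = SKETCH_DEST_STDOUT; /* default: stdout */

/* Returns 1 if the message passed the type filter, 0 if it was dropped. */
static int sketch_log(unsigned type, const char *fmt, ...)
{
    va_list args;

    if (!(type & sketch_type_mask))
        return 0;

    if (sketch_dest_mask & SKETCH_DEST_STDOUT) {
        va_start(args, fmt);
        vfprintf(stdout, fmt, args);
        va_end(args);
    }
    if (sketch_dest_mask & SKETCH_DEST_SYSLOG) {
        char buf[256];

        /* Format first, then hand a plain string to syslog(). */
        va_start(args, fmt);
        vsnprintf(buf, sizeof(buf), fmt, args);
        va_end(args);
        syslog(LOG_USER | LOG_DEBUG, "%s", buf);
    }
    return 1;
}
```

With the defaults above, `sketch_log(SKETCH_TYPE_ERR, ...)` prints to stdout while a WARN message is dropped, which is roughly the behaviour the patch describes for non-debug builds.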
[ofa-general] [PATCH 4/4][v2] dapl: update vendor information for OFA v2 provider
Signed-off by: Arlin Davis [EMAIL PROTECTED]
---
 dapl/include/dapl_vendor.h |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/dapl/include/dapl_vendor.h b/dapl/include/dapl_vendor.h
index e87467a..f6d3cc0 100644
--- a/dapl/include/dapl_vendor.h
+++ b/dapl/include/dapl_vendor.h
@@ -52,14 +52,14 @@
  * Product name of the adapter.
  * Returned in DAT_IA_ATTR.adapter_name
  */
-#define VN_ADAPTER_NAME "Generic InfiniBand HCA"
+#define VN_ADAPTER_NAME "Generic OpenFabrics HCA"
 
 /*
  * Vendor name
  * Returned in DAT_IA_ATTR.vendor_name
  */
-#define VN_VENDOR_NAME "DAPL Reference Implementation"
+#define VN_VENDOR_NAME "DAPL OpenFabrics Implementation"
 
 /**
@@ -78,7 +78,7 @@
  * DAT_PROVIDER_ATTR.provider_version_minor
  */
 
-#define VN_PROVIDER_MAJOR 1
+#define VN_PROVIDER_MAJOR 2
 #define VN_PROVIDER_MINOR 0
 
 /*
-- 
1.5.2.5
[ofa-general] [PATCH 3/4][v2] dapl: add provider vendor revision data in private data with reject
Add 1 byte header containing provider/vendor major revision to distinguish between consumer and non-consumer rejects. Validate size of consumer reject private data.

Signed-off by: Arlin Davis [EMAIL PROTECTED]
---
 dapl/openib_cma/dapl_ib_cm.c   |   39 ---
 dapl/openib_cma/dapl_ib_util.h |    2 +-
 2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_cm.c b/dapl/openib_cma/dapl_ib_cm.c
index 33f299d..dcdcc5b 100755
--- a/dapl/openib_cma/dapl_ib_cm.c
+++ b/dapl/openib_cma/dapl_ib_cm.c
@@ -45,6 +45,7 @@
 #include "dapl_cr_util.h"
 #include "dapl_name_service.h"
 #include "dapl_ib_util.h"
+#include "dapl_vendor.h"
 #include <sys/poll.h>
 #include <signal.h>
 #include <sys/socket.h>
@@ -79,6 +80,14 @@ static inline uint64_t cpu_to_be64(uint64_t x) { return x; }
 
 #define PORT_TO_SID(p) ntohs(p)
 
+/* private data header to validate consumer rejects versus abnormal events */
+struct dapl_pdata_hdr {
+    uint8_t version;
+};
+static struct dapl_pdata_hdr pdata_hdr = {
+    .version = VN_PROVIDER_MAJOR
+};
+
 static void dapli_addr_resolve(struct dapl_cm_id *conn)
 {
     int ret;
@@ -900,6 +909,7 @@ dapls_ib_reject_connection(
     IN const DAT_PVOID private_data)
 {
     int ret;
+    int offset = sizeof(struct dapl_pdata_hdr);
 
     dapl_dbg_log(DAPL_DBG_TYPE_CM,
                  "reject(cm_handle %p reason %x)\n",
@@ -909,14 +919,29 @@ dapls_ib_reject_connection(
         dapl_dbg_log(DAPL_DBG_TYPE_ERR,
                      "reject: invalid handle: reason %d\n",
                      reason);
-        return DAT_SUCCESS;
+        return DAT_ERROR (DAT_INVALID_HANDLE,DAT_INVALID_HANDLE_CR);
     }
-
+
+    if (private_data_size >
+        dapls_ib_private_data_size(
+            NULL, IB_MAX_REJ_PDATA_SIZE, cm_handle->hca))
+        return DAT_ERROR(DAT_INVALID_PARAMETER, DAT_INVALID_ARG3);
+
+    /* setup pdata_hdr and users data, in CR pdata buffer */
+    dapl_os_memcpy(cm_handle->p_data, &pdata_hdr, offset);
+    if (private_data_size)
+        dapl_os_memcpy(cm_handle->p_data+offset,
+                       private_data,
+                       private_data_size);
+
     /*
-     * Private data is needed so peer can determine real application
-     * reject from an abnormal application termination
+     * Always some private data with reject so active peer can
+     * determine real application reject from an abnormal
+     * application termination
      */
-    ret = rdma_reject(cm_handle->cm_id, NULL, 0);
+    ret = rdma_reject(cm_handle->cm_id,
+                      cm_handle->p_data,
+                      offset+private_data_size);
 
     dapli_destroy_conn(cm_handle);
     return dapl_convert_errno(ret, "reject");
@@ -1005,7 +1030,7 @@ int dapls_ib_private_data_size( IN DAPL_PRIVATE *prd_ptr,
 
     if (hca_ptr->ib_hca_handle->device->transport_type
         == IBV_TRANSPORT_IWARP)
-        return(IWARP_MAX_PDATA_SIZE);
+        return(IWARP_MAX_PDATA_SIZE-sizeof(struct dapl_pdata_hdr));
 
     switch(conn_op)
     {
@@ -1016,7 +1041,7 @@ int dapls_ib_private_data_size( IN DAPL_PRIVATE *prd_ptr,
         size = IB_MAX_REP_PDATA_SIZE;
         break;
     case DAPL_PDATA_CONN_REJ:
-        size = IB_MAX_REJ_PDATA_SIZE;
+        size = IB_MAX_REJ_PDATA_SIZE-sizeof(struct dapl_pdata_hdr);
         break;
     case DAPL_PDATA_CONN_DREQ:
         size = IB_MAX_DREQ_PDATA_SIZE;
diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h
index f35cb9d..370f3b1 100755
--- a/dapl/openib_cma/dapl_ib_util.h
+++ b/dapl/openib_cma/dapl_ib_util.h
@@ -181,7 +181,7 @@ struct dapl_cm_id {
     struct rdma_conn_param params;
     DAT_SOCK_ADDR6 r_addr;
     int p_len;
-    unsigned char p_data[IB_MAX_DREP_PDATA_SIZE];
+    unsigned char p_data[256]; /* dapl max private data size */
 };
 
 typedef struct dapl_cm_id *dp_ib_cm_handle_t;
-- 
1.5.2.5
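The header scheme in this patch can be exercised outside DAPL with a minimal sketch. It assumes the layout the patch describes (one version byte followed by the consumer's private data); the `sketch_` names are hypothetical, and the receiving-side check is only a guess at how an active peer might use the header, since the patch itself shows the sending side only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PROVIDER_MAJOR 2 /* mirrors VN_PROVIDER_MAJOR after patch 4/4 */

struct sketch_pdata_hdr {
    uint8_t version;
};

/*
 * Pack the 1-byte header followed by the consumer's private data into buf.
 * Returns the total length to send, or -1 if the data does not fit.
 */
static int sketch_pack_reject(uint8_t *buf, size_t buf_len,
                              const void *pdata, size_t pdata_len)
{
    const struct sketch_pdata_hdr hdr = { .version = SKETCH_PROVIDER_MAJOR };
    size_t offset = sizeof(hdr);

    if (buf_len < offset || pdata_len > buf_len - offset)
        return -1;

    memcpy(buf, &hdr, offset);
    if (pdata_len)
        memcpy(buf + offset, pdata, pdata_len);
    return (int)(offset + pdata_len);
}

/*
 * Assumed receiver-side check: a reject carrying the expected version byte
 * is treated as a consumer reject, anything else as an abnormal event.
 */
static int sketch_is_consumer_reject(const uint8_t *buf, size_t len)
{
    return len >= sizeof(struct sketch_pdata_hdr) &&
           buf[0] == SKETCH_PROVIDER_MAJOR;
}
```

Note that the usable consumer payload shrinks by `sizeof(struct sketch_pdata_hdr)`, which is why the patch subtracts the header size in `dapls_ib_private_data_size()`.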
Re: [ofa-general] XmtDiscards
On Fri, Apr 04, 2008 at 03:29:32PM -0700, Ira Weiny wrote:
> On Sat, 5 Apr 2008 00:12:39 +0200 Bernd Schubert [EMAIL PROTECTED] wrote:
> > Hello, after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten much better there, at least no further RcvSwRelayErrors, even when the cluster is in idle state and so far also no SymbolErrors, which we also have seen before. However, after I just started a lustre stress test on 50 clients (to a lustre storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about 9000 XmtDiscards within 30 minutes.
>
> Yea, those are bad.
>
> > Searching for this error I find "This is a symptom of congestion and may require tweaking either HOQ or switch lifetime values". Well, I have to admit I neither know what HOQ is, nor do I know how to tweak it. I also do not have an idea how to set switch lifetime values. I guess this isn't related to the opensm timeout option, is it?
>
> Yes you should adjust these values.
>
> > Hmm, I just found a Cisco pdf describing how to set the lifetime on these switches, but is this also possible on Flextronics switches?
>
> I don't know about the Vendor SMs but in opensm look for the following options in the opensm.opts file (Default path is: /var/cache/opensm):
>
> # The code of maximal time a packet can wait at the head of
> # transmission queue.
> # The actual time is 4.096usec * 2^head_of_queue_lifetime
> # The value 0x14 disables this mechanism
> head_of_queue_lifetime 0x12
>
> # The maximal time a packet can wait at the head of queue on
> # switch port connected to a CA or router port
> leaf_head_of_queue_lifetime 0x0c

Hmm, I first increased head_of_queue_lifetime to 0x13 and leaf_head_of_queue_lifetime to 0x20, but this didn't make the error go away. So I increased head_of_queue_lifetime to 0x15 and leaf_head_of_queue_lifetime to 0x50, but this made the fabric crash entirely. On the node of the master opensm I got an endless number of messages like these:

Apr 5 01:35:03 pfs1n2 kernel: [705448.344542] NETDEV WATCHDOG: ib0: transmit timed out
Apr 5 01:35:03 pfs1n2 kernel: [705448.349814] ib0: transmit timeout: latency 411908 msecs
Apr 5 01:35:03 pfs1n2 kernel: [705448.355364] ib0: queue stopped 1, tx_head 441, tx_tail 377
Apr 5 01:35:04 pfs1n2 kernel: [705449.343495] NETDEV WATCHDOG: ib0: transmit timed out

The slave opensm also went into D-state and is not killable anymore :( Seems I have to be very careful with these settings...

Thanks for your help,
Bernd
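To get a feel for the values discussed above, the formula from the opensm.opts comment (4.096 usec * 2^head_of_queue_lifetime) can be worked through in a few lines. This is only an arithmetic sketch: the function name is hypothetical, and returning 0 for codes of 0x14 and above is an assumption made here to represent "mechanism disabled".

```c
#include <stdint.h>

#define SKETCH_HOQ_DISABLED 0x14 /* per the opensm.opts comment */

/*
 * HOQ lifetime in nanoseconds for a given code: 4.096 usec is 4096 ns,
 * so the time is 4096 << code. Returns 0 when the code is 0x14 or above
 * (treated as "no limit" in this sketch).
 */
static uint64_t sketch_hoq_lifetime_ns(unsigned code)
{
    if (code >= SKETCH_HOQ_DISABLED)
        return 0;
    return UINT64_C(4096) << code;
}
```

Under this formula the default 0x12 works out to 1,073,741,824 ns (about 1.07 s), 0x13 to about 2.15 s, and the leaf default 0x0c to 16,777,216 ns (about 16.8 ms), so each step of the code doubles the time a packet may wait.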
RE: [ofa-general] [PATCH 3/4][v2] dapl: add provider vendor revision data in private data with reject
> Add 1 byte header containing provider/vendor major revision to distinguish between consumer and non-consumer rejects. Validate size of consumer reject private data.

Not saying this is a bad idea, but doesn't it break the protocol with existing DAPL? It also shifts all of the existing private data off by a byte, which could result in odd data alignment.

- Sean
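The alignment concern can be illustrated with a small sketch: once a 1-byte header precedes the consumer data, a 32-bit field that used to start at offset 0 now starts at offset 1. The helper below is hypothetical and simply shows the usual way to read such a field without dereferencing a misaligned pointer.

```c
#include <stdint.h>
#include <string.h>

/*
 * Read a 32-bit value that sits directly after a 1-byte header.
 * Casting (pdata + 1) to uint32_t * could fault or be slow on strict
 * alignment architectures; memcpy into a local avoids that.
 */
static uint32_t sketch_read_u32_after_hdr(const uint8_t *pdata)
{
    uint32_t value;

    memcpy(&value, pdata + 1, sizeof(value));
    return value;
}
```

A consumer that casts the private data pointer straight to a struct would need a change like this, which is one way the extra byte can affect existing applications.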
[ofa-general] Re: [PATCH] mmu notifier #v11
On Fri, Apr 04, 2008 at 03:06:18PM -0700, Christoph Lameter wrote:
> Adds some comments. Still objectionable is the multiple ways of invalidating pages in #v11. Callout now has similar locking to emm.

range_begin exists because range_end is called after the page has already been freed. invalidate_page is called _before_ the page is freed but _after_ the pte has been zapped.

In short, when working with single pages it's a waste to block the secondary-mmu page fault, because it's zero cost to call invalidate_page before put_page. Not even GRU needs to do that.

For multiple-pte zapping, by contrast, we have to call range_end _after_ the page is already freed, so that there is a single range_end call for a huge amount of address space. So we need a range_begin for the subsystems not using page pinning, for example.

When working with single pages (try_to_unmap_one, do_wp_page), invalidate_page avoids blocking the secondary mmu page fault, and is in turn faster.

Besides avoiding the need to serialize the secondary mmu page fault, invalidate_page also reduces the overhead when the mmu notifiers are disarmed (i.e. kvm not running).
[ofa-general] Re: [patch 01/10] emm: mm_lock: Lock a process against reclaim
On Fri, Apr 04, 2008 at 04:12:42PM -0700, Jeremy Fitzhardinge wrote:
> I think you can break this if() down a bit:
>
>     if (!(vma->vm_file && vma->vm_file->f_mapping))
>         continue;

It makes no difference at runtime, coding style preferences are quite subjective.

> So this is an O(n^2) algorithm to take the i_mmap_locks from low to high order? A comment would be nice. And O(n^2)? Ouch. How often is it called?

It's called a single time when the mmu notifier is registered. It's a very slow path of course. Any other approach to reduce the complexity would require memory allocations and it would require mmu_notifier_register to return -ENOMEM failure. It didn't seem worth it.

> And is it necessary to mush lock and unlock together? Unlock ordering doesn't matter, so you should just be able to have a much simpler loop, no?

That avoids duplicating .text. Originally they were separated. unlock can't be a simpler loop because I didn't reserve vm_flags bitflags to do a single O(N) loop for unlock. If you do malloc+fork+munmap two vmas will point to the same anon-vma lock, that's why the unlock isn't simpler unless I mark what I locked with a vm_flags bitflag.
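The trade-off discussed here (quadratic time in exchange for no memory allocation) can be sketched in user space. The code below is not the patch's code: `struct sketch_lock` is a toy stand-in for the i_mmap/anon-vma locks, and all names are hypothetical. It takes every distinct lock in ascending address order by repeatedly scanning for the lowest address above the last one taken, which also skips duplicates such as two vmas sharing one lock.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy lock: "held" counts how many times it was taken. */
struct sketch_lock {
    int held;
};

/*
 * Take each distinct lock referenced by locks[0..n-1] exactly once, in
 * ascending address order. Each outer pass is a linear scan, so the whole
 * thing is O(n^2), but it needs no allocation and no per-entry marker.
 * Returns the number of distinct locks taken.
 */
static size_t sketch_lock_all(struct sketch_lock **locks, size_t n)
{
    uintptr_t last = 0;
    size_t taken = 0;

    for (;;) {
        struct sketch_lock *next = NULL;

        for (size_t i = 0; i < n; i++) {
            uintptr_t addr = (uintptr_t)locks[i];

            if (locks[i] == NULL)
                continue;
            /* Candidate: above the last lock taken, lowest seen so far. */
            if (addr > last && (next == NULL || addr < (uintptr_t)next))
                next = locks[i];
        }
        if (next == NULL)
            break;

        next->held++; /* a real implementation would take the lock here */
        last = (uintptr_t)next;
        taken++;
    }
    return taken;
}
```

Because duplicates are skipped only by the address comparison, an unlock pass has to repeat the same scan (or carry a per-entry "locked" flag) to avoid releasing a shared lock twice, which matches the point made above about unlock not being a simpler loop.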
RE: [ofa-general] XmtDiscards
Bernd,

0x14 is the maximal value for HOQ lifetime, which effectively disables the mechanism. I think you shouldn't exceed this value.

Boris
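A configuration check along the lines of this advice might look like the sketch below. It assumes, following the message above, that 0x14 is the largest acceptable code; the function name is hypothetical and this is not something opensm is known to provide.

```c
/*
 * Returns 1 if the code is within the range suggested above
 * (0x00 through 0x14, where 0x14 disables the mechanism), 0 otherwise.
 */
static int sketch_hoq_code_ok(unsigned code)
{
    return code <= 0x14;
}
```

By this check the values 0x15, 0x20 and 0x50 tried earlier in the thread would all be rejected, while 0x13 and the defaults 0x12 and 0x0c pass.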
[ofa-general] Re: [patch 02/10] emm: notifier logic
On Fri, Apr 04, 2008 at 03:30:50PM -0700, Christoph Lameter wrote:
> +	mm_lock(mm);
> +	e->next = mm->emm_notifier;
> +	/*
> +	 * The update to emm_notifier (e->next) must be visible
> +	 * before the pointer becomes visible.
> +	 * rcu_assign_pointer() does exactly what we need.
> +	 */
> +	rcu_assign_pointer(mm->emm_notifier, e);
> +	mm_unlock(mm);

My mm_lock solution makes all rcu serialization an unnecessary overhead so you should remove it like I already did in #v11. If it wasn't the case, then mm_lock wouldn't be a definitive fix for the race.

> +	e = rcu_dereference(e->next);

Same here.
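The ordering requirement in the quoted comment (the new entry's next pointer must be visible before the entry itself becomes reachable) can be sketched in user space with C11 atomics. This is only an analogue: it does not use the kernel's RCU API, the `sketch_` names are hypothetical, and writers are assumed to be serialized by some outer lock, in the way mm_lock is used in the quoted patch.

```c
#include <stdatomic.h>
#include <stddef.h>

struct sketch_notifier {
    struct sketch_notifier *next;
    int id;
};

/* List head; zero-initialized, so the list starts empty. */
static _Atomic(struct sketch_notifier *) sketch_head;

/* Writer side; callers are assumed to hold an outer lock. */
static void sketch_register(struct sketch_notifier *e)
{
    e->next = atomic_load_explicit(&sketch_head, memory_order_relaxed);
    /* Release store: e->next is written before e becomes reachable. */
    atomic_store_explicit(&sketch_head, e, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above. */
static int sketch_sum_ids(void)
{
    int sum = 0;
    struct sketch_notifier *e =
        atomic_load_explicit(&sketch_head, memory_order_acquire);

    for (; e != NULL; e = e->next)
        sum += e->id;
    return sum;
}
```

If readers were fully excluded while the writer runs, as the reply above argues mm_lock achieves, plain loads and stores would be enough and the release/acquire pair would only add overhead.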