date:20080404

[ofa-general] [PATCH] parse_node_map: print parse errors

2008-04-04 Thread Bernd Schubert

Hello,

could you please add the patch below? Without it I probably never would have
realized why my node name map was not accepted.

Btw, I'm a bit surprised there don't seem to be any default wrappers for
fopen(), fclose(), malloc(), fprintf(), etc.

diff -rup opensm-3.2.1.old/complib/cl_nodenamemap.c opensm-3.2.1/complib/cl_nodenamemap.c
--- opensm-3.2.1.old/complib/cl_nodenamemap.c   2008-04-03 13:17:35.0 +0200
+++ opensm-3.2.1/complib/cl_nodenamemap.c   2008-04-04 11:09:42.0 +0200
@@ -55,8 +55,11 @@ static int map_name(void *cxt, uint64_t 
 		return 0;
 
 	item = malloc(sizeof(*item));
-	if (!item)
+	if (!item) {
+		fprintf(stderr, "Malloc failed, sizeof(*item) = %d.\n", sizeof(*item));
 		return -1;
+	}
+
 	item->guid = guid;
 	item->name = strdup(p);
 	cl_qmap_insert(map, item->guid, (cl_map_item_t *)item);
@@ -169,6 +172,8 @@ int parse_node_map(const char *file_name
 		guid = strtoull(p, &e, 0);
 		if (e == p || (!isspace(*e) && *e != '#' && *e != '\0')) {
 			fclose(f);
+			fprintf (stderr, "%s: Parse error in line: %s\n",
+				 __func__, line);
 			return -1;
 		}
 

Thanks,
Bernd
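
The check that the patch instruments can be tried outside opensm. The following is a minimal sketch, not the complib code: parse_guid_field() is a hypothetical helper that applies the same strtoull() test as parse_node_map() and reports the offending line on stderr. The added ERANGE check is an extra assumption that is not in the patch.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of complib): parse the leading GUID of a
 * node name map line and report a parse error instead of failing silently.
 * On success, *rest points just past the GUID.
 */
static int parse_guid_field(const char *line, uint64_t *guid, const char **rest)
{
	const char *p = line;
	char *e;
	unsigned long long v;

	/* Skip leading whitespace before the GUID. */
	while (isspace((unsigned char)*p))
		p++;

	errno = 0;
	v = strtoull(p, &e, 0);	/* base 0 accepts 0x..., octal and decimal */

	/* Same acceptance test as the patch context, plus an overflow check. */
	if (e == p || errno == ERANGE ||
	    (!isspace((unsigned char)*e) && *e != '#' && *e != '\0')) {
		fprintf(stderr, "%s: Parse error in line: %s\n", __func__, line);
		return -1;
	}

	*guid = (uint64_t)v;
	*rest = e;
	return 0;
}
```

A line such as "0x000b8c002ba2 SW_pfs1_leaf4" is accepted, while a line that starts with the name instead of the GUID is rejected with a message naming the line.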

-- 
Bernd Schubert
Q-Leap Networks GmbH
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[ofa-general] ERR 0108: Unknown remote side

2008-04-04 Thread Bernd Schubert

Hello,

opensm-3.2.1 logs some error messages like this:

Apr 04 00:00:08 325114 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0x000b8c002ba2(SW_pfs1_leaf4) port 13. Adding to light sweep sampling list
Apr 04 00:00:08 325126 [4580A960] 0x01 - Directed Path Dump of 3 hop path:
Path = 0,1,14,13


From ibnetdiscover output I see port 13 of this switch is a switch interconnect
(sorry, I don't know what the correct name/identifier is for switches within
switches):

[13]S-000b8c002bfa[13]# SW_pfs1_inter7 lid 263 4xSDR


Apr 04 00:00:08 325219 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0x000b8c002bf9(SW_pfs1_inter6) port 9. Adding to light sweep sampling list
Apr 04 00:00:08 325234 [4580A960] 0x01 - Directed Path Dump of 2 hop path:
Path = 0,1,18

This is again an interconnection:

[9] S-000b8c002b9e[15]# SW_pfs1_leaf1 lid 177 4xDDR


Apr 04 00:00:08 325288 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0x000b8c002bfa(SW_pfs1_inter7) port 13. Adding to light sweep sampling list
Apr 04 00:00:08 325301 [4580A960] 0x01 - Directed Path Dump of 2 hop path:
Path = 0,1,14


And again an interconnection:

[13]S-000b8c002ba2[13]# SW_pfs1_leaf4 lid 182 4xDDR


All the other interconnections seem to be fine. 


Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH

[ofa-general] Re: EMM: disable other notifiers before register and unregister

2008-04-04 Thread Andrea Arcangeli

On Thu, Apr 03, 2008 at 12:20:41PM -0700, Christoph Lameter wrote:
> On Thu, 3 Apr 2008, Andrea Arcangeli wrote:
>
> > My attempt to fix this once and for all is to walk all vmas of the
> > mm inside mmu_notifier_register and take all anon_vma locks and
> > i_mmap_locks in virtual address order in a row. It's ok to take those
> > inside the mmap_sem. Supposedly if anybody will ever take a double
> > lock it'll do in order too. Then I can dump all the other locking and
>
> What about concurrent mmu_notifier registrations from two mm_structs
> that have shared mappings? Isn't there a potential deadlock situation?

No, the ordering of the locks avoids that. Here is a snippet.

/*
 * This operation locks against the VM for all pte/vma/mm related
 * operations that could ever happen on a certain mm. This includes
 * vmtruncate, try_to_unmap, and all page faults. The holder
 * must not hold any mm related lock. A single task can't take more
 * than one mm lock in a row or it would deadlock.
 */

So you can't do:

   mm_lock(mm1);
   mm_lock(mm2);

But if two different tasks run mm_lock, everything is ok. Each task
in the system can lock at most one mm at a time.
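
The ordering argument can be illustrated in userspace. This is only a sketch of the general technique (acquire a set of locks in ascending address order so that two tasks locking overlapping sets cannot deadlock), using pthread mutexes as stand-ins for the anon_vma locks and i_mmap_locks. The function names are hypothetical and this is not the kernel mm_lock code.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Order lock pointers by address so every caller sorts the same way. */
static int cmp_lock_addr(const void *a, const void *b)
{
	uintptr_t x = (uintptr_t)*(pthread_mutex_t *const *)a;
	uintptr_t y = (uintptr_t)*(pthread_mutex_t *const *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the set in place, then take each distinct lock once, lowest
 * address first. Duplicates (e.g. a lock shared by several mappings)
 * end up adjacent after sorting and are skipped.
 */
static void lock_all_ordered(pthread_mutex_t **locks, size_t n)
{
	size_t i;

	qsort(locks, n, sizeof(*locks), cmp_lock_addr);
	for (i = 0; i < n; i++) {
		if (i > 0 && locks[i] == locks[i - 1])
			continue;
		pthread_mutex_lock(locks[i]);
	}
}

/* Release in reverse order; expects the array as sorted by lock_all_ordered(). */
static void unlock_all_ordered(pthread_mutex_t **locks, size_t n)
{
	size_t i;

	for (i = n; i-- > 0; ) {
		if (i > 0 && locks[i] == locks[i - 1])
			continue;
		pthread_mutex_unlock(locks[i]);
	}
}
```

Because every task acquires in the same global order, a task holding a higher-addressed lock never waits for a lower-addressed one, which is what rules out the cycle Christoph asks about.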

> Well good luck. Hopefully we will get to something that works.

Looks good so far but I didn't finish it yet.


[ofa-general] can not join due to rate:2.5Gbps group:10Gbps?

2008-04-04 Thread Brian J. Murrell

I'm trying to get a few nodes here connected with IPoIB.  On the first
node I have tried with, after ifconfig'ing the interface into the
network with other IPoIB nodes I cannot seem to ping any other nodes.  I
ran ibdiagnet and got a /tmp/ibdiagnet.pkey file with the following
contents:

sata14:/ # cat /tmp/ibdiagnet.pkey
GROUP PKey:0x7fff Hosts:4
   Full sata15/P2 lid=0x0004 guid=0x00066a01a363 dev=23108
   Full sata14/P2 lid=0x0006 guid=0x00066a01a2bf dev=23108
   Full sata23/P2 lid=0x0008 guid=0x00066a01a2fe dev=23108
   Full sata16/P2 lid=0x0007 guid=0x00066a01a2c1 dev=23108

When I run an ibdiagpath -l 0x0004 I get the following:

-W- Topology file is not specified.
Reports regarding cluster links will use direct routes.
-I- Using port 2 as the local port.

-I---
-I- Traversing the path from local to destination
-I---
-I- From: lid=0x0006 guid=0x00066a01a2bf dev=23108 sata14/P2
-I- To:   lid=0x0001 guid=0x00066a00c8000180 dev=5 Port=1

-I- From: lid=0x0001 guid=0x00066a00c8000180 dev=5 Port=2
-I- To:   lid=0x0004 guid=0x00066a01a363 dev=23108 sata15/P2


-I---
-I- PM Counters Info
-I---
-I- No illegal PM counters values were found

-I---
-I- Path Partitions Report
-I---
-I- Source sata14/P2 lid=0x0006 guid=0x00066a01a2bf dev=23108 Port 2 PKeys:0x
-I- Destination sata15 lid=0x0004 guid=0x00066a01a363 dev=23108 PKeys:0x
-I- Path shared PKeys: 0x

-I---
-I- IPoIB Path Check
-I---
-I- Subnet: IPv4 PKey:0x7fff QKey:0x MTU:2048Byte rate:10Gbps SL:0x00
-W- Port sata14/P2 lid=0x0006 guid=0x00066a01a2bf dev=23108 can not join due to rate:2.5Gbps  group:10Gbps
-W- Port sata15/P2 lid=0x0004 guid=0x00066a01a363 dev=23108 can not join due to rate:2.5Gbps  group:10Gbps
-E- No IPoIB Subnets found on Path! Nodes can not communicate via IPoIB!

-I---
-I- QoS on Path Check
-I---
-W- Blocked VLs:4 5 at node:sata14 lid=0x0006 guid=0x00066a01a2bf dev=23108 port:2
-W- Blocked VLs:4 5 at node: lid=0x0001 guid=0x00066a00c8000180 dev=5 port:2
-I- The following SLs can be used:0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
 
-I- Done. Run time was 0 seconds.

That IPoIB Path Check looks a bit alarming.

Anyone have any suggestions?

b.





[ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?

2008-04-04 Thread Steve Wise




Or Gerlitz wrote:
> On Thu, Apr 3, 2008 at 6:17 PM, Steve Wise [EMAIL PROTECTED] wrote:
>> I think RDS might be getting confused because the 10GbE rnic shows up as a
>> dumb NIC hooked into the native TCP stack -and- an rdma device.
>>
>> Jon Mason will be working to enable RDS soon on the chelsio device. He'll
>> feed back the changes needed, if any, to RDS.  Stay tuned.
>
> Steve,
>
> I understand that similar work has been done, at least to some extent,
> with Open MPI, and I will be very happy to hear the lessons learned. Did
> you manage to have the same (say point to point) Open MPI transport
> design/code work over rdma-cm over both IB and iWARP?

Definitely.  We're running over rdma-cm over mthca and cxgb3 on 2 nodes
today.  8 nodes over cxgb3.  We're working out the details now.

> Can someone from OGC or Chelsio drive a BOF on that in Sonoma?
>
> If not, can some notes be sent to the list? I say let's learn from what
> you did so far...

We won't be in Sonoma, but perhaps Jon can email some info to the list
on what we've done to date for Open MPI.


Steve.

Re: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?

2008-04-04 Thread Hal Rosenstock

On Fri, 2008-04-04 at 10:36 -0400, Brian J. Murrell wrote:
> [...]
> -I- Subnet: IPv4 PKey:0x7fff QKey:0x MTU:2048Byte rate:10Gbps SL:0x00
> -W- Port sata14/P2 lid=0x0006 guid=0x00066a01a2bf dev=23108 can not join due to rate:2.5Gbps  group:10Gbps
> -W- Port sata15/P2 lid=0x0004 guid=0x00066a01a363 dev=23108 can not join due to rate:2.5Gbps  group:10Gbps
> -E- No IPoIB Subnets found on Path! Nodes can not communicate via IPoIB!
> [...]
> That IPoIB Path Check looks a bit alarming.
>
> Anyone have any suggestions?

Looks like you have a mixed rate set of ports so you need to configure
the group to 2.5 Gbps. What SM are you using ?

-- Hal


Re: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?

2008-04-04 Thread Brian J. Murrell

On Fri, 2008-04-04 at 07:55 -0700, Hal Rosenstock wrote:
 
> Looks like you have a mixed rate set of ports so you need to configure
> the group to 2.5 Gbps.

I'm a bit green with I/B, so please bear with me if you can.  I do
understand that there can be mixed rates depending on hardware.  But the
hardware guys assure me the cards in these machines should be able to
do 10Gbps.  Maybe they are wrong.  The card is listing as:

06:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)

> What SM are you using ?

That's a good question.  I suspect it's running on the switch.  I don't
know any details on the switch (yet) though.  I will need to engage the
hardware folks to determine this.  I did get an error when I ran
ibdiagnet about more than 1 master SM running when I started opensmd on
one of the nodes, and none of the other nodes are running an SM, so that
only leaves the switch.

In my limited exposure to IB, running the SM on the switch has always
yielded bad results.  I will see if I can get them to disable it.

b.




RE: [ofa-general] Re: [ewg] OFED March 24 meeting summary on OFED 1.4 plans

2008-04-04 Thread Tang, Changqing

What I mean by "claim to support" is having more people test this configuration.

--CQ

> -----Original Message-----
> From: Or Gerlitz [mailto:[EMAIL PROTECTED]
> Sent: Thursday, April 03, 2008 11:18 PM
> To: Tang, Changqing
> Cc: general@lists.openfabrics.org; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [ewg] OFED March 24 meeting summary on OFED 1.4 plans
>
> On Thu, Apr 3, 2008 at 5:40 PM, Tang, Changqing [EMAIL PROTECTED] wrote:
>
> > The problem is, from MPI side, (and by default), we don't know which
> > port is on which fabric, since the subnet prefix is the same. We rely
> > on system admin to config two different subnet prefixes for HP-MPI to work.
> >
> > No vendor has claimed to support this.
>
> CQ, not supporting a different subnet prefix per IB subnet is
> against IB nature, I don't think there should be any problem
> to configure a different prefix at each open SM instance and
> the Linux host stack would work perfectly under this config.
> If you are aware of any problem in the opensm and/or the
> host stack please let the community know and the maintainers
> will fix it.
>
> Or.


RE: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?

2008-04-04 Thread Todd Rimmer

> From: Hal Rosenstock
> Sent: Friday, April 04, 2008 11:08 AM
> To: Brian J. Murrell
> Cc: general@lists.openfabrics.org
> Subject: Re: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?
>
> On Fri, 2008-04-04 at 11:05 -0400, Brian J. Murrell wrote:
> > On Fri, 2008-04-04 at 07:55 -0700, Hal Rosenstock wrote:
> >
> > > Looks like you have a mixed rate set of ports so you need to
> > > configure the group to 2.5 Gbps.
> >
> > I'm a bit green with I/B, so please bear with me if you can.  I do
> > understand that there can be mixed rates depending on hardware.  But
> > the hardware guys assure me the cards in these machines should be able
> > to do 10Gbps.  Maybe they are wrong.  The card is listing as:
> >
> > 06:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)

I would not recommend reconfiguring your SM for this situation.
Instead, you most likely have a bad cable or possibly a bad HCA or
switch port.  All IB products shipped within the last 6 years support
10g, so the fact your system has negotiated to 2.5g indicates a problem
with the link.

Bad or poorly connected cables are the typical cause.

Todd Rimmer
Chief Architect 
QLogic System Interconnect Group
Voice: 610-233-4852 Fax: 610-233-4777
[EMAIL PROTECTED]  www.QLogic.com


RE: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?

2008-04-04 Thread Hal Rosenstock

On Fri, 2008-04-04 at 10:14 -0500, Todd Rimmer wrote:
> [...]
> I would not recommend reconfiguring your SM for this situation.
> Instead, you most likely have a bad cable or possibly a bad HCA or
> switch port.  All IB products shipped within the last 6 years support
> 10g, so the fact your system has negotiated to 2.5g indicates a problem
> with the link.
>
> Bad or poorly connected cables are the typical cause.

Yes, this seems right; I misread this as the DDR/SDR issue. I would
doubt he has any 1x hardware.

-- Hal
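
The arithmetic behind the 1x remark can be written down as a small sketch. An InfiniBand link's signalling rate is the lane count times the per-lane rate (2.5 Gbps for SDR, 5 Gbps for DDR, 10 Gbps for QDR), so a 4x SDR link runs at 10 Gbps and a link that trained down to 1x SDR runs at 2.5 Gbps. The names below are illustrative, and the join rule (port rate must be at least the group rate) is an assumption taken from what the ibdiagpath warning reports, not from the SM source.

```c
/* Per-lane signalling rate in tenths of a Gbps, to stay in integers. */
enum lane_speed {
	LANE_SDR = 25,	/* 2.5 Gbps */
	LANE_DDR = 50,	/* 5 Gbps */
	LANE_QDR = 100	/* 10 Gbps */
};

/* Link rate in tenths of a Gbps for a given width (1, 4, 8 or 12 lanes). */
static int link_rate_tenths(int width, enum lane_speed speed)
{
	return width * (int)speed;
}

/* Assumed rule: a port can join only if it is at least as fast as the group. */
static int can_join_group(int port_rate_tenths, int group_rate_tenths)
{
	return port_rate_tenths >= group_rate_tenths;
}
```

With these definitions a 1x SDR port (25) fails the check against a 10 Gbps group (100), which matches the "rate:2.5Gbps group:10Gbps" warning in the report.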


RE: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?

2008-04-04 Thread Brian J. Murrell

On Fri, 2008-04-04 at 10:14 -0500, Todd Rimmer wrote:
> I would not recommend reconfiguring your SM for this situation.

Indeed, if what you say below pans out, I'd rather not.

> Instead, you most likely have a bad cable or possibly a bad HCA or
> switch port.  All IB products shipped within the last 6 years support
> 10g, so the fact your system has negotiated to 2.5g indicates a problem
> with the link.

OK.  I will investigate this.  Is there any more direct method of
determining what rate an HCA has negotiated than using the ibdiagpath
-l $nid mechanism that I have been using?  It seems like a kind of
round-about method of getting that information.

> Bad or poorly connected cables are the typical cause.

I will have the hardware guys take another look at that.

Thanx for all the pointers!

b.




RE: [ofa-general] Re: [ewg] OFED March 24 meeting summary on OFED 1.4 plans

2008-04-04 Thread Tang, Changqing

> > for example, in MPI, process A know the HCA guid on another node.
> > After running for some time, the switch is restarted for
> > some reason, and the whole fabric is re-configured.
>
> CQ,
>
> If by the whole fabric is re-configured you refer to a case
> where a subnet prefix changes while a job runs and a process
> is detached/reattached to the job  so now you want to adopt
> your design to handle it, is over engineering, why you want
> to do that?


I am concerned about the port LID changing. It is always best if a process can
figure out the info it needs by itself; an SA query is the right way and is in
the IB spec.

While it is possible to let processes exchange information (port LID) again,
there are difficulties: in the middle of a long job run, it is hard to get two
processes to coordinate such an information exchange, and it requires a second
channel to do so. If the second channel is IPoIB, it is broken as well, and we
need to re-establish it.

I am just asking for the SA functionality. If it is not possible, we have to use
a very complicated way to let HP-MPI survive a network failure.


--CQ



> Or.


RE: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?

2008-04-04 Thread Hal Rosenstock

On Fri, 2008-04-04 at 11:25 -0400, Brian J. Murrell wrote:
> On Fri, 2008-04-04 at 10:14 -0500, Todd Rimmer wrote:
> > I would not recommend reconfiguring your SM for this situation.
>
> Indeed, if what you say below pans out, I'd rather not.
>
> > Instead, you most likely have a bad cable or possibly a bad HCA or
> > switch port.  All IB products shipped within the last 6 years support
> > 10g, so the fact your system has negotiated to 2.5g indicates a problem
> > with the link.
>
> OK.  I will investigate this.  Is there any more direct method of
> determining what rate an HCA has negotiated than using the ibdiagpath
> -l $nid mechanism that I have been using?  It seems like a kind of
> round-about method of getting that information.

Try ibcheckwidth for this particular problem

> > Bad or poorly connected cables are the typical cause.
>
> I will have the hardware guys take another look at that.
>
> Thanx for all the pointers!
 

Re: [ofa-general] linux-next: infiniband build failure

2008-04-04 Thread Roland Dreier

> drivers/infiniband/hw/ehca/ehca_reqs.c: In function 'ehca_write_swqe':
> drivers/infiniband/hw/ehca/ehca_reqs.c:191: error: 'const struct ib_send_wr' has no member named 'imm_data'

Oops, thanks, I forgot to run my cross-compile (and ehca is ppc only).

Anyway, your fix is correct and I rolled it into my patch.

Thanks!

Re: [ofa-general] [PATCH/RFC 2/2] RDMA/amso1100: Add support for send with invalidate work requests

2008-04-04 Thread Talpey, Thomas

At 08:52 PM 4/3/2008, Roland Dreier wrote:
> But does this code start working if we add the two patches I posted?  I
> don't understand how you could do anything useful with the current state
> of things plus send w/inval for amso1100.

Does send w/inv actually work end-to-end on the Ammasso? Who's testing it?
Just wondering.

Tom.


Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?

2008-04-04 Thread Roland Dreier

> If not, can some notes be sent to the list? I say let's learn from what
> you did so far...

In my experience, getting code to work over both IB and iWARP isn't that
hard.  The main points are:

 - Use the RDMA CM for connection establishment (duh)
 - Memory regions used to receive RDMA read responses must have remote
   write permission (since in the iWARP protocol, RDMA read responses
   are basically the same as incoming RDMA write requests)
 - Active side of the connection must do the first operation
 - Don't use IB-specific features (atomics, immediate data)
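
The memory-region point can be sketched as a flag-selection helper. The constants below are local stand-ins that mirror the bit layout of the libibverbs access flags (local write, remote write, remote read) so the example compiles without the verbs headers; the enum and function names are hypothetical. In real code the resulting mask would be passed as the access argument of ibv_reg_mr().

```c
/* Stand-ins for the libibverbs access bits, defined locally for the sketch. */
enum sketch_access {
	ACC_LOCAL_WRITE  = 1 << 0,
	ACC_REMOTE_WRITE = 1 << 1,
	ACC_REMOTE_READ  = 1 << 2
};

enum sketch_transport {
	TRANSPORT_IB,
	TRANSPORT_IWARP
};

/*
 * Access mask for a buffer that receives RDMA read responses. The local
 * adapter always needs to write into it; on iWARP the responses arrive
 * much like incoming RDMA writes, so remote write is needed as well.
 */
static unsigned int read_sink_access(enum sketch_transport t)
{
	unsigned int flags = ACC_LOCAL_WRITE;

	if (t == TRANSPORT_IWARP)
		flags |= ACC_REMOTE_WRITE;
	return flags;
}
```

Code that has to run unchanged on both transports could also set the remote write bit unconditionally, at the cost of granting more access than IB strictly needs.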

 - R.

[ofa-general] Re: linux-next: infiniband build failure

2008-04-04 Thread Roland Dreier

> Roland wanted the ib patch to go through my tree, and I figure we will
> work out these issues during the 2 week merge window.

Actually I said I was fine with whatever you wanted to do :)

Given that the new device support for ipath seems to cause problems for
ib-convert-struct-class_device-to-struct-device.patch, it seems it might
be simpler for me to carry that in my tree.  If someone sends me the
latest patch I'll be happy to merge it in (and do the fixups for the
ipath changes).

Then the final struct class_device removal just needs to be merged late
-- I'll send my tree to Linus to pull in the first day or two of the
merge window so I shouldn't be a problem.

Stephen, Greg, I really have the simplest job here managing my tree,
compared to you two guys, so as before just let me know how you want to
handle this ;)

 - R.


RE: [ofa-general] can not join due to rate:2.5Gbps group:10Gbps?

2008-04-04 Thread Brian J. Murrell

On Fri, 2008-04-04 at 08:29 -0700, Hal Rosenstock wrote:
 
> Try ibcheckwidth for this particular problem

Well, seems I solved the problem after finding the ibstatus command.

Seems the hardware guys plugged port 2 into the switch because port 1 of
one of the HCAs in one of the machines is broken.

Thanx for all of the help!

b.
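
For reading the negotiated rate programmatically, the following sketch parses a rate string of the form "10 Gb/sec (4X)". It assumes the per-port sysfs file /sys/class/infiniband/<device>/ports/<port>/rate and that text format; both are assumptions about the local system rather than anything stated in this thread, and the helper names are hypothetical.

```c
#include <stdio.h>

struct port_rate {
	double gbps;	/* signalling rate, e.g. 2.5 or 10 */
	int width;	/* lane count, e.g. 1 or 4 */
};

/* Parse a line such as "2.5 Gb/sec (1X)"; returns 0 on success, -1 otherwise. */
static int parse_rate_line(const char *s, struct port_rate *out)
{
	if (sscanf(s, "%lf Gb/sec (%dX", &out->gbps, &out->width) != 2)
		return -1;
	return 0;
}

/* Read the first line of the given rate file and parse it. */
static int read_port_rate(const char *path, struct port_rate *out)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_rate_line(buf, out);
}
```

A caller would pass the rate file of the port in question, for example the second port of the first device, and compare the returned width against the expected 4.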




Re: [ofa-general] ERR 0108: Unknown remote side

2008-04-04 Thread Hal Rosenstock

On Fri, 2008-04-04 at 11:47 +0200, Bernd Schubert wrote:
> opensm-3.2.1 logs some error messages like this:
>
> Apr 04 00:00:08 325114 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for node 0x000b8c002ba2(SW_pfs1_leaf4) port 13. Adding to light sweep sampling list
> [...]
> All the other interconnections seem to be fine.

Any idea if OpenSM 3.1.10 has the same issue as 3.2.1 ?

Is this some large Flextronics switch ?

-- Hal


[ofa-general] [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr

2008-04-04 Thread Tom Tucker


AMSO1100: Add check for NULL reply_msg in c2_intr

This is a checker-found bug posted to bugzilla.kernel.org (7478). Upon
inspection I also found a place where we could attempt to kmem_cache_free
a null pointer.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---

Roland,

I don't think anyone has ever hit this bug, so it is a low priority in my view.
I also noticed that if we refactored vq_wait_for_reply, we could consolidate the
common

	if (!reply) {
		err = -ENOMEM;
		goto bail;
	}

construct by guaranteeing that reply is non-null whenever vq_wait_for_reply
returns without an error. This patch, however, is much smaller. What do you think?
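
The refactoring idea can be sketched in isolation. The struct and function below are hypothetical stand-ins, not the amso1100 driver's types or the real vq_wait_for_reply signature; they only show the contract being proposed: a zero return implies a non-null reply, so callers no longer repeat the NULL check.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's request object. */
struct vq_req_sketch {
	int wait_err;		/* result of waiting for the adapter */
	void *reply_msg;	/* reply buffer, may be NULL */
};

/*
 * Returns 0 only when *reply has been set to a non-null buffer.
 * A missing reply is folded into -ENOMEM here, once, instead of
 * being checked at every call site.
 */
static int wait_for_reply_checked(struct vq_req_sketch *req, void **reply)
{
	if (req->wait_err)
		return req->wait_err;
	if (!req->reply_msg)
		return -ENOMEM;
	*reply = req->reply_msg;
	return 0;
}
```

Each caller then reduces to a single error test on the return value before using the reply.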

 drivers/infiniband/hw/amso1100/c2_cq.c   |    4 ++--
 drivers/infiniband/hw/amso1100/c2_intr.c |    6 +++++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index d2b3366..bb17cce 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -422,8 +422,8 @@ void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq)
 		goto bail1;
 
 	reply = (struct c2wr_cq_destroy_rep *) (unsigned long) (vq_req->reply_msg);
-
-	vq_repbuf_free(c2dev, reply);
+	if (reply)
+		vq_repbuf_free(c2dev, reply);
    bail1:
 	vq_req_free(c2dev, vq_req);
    bail0:
diff --git a/drivers/infiniband/hw/amso1100/c2_intr.c b/drivers/infiniband/hw/amso1100/c2_intr.c
index 0d0bc33..3b50954 100644
--- a/drivers/infiniband/hw/amso1100/c2_intr.c
+++ b/drivers/infiniband/hw/amso1100/c2_intr.c
@@ -174,7 +174,11 @@ static void handle_vq(struct c2_dev *c2dev, u32 mq_index)
 		return;
 	}
 
-	err = c2_errno(reply_msg);
+	if (reply_msg)
+		err = c2_errno(reply_msg);
+	else
+		err = -ENOMEM;
+
 	if (!err) switch (req->event) {
 	case IW_CM_EVENT_ESTABLISHED:
 		c2_set_qp_state(req->qp,


Re: [ofa-general] error with ibv_poll_cq() call

2008-04-04 Thread Roland Dreier

OK, I committed my change to libmlx4 and the equivalent thing for libmthca.

 - R.

[ofa-general] Re: [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr

2008-04-04 Thread Roland Dreier

> I don't think anyone has ever hit this bug, so it is a low priority in my
> view. I also noticed that if we refactored vq_wait_for_reply that we could
> combine a common
>
> if (!reply) {
>         err = -ENOMEM;
>         goto bail;
> }
>
> construct by guaranteeing that reply is non-null if vq_wait_for_reply
> returns without an error. This patch, however, is much smaller. What do
> you think?

Well, now is a good time to merge either version of the fix.  Would be
nice to kill off one of the Coverity issues so I'm happy to take this.

It's up to you how much effort you want to spend on this... the
refactoring sounds nice but I think we're OK without it.

 - R.

[ofa-general] where to report bugs?

2008-04-04 Thread Brian J. Murrell

I'm wondering what the official mechanism is to report bugs?  Just about
anything I'm going to find is likely to be limited to build and
installation bugs, like this one...

In infiniband-diags-1.3.6/Makefile.am we have the line:

INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband

This is assuming that other OFED packages have been installed in the
general system $PREFIX, usually /usr as $includedir should
be /usr/include.

But in particular, I have installed the opensm{,-devel} in an alternate
location (i.e. PREFIX) and the infiniband-diags build fails with:

if gcc -DHAVE_CONFIG_H -I. -I. -I. -I./include -I/usr/include 
-I/usr/include/infiniband  -I/home/brian/ofed_1.3_integration/tree/usr/include 
-Wall  -I/home/brian/ofed_1.3_integration/tree/usr/include -O2 -g 
-fmessage-length=0 -D_FORTIFY_SOURCE=2 -MT src_ibnetdiscover-ibnetdiscover.o 
-MD -MP -MF .deps/src_ibnetdiscover-ibnetdiscover.Tpo -c -o 
src_ibnetdiscover-ibnetdiscover.o `test -f 'src/ibnetdiscover.c' || echo 
'./'`src/ibnetdiscover.c; \
then mv -f .deps/src_ibnetdiscover-ibnetdiscover.Tpo 
.deps/src_ibnetdiscover-ibnetdiscover.Po; else rm -f 
.deps/src_ibnetdiscover-ibnetdiscover.Tpo; exit 1; fi
In file included from src/ibnetdiscover.c:53:
/home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:39:29: error: complib/cl_qmap.h: No such file or directory
In file included from src/ibnetdiscover.c:53:
/home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:45: error: expected specifier-qualifier-list before ‘cl_map_item_t’
/home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:51: error: expected specifier-qualifier-list before ‘cl_qmap_t’
make[1]: *** [src_ibnetdiscover-ibnetdiscover.o] Error 1
make[1]: Leaving directory `/home/brian/rpm/BUILD/infiniband-diags-1.3.6'

On my system, with opensm-devel (and all other OFED RPMs) installed in
an alternate PREFIX, the above list of include paths should be
s#/usr/include/infiniband#PREFIX/include/infiniband#.

It seems infiniband-diags probably needs the same --with-osm
switch that ibutils has.

b.




Re: [ofa-general] InfiniBand/iWARP/RDMA merge plans for 2.6.26 (what's in infiniband.git)

2008-04-04 Thread Richard Frank


> We want to add send with invalidate & mask compare and swap.
> Eli will be able to send the patches next week and since they are
> small I think they can be in for 2.6.26

We are very interested in these new operations and are moving in the 
direction of tightly integrating RDMA along with atomics (if available) 
into Oracle.  We plan on testing some early prototypes of these in 
the next few months.


Send with invalidate is an exact match for our current RDS V3 rdma 
driver - and should be more efficient than the current background 
syncing of the tpt  to ensure keys are invalidated.


We intend to expose the atomics via the RDS driver, along with simple 
low-level RDMA operations, to Oracle's internal clients. If Oracle is 
running over a transport which exports atomics and RDMA, Oracle will 
see a dramatic performance boost for several database operations.


Roland Dreier wrote:

  We want to add send with invalidate & mask compare and swap.
  Eli will be able to send the patches next week and since they are
  small I think they can be in for 2.6.26

Send with invalidate should be OK.  Let's see about the masked atomics
stuff -- we have a ton of new verbs and I think we might want to slow
down and make sure it all makes sense.

  What about the split CQ for UD mode? It's improved the IPoIB
  performance for small messages significantly.

Oh yeah... I'll try to get that in too.

  mlx4- we plan to send patches for the low level driver only to enable
  mlx4_en. These only affect our low level driver.

No problem in principle, let's see the actual patches.

  I think we should try to push for XRC in 2.6.26 since there are
  already MPI implementation that use it and this ties them to use OFED
  only.
  Also this feature is stable and now being defined in IBTA
  Not taking it causing changes between OFED and the kernel and your
  libibverbs and we wish to avoid such gaps.
  Is there any thing we can do to help and make it into 2.6.26?

I don't have a good feeling that the user-kernel interface is well
thought out, so I want to consider XRC + ehca LL stuff + new iWARP verbs
and make sure we have something that makes sense for the future.

 - R.

[ofa-general] Re: [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr

2008-04-04 Thread Tom Tucker


On Fri, 2008-04-04 at 12:22 -0700, Roland Dreier wrote:
> > I don't think anyone has ever hit this bug, so it is a low priority in my
> > view. I also noticed that if we refactored vq_wait_for_reply that we could
> > combine a common
> >
> >     if (!reply) {
> >             err = -ENOMEM;
> >             goto bail;
> >     }
> >
> > construct by guaranteeing that reply is non-null if vq_wait_for_reply
> > returns without an error. This patch, however, is much smaller. What do
> > you think?
>
> Well, now is a good time to merge either version of the fix.  Would be
> nice to kill off one of the Coverity issues so I'm happy to take this.
>
> It's up to you how much effort you want to spend on this... the
> refactoring sounds nice but I think we're OK without it.

I'm up to my eyeballs right now. If it's ok with you I'd say defer the
refactoring.

>  - R.


Re: [ofa-general] InfiniBand/iWARP/RDMA merge plans for 2.6.26 (what's in infiniband.git)

2008-04-04 Thread Roland Dreier

  We are very interested in these new operations and are moving in the
  direction of tightly integrating RDMA along with atomics (if
  available) into Oracle.  We plan on testing some early prototypes of
  the these in the few months.

And you need the ConnectX-only masked atomics?  Or do the standard IB
atomic operations work for you?  Of course using atomics at all means
that things don't work on iWARP.

  Send with invalidate is an exact match for our current RDS V3 rdma
  driver - and should be more efficient than the current background
  syncing of the tpt  to ensure keys are invalidated.

How does send with invalidate interact with the current IB FMR stuff?
Seems that you would run into trouble keeping the state of the FMR
straight if the remote side is invalidating them.

Also I would think that send-with-invalidate would be much more
expensive than the current FMR method of batching up the invalidates,
since you don't get to amortize the cost of syncing up all the internal
HCA state.

 - R.

[ofa-general] Re: [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr

2008-04-04 Thread Roland Dreier

  I'm up to my eyeballs right now. If it's ok with you I'd say defer the
  refactoring.

No problem, I'll queue this up and if you ever get time to work on
amso1100 you can send the refactoring.

But are you working on a pmtu fix?

 - R.

[ofa-general] Re: [PATCH 7/10] IB/ipoib: Add ethtool support

2008-04-04 Thread Roland Dreier

thanks, applied

Re: [ofa-general] where to report bugs?

2008-04-04 Thread Hal Rosenstock

On Fri, 2008-04-04 at 15:24 -0400, Brian J. Murrell wrote:
 I'm wondering what the official mechanism is to report bugs?

http://www.openfabrics.org/bugzilla but that's usually used when email
is insufficient and some issue needs tracking but it's up to you.

-- Hal


[ofa-general] Re: [PATCH 10/10] IB/mlx4: add support for modifying CQ parameters

2008-04-04 Thread Roland Dreier

thanks, I applied 8/10 and 9/10, and changed this one around a bit
before applying it... it seemed cleaner to me not to expose the CQ
context to the mlx4_ib driver.

For CQ resize we can just add a new mlx4_cq_resize() function in
mlx4_core, since the context parameters that matter there are completely
different.  (And there's no need for mlx4_ib to worry about either the
modify moderation or resize cases)
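
The moderation parameters in the patch that follows are written into the CQ context in big-endian byte order via cpu_to_be16(). A user-space sketch of the same idea, using htons() in place of the kernel helper, might look like this. The struct is a hypothetical two-field stand-in, not the real struct mlx4_cq_context layout, and the function name is made up for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-field stand-in for the moderation part of a CQ context. */
struct cq_moderation_sketch {
	uint16_t cq_max_count; /* stored big-endian */
	uint16_t cq_period;    /* stored big-endian */
};

/*
 * Zero the context, then store both parameters in big-endian order,
 * mirroring the memset() + cpu_to_be16() pattern in the patch.
 */
static void cq_moderation_fill(struct cq_moderation_sketch *ctx,
			       uint16_t count, uint16_t period)
{
	assert(ctx != NULL);

	memset(ctx, 0, sizeof *ctx);
	ctx->cq_max_count = htons(count);
	ctx->cq_period    = htons(period);
}
```

Zeroing the whole context first means any field the caller does not set is passed to the device as zero rather than as stale mailbox contents.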

From a1f375e52ce0b39bebaa27adc6d3724816f7e963 Mon Sep 17 00:00:00 2001
From: Eli Cohen [EMAIL PROTECTED]
Date: Mon, 17 Mar 2008 17:24:25 +0200
Subject: [PATCH] IB/mlx4: Add support for modifying CQ moderation parameters

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
 drivers/infiniband/hw/mlx4/cq.c  |8 
 drivers/infiniband/hw/mlx4/main.c|1 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h |1 +
 drivers/net/mlx4/cq.c|   31 +++
 include/linux/mlx4/cmd.h |2 +-
 include/linux/mlx4/cq.h  |3 +++
 6 files changed, 45 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 7d70af7..e4fb64b 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -85,6 +85,14 @@ static struct mlx4_cqe *next_cqe_sw(struct mlx4_ib_cq *cq)
return get_sw_cqe(cq, cq->mcq.cons_index);
 }
 
+int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period)
+{
+   struct mlx4_ib_cq *mcq = to_mcq(cq);
+   struct mlx4_ib_dev *dev = to_mdev(cq->device);
+
+   return mlx4_cq_modify(dev->dev, &mcq->mcq, cq_count, cq_period);
+}
+
 struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector,
struct ib_ucontext *context,
struct ib_udata *udata)
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index e9330a0..76dd45c 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -609,6 +609,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
ibdev->ib_dev.post_send = mlx4_ib_post_send;
ibdev->ib_dev.post_recv = mlx4_ib_post_recv;
ibdev->ib_dev.create_cq = mlx4_ib_create_cq;
+   ibdev->ib_dev.modify_cq = mlx4_ib_modify_cq;
ibdev->ib_dev.destroy_cq= mlx4_ib_destroy_cq;
ibdev->ib_dev.poll_cq   = mlx4_ib_poll_cq;
ibdev->ib_dev.req_notify_cq = mlx4_ib_arm_cq;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 3f8bd0a..ef8ad96 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -254,6 +254,7 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
  struct ib_udata *udata);
 int mlx4_ib_dereg_mr(struct ib_mr *mr);
 
+int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);
 struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector,
struct ib_ucontext *context,
struct ib_udata *udata);
diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c
index d4441fe..00a270b 100644
--- a/drivers/net/mlx4/cq.c
+++ b/drivers/net/mlx4/cq.c
@@ -121,6 +121,13 @@ static int mlx4_SW2HW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox,
MLX4_CMD_TIME_CLASS_A);
 }
 
+static int mlx4_MODIFY_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox,
+int cq_num, u32 opmod)
+{
+   return mlx4_cmd(dev, mailbox->dma, cq_num, opmod, MLX4_CMD_MODIFY_CQ,
+   MLX4_CMD_TIME_CLASS_A);
+}
+
 static int mlx4_HW2SW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox,
 int cq_num)
 {
@@ -129,6 +136,30 @@ static int mlx4_HW2SW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox,
MLX4_CMD_TIME_CLASS_A);
 }
 
+int mlx4_cq_modify(struct mlx4_dev *dev, struct mlx4_cq *cq,
+  u16 count, u16 period)
+{
+   struct mlx4_cmd_mailbox *mailbox;
+   struct mlx4_cq_context *cq_context;
+   int err;
+
+   mailbox = mlx4_alloc_cmd_mailbox(dev);
+   if (IS_ERR(mailbox))
+   return PTR_ERR(mailbox);
+
+   cq_context = mailbox->buf;
+   memset(cq_context, 0, sizeof *cq_context);
+
+   cq_context->cq_max_count = cpu_to_be16(count);
+   cq_context->cq_period= cpu_to_be16(period);
+
+   err = mlx4_MODIFY_CQ(dev, mailbox, cq->cqn, 1);
+
+   mlx4_free_cmd_mailbox(dev, mailbox);
+   return err;
+}
+EXPORT_SYMBOL_GPL(mlx4_cq_modify);
+
 int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt,
  struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq)
 {
diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h
index 7d1eaa9..77323a7

[ofa-general] MVAPICH2 crashes on mixed fabric

2008-04-04 Thread Mike Heinz

Hey, all, I'm not sure if this is a known bug or some sort of limitation
I'm unaware of, but I've been building and testing with the OFED 1.3 GA
release on a small fabric that has a mix of Arbel-based and newer
Connect-X HCAs.
 
What I've discovered is that mvapich and openmpi work fine across the
entire fabric, but mvapich2 crashes when I use a mix of Arbels and
Connect-X. The errors vary depending on the test program but here's an
example:
 
[EMAIL PROTECTED] IMB-3.0]$ mpirun -n 5 ./IMB-MPI1
.
.
.
(output snipped)
.
.
.

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
# ( 3 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         3.51         3.51         3.51         0.00
            1         1000         3.63         3.63         3.63         0.52
            2         1000         3.67         3.67         3.67         1.04
            4         1000         3.64         3.64         3.64         2.09
            8         1000         3.67         3.67         3.67         4.16
           16         1000         3.67         3.67         3.67         8.31
           32         1000         3.74         3.74         3.74        16.32
           64         1000         3.90         3.90         3.90        31.28
          128         1000         4.75         4.75         4.75        51.39
          256         1000         5.21         5.21         5.21        93.79
          512         1000         5.96         5.96         5.96       163.77
         1024         1000         7.88         7.89         7.89       247.54
         2048         1000        11.42        11.42        11.42       342.00
         4096         1000        15.33        15.33        15.33       509.49
         8192         1000        22.19        22.20        22.20       703.83
        16384         1000        34.57        34.57        34.57       903.88
        32768         1000        51.32        51.32        51.32      1217.94
        65536          640        85.80        85.81        85.80      1456.74
       131072          320       155.23       155.24       155.24      1610.40
       262144          160       301.84       301.86       301.85      1656.39
       524288           80       598.62       598.69       598.66      1670.31
      1048576           40      1175.22      1175.30      1175.26      1701.69
      2097152           20      2309.05      2309.05      2309.05      1732.32
      4194304           10      4548.72      4548.98      4548.85      1758.64
[0] Abort: Got FATAL event 3
 at line 796 in file ibv_channel_manager.c
rank 0 in job 1  compute-0-0.local_36049   caused collective abort of
all ranks
  exit status of rank 0: killed by signal 9

If, however, I define my mpdring to contain only Connect-X systems OR
only Arbel systems, IMB-MPI1 runs to completion.
 
Can anyone suggest a workaround or is this a real bug with mvapich2?
 
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania
 

[ofa-general] [PATCH] mmu notifier #v11

2008-04-04 Thread Andrea Arcangeli

This should guarantee that nobody can register when any of the mmu
notifiers is running avoiding all the races including guaranteeing
range_start not to be missed. I'll adapt the other patches to provide
the sleeping-feature on top of this (only needed by XPMEM) soon. KVM
seems to run fine on top of this one.

Andrew can you apply this to -mm?

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1050,6 +1050,9 @@
   unsigned long addr, unsigned long len,
   unsigned long flags, struct page **pages);
 
+extern void mm_lock(struct mm_struct *mm);
+extern void mm_unlock(struct mm_struct *mm);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -225,6 +225,9 @@
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+   struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,175 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_ops {
+   /*
+* Called when nobody can register any more notifier in the mm
+* and after the mn notifier has been disarmed already.
+*/
+   void (*release)(struct mmu_notifier *mn,
+   struct mm_struct *mm);
+
+   /*
+* clear_flush_young is called after the VM is
+* test-and-clearing the young/accessed bitflag in the
+* pte. This way the VM will provide proper aging to the
+* accesses to the page through the secondary MMUs and not
+* only to the ones through the Linux pte.
+*/
+   int (*clear_flush_young)(struct mmu_notifier *mn,
+struct mm_struct *mm,
+unsigned long address);
+
+   /*
+* Before this is invoked any secondary MMU is still ok to
+* read/write to the page previously pointed by the Linux pte
+* because the old page hasn't been freed yet.  If required
+* set_page_dirty has to be called internally to this method.
+*/
+   void (*invalidate_page)(struct mmu_notifier *mn,
+   struct mm_struct *mm,
+   unsigned long address);
+
+   /*
+* invalidate_range_start() and invalidate_range_end() must be
+* paired. Multiple invalidate_range_start/ends may be nested
+* or called concurrently.
+*/
+   void (*invalidate_range_start)(struct mmu_notifier *mn,
+  struct mm_struct *mm,
+  unsigned long start, unsigned long end);
+   void (*invalidate_range_end)(struct mmu_notifier *mn,
+struct mm_struct *mm,
+unsigned long start, unsigned long end);
+};
+
+struct mmu_notifier {
+   struct hlist_node hlist;
+   const struct mmu_notifier_ops *ops;
+};
+
+static inline int mm_has_notifiers(struct mm_struct *mm)
+{
+   return unlikely(!hlist_empty(&mm->mmu_notifier_list));
+}
+
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void __mmu_notifier_release(struct mm_struct *mm);
+extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address);
+extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+ unsigned long address);
+extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_release(mm);
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+ unsigned long address)
+{
+   if (mm_has_notifiers(mm))
+   return

Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?

2008-04-04 Thread Or Gerlitz

On Fri, Apr 4, 2008 at 7:06 PM, Roland Dreier [EMAIL PROTECTED] wrote:
   - Don't use IB-specific features (atomics, immediate data)

And don't use RNRs as a HW-based flow control mechanism. The current RDS
implementation does not have SW-based flow control; instead it applies
some sort of back pressure through SW-based congestion management. I
think that to some extent it relies on RNRs, which don't exist under iWARP.

Or.
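
One common software alternative to relying on RNR NAKs is credit-based flow control, where the sender transmits only while it holds credits that correspond to receive buffers posted by the peer. The sketch below is a generic illustration of that idea and not what RDS implements; every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sender-side state: one credit per receive buffer the peer posted. */
struct credit_state {
	unsigned int credits;
};

/* Start with as many credits as the peer initially posted receive buffers. */
static void credit_init(struct credit_state *s, unsigned int posted)
{
	assert(s != NULL);
	s->credits = posted;
}

/* Consume one credit if available; a false return means "hold the message". */
static bool credit_try_send(struct credit_state *s)
{
	assert(s != NULL);
	if (s->credits == 0)
		return false;
	s->credits--;
	return true;
}

/* Called when the peer reports that it has reposted n receive buffers. */
static void credit_grant(struct credit_state *s, unsigned int n)
{
	assert(s != NULL);
	s->credits += n;
}
```

Because the sender stops at zero credits, a message should never arrive at a queue with no posted receive, which is the situation RNR handling covers in hardware on IB.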

[ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?

2008-04-04 Thread Or Gerlitz

On Fri, Apr 4, 2008 at 5:41 PM, Steve Wise [EMAIL PROTECTED] wrote:
  We won't be in Sonoma, but perhaps Jon can email some info to the list on
 what we've done to-date for open mpi.

This would be very helpful, best if done before Monday so we can
discuss the RDS port there with the maintainer.
Jon - any chance you will be able to send something (even a raw sketch)?

Or.

Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?

2008-04-04 Thread Richard Frank

Hmmm - so what happens with IWARP NIC when no buffer is posted on recv q 
and a message arrives ?



Or Gerlitz wrote:

On Fri, Apr 4, 2008 at 7:06 PM, Roland Dreier [EMAIL PROTECTED] wrote:
  

  - Don't use IB-specific features (atomics, immediate data)



and don't use RNRs as a means for HW based flow control mechanism.
The current RDS implementation
does not have a SW based flow control but rather does some sort of
back pressure through SW based congestion
management.  I think that to some extent it relies on RNRs which don't
exist under iWARP.

Or.

Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?

2008-04-04 Thread Richard Frank

How about a pointer to an iWARP spec, so we can sort out all the 
details and implications for RDS?


Or Gerlitz wrote:

On Fri, Apr 4, 2008 at 5:41 PM, Steve Wise [EMAIL PROTECTED] wrote:
  

 We won't be in Sonoma, but perhaps Jon can email some info to the list on
what we've done to-date for open mpi.



This would be very much helpful, best if done before Monday so we can
discuss there the RDS port with the maintainer.
Jon - any chance you will be able to send something (even raw, sketch)?

Or.

Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?

2008-04-04 Thread Or Gerlitz

On Sat, Apr 5, 2008 at 12:27 AM, Richard Frank [EMAIL PROTECTED] wrote:
 Hmmm - so what happens with IWARP NIC when no buffer is posted on recv q and
 a message arrives ?

I am quite sure the L2 ethernet HW just drops it, but you'd better
verify this with an iWARP HW provider.

Or.

Re: [ofa-general] where to report bugs?

2008-04-04 Thread Ira Weiny

On Fri, 04 Apr 2008 15:24:28 -0400
Brian J. Murrell [EMAIL PROTECTED] wrote:

 I'm wondering what the official mechanism is to report bugs?  Just about
 anything I'm going to find is likely to be limited to build and
 installation bugs, like this one...
 
 In infiniband-diags-1.3.6/Makefile.am we have the line:
 
 INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband
 
 This is assuming that other OFED packages have been installed in the
 general system $PREFIX, usually /usr as $includedir should
 be /usr/include.
 
 But in particular, I have installed the opensm{,-devel} in an alternate
 location (i.e. PREFIX) and the infiniband-diags build fails with:

Are you specifying --prefix on the infiniband-diags configure?

I think that should work.

Ira

 
 if gcc -DHAVE_CONFIG_H -I. -I. -I. -I./include -I/usr/include 
 -I/usr/include/infiniband  
 -I/home/brian/ofed_1.3_integration/tree/usr/include -Wall  
 -I/home/brian/ofed_1.3_integration/tree/usr/include -O2 -g -fmessage-length=0 
 -D_FORTIFY_SOURCE=2 -MT src_ibnetdiscover-ibnetdiscover.o -MD -MP -MF 
 .deps/src_ibnetdiscover-ibnetdiscover.Tpo -c -o 
 src_ibnetdiscover-ibnetdiscover.o `test -f 'src/ibnetdiscover.c' || echo 
 './'`src/ibnetdiscover.c; \
 then mv -f .deps/src_ibnetdiscover-ibnetdiscover.Tpo 
 .deps/src_ibnetdiscover-ibnetdiscover.Po; else rm -f 
 .deps/src_ibnetdiscover-ibnetdiscover.Tpo; exit 1; fi
 In file included from src/ibnetdiscover.c:53:
 /home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:39:29:
  error: complib/cl_qmap.h: No such file or directory
 In file included from src/ibnetdiscover.c:53:
 /home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:45:
  error: expected specifier-qualifier-list before ‘cl_map_item_t’
 /home/brian/ofed_1.3_integration/tree/usr/include/infiniband/complib/cl_nodenamemap.h:51:
  error: expected specifier-qualifier-list before ‘cl_qmap_t’
 make[1]: *** [src_ibnetdiscover-ibnetdiscover.o] Error 1
 make[1]: Leaving directory `/home/brian/rpm/BUILD/infiniband-diags-1.3.6'
 
 On my system, with opensm-devel (and all other OFED RPMs) installed in
 an alternate PREFIX, the above list of include paths should be
 s#/usr/include/infiniband#PREFIX/include/infiniband#.
 
 It seems probably infiniband-diags needs to have the same --with-osm
 switch that ibutils has.
 
 b.
 
 

Re: [ofa-general] where to report bugs?

2008-04-04 Thread Brian J. Murrell

On Fri, 2008-04-04 at 13:31 -0700, Ira Weiny wrote:
 
 Are you specifying --prefix on the infiniband-diags configure?

Ahhh.  That would have the undesired effect of relocating my
infiniband-diags wherever I specify --prefix.  This is not quite what I
want.

The ugly details are about to come out.

The problem is that I am not setting a --prefix when I build any of the
prerequisite packages (i.e. opensm, the libraries it depends on, etc.)
as I want everything to actually have a /usr prefix, however for the
purposes of building this stack from the downloadable package of what's
basically SRPMs, I install the prerequisites into a temporary path.

So I have a dir ./tree/ in which I use rpm2cpio < $rpm | cpio -id to
roll the packages into and then point the various configure scripts to
using various --with-* options.  This method has worked so far for:

SRPMS/libibcommon-1.0.8-1.ofed1.3
SRPMS/libibumad-1.1.7-1.ofed1.3
SRPMS/opensm-3.1.10-1.ofed1.3
SRPMS/ibutils-1.2-1.ofed1.3
SRPMS/libibmad-1.1.6-1.ofed1.3

The overall problem is that I cannot taint my pristine build environment
by going along the normal process of build rpm, install it, build next
rpm, install it, etc., so I have to install prerequisite RPMs into a
sandbox and point subsequent users (in the build process) of it into the
sandbox.

b.




[ofa-general] ofed works on kernels with 64Kbyte pages?

2008-04-04 Thread akepner


I know it's a long shot, but has anyone tried using OFED on
a kernel with 64Kbyte pages?

SGI would like to support that, but I've gotten reports that
something is not working (e.g., ib_rdma_bw doesn't work on 
an ia64 kernel with 64Kb pages). This is with the mthca driver, 
fwiw.

Unfortunately a conspiracy of h/w prevents me from reproducing
this right now, so I don't have more details. But I'd be very
curious to know if anyone can verify that OFED does/doesn't
work with 64Kbyte pages.

-- 
Arthur


Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?

2008-04-04 Thread Roland Dreier

  How about a pointer to an IWARP spec - so we can sort out all the
  details.../ implications...to RDS.

www.rdmaconsortium.org has most of it... the verbs are at:

http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf

the iWARP RDMA protocol is RFC 5040 et al:

http://www.ietf.org/rfc/rfc5040.txt

(the next few RFCs have lower-level details)

Re: [ofa-general] Re: Has anyone tried running RDS over 10GE / IWARP NICs ?

2008-04-04 Thread Roland Dreier

   Hmmm - so what happens with IWARP NIC when no buffer is posted on recv q 
   and
   a message arrives ?
  
  I am quite sure the L2 ethernet HW just drops it, but you better
  verify this with an iWARP HW provider.

Why would it be dropped at L2?  What I believe will happen is that it
will generate an error at the DDP layer that will probably result in the
connection being closed.  Section 7.1 of RFC 5041 says:

   For non-zero-length Untagged DDP Segments, the DDP Segment MUST be
   validated before Placement by verifying:

[untagged DDP segments are incoming send data, as vs. tagged RDMA
operations]

   2.  The QN and MSN have an associated buffer that allows Placement of
   the payload.

   Implementers' note: DDP implementations SHOULD consider lack of
   an associated buffer as a system fault.  DDP implementations MAY
   try to recover from the system fault using LLP means in a ULP-
   transparent way.  DDP implementations SHOULD NOT permit system
   faults to occur repeatedly or frequently.  If there is not an
   associated buffer, DDP implementations MAY choose to disable the
   stream for the reception and report an error to the ULP at the
   Data Sink.
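
As a rough model of one behavior the quoted note permits, the sketch below treats an untagged arrival with no posted buffer as a fault that disables the stream. It assumes the implementation takes the "disable the stream" option rather than attempting LLP-level recovery; the types and function names are hypothetical and not taken from any iWARP provider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ddp_result {
	DDP_PLACED,       /* payload placed into a posted receive buffer */
	DDP_STREAM_ERROR  /* no buffer: stream disabled, error reported to the ULP */
};

/* Hypothetical receive-side state for one stream. */
struct recv_queue_sketch {
	size_t posted;        /* receive buffers currently posted */
	bool   stream_enabled;
};

/* Model the arrival of one untagged DDP segment (an incoming send). */
static enum ddp_result ddp_untagged_arrival(struct recv_queue_sketch *q)
{
	assert(q != NULL);

	if (!q->stream_enabled || q->posted == 0) {
		q->stream_enabled = false; /* once disabled, it stays disabled */
		return DDP_STREAM_ERROR;
	}
	q->posted--;
	return DDP_PLACED;
}
```

Unlike an IB RNR NAK, nothing in this model lets the sender retry later, which is why running out of posted receives is much more costly here.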


Re: [ofa-general] ofed works on kernels with 64Kbyte pages?

2008-04-04 Thread Roland Dreier

 > I know it's a long shot, but has anyone tried using OFED on
 > a kernel with 64Kbyte pages?
 >
 > SGI would like to support that, but I've gotten reports that
 > something is not working (e.g., ib_rdma_bw doesn't work on
 > an ia64 kernel with 64Kb pages). This is with the mthca driver,
 > fwiw.
 >
 > Unfortunately a conspiracy of h/w prevents me from reproducing
 > this right now, so I don't have more details. But I'd be very
 > curious to know if anyone can verify that OFED does/doesn't
 > work with 64Kbyte pages.

I don't know about OFED, but I've tried various things on 64KB PAGE_SIZE
systems and it seems to work.  It wouldn't surprise me if there are
issues since the drivers and firmware get a lot less testing in such
situations but it should work -- I'd be happy to help debug if anyone
has concrete problems.

 - R.

Re: [ofa-general] where to report bugs?

2008-04-04 Thread Ira Weiny

On Fri, 04 Apr 2008 16:43:07 -0400
Brian J. Murrell [EMAIL PROTECTED] wrote:

> On Fri, 2008-04-04 at 13:31 -0700, Ira Weiny wrote:
> >
> > Are you specifying --prefix on the infiniband-diags configure?
>
> Ahhh.  That would have the undesired effect of relocating my
> infiniband-diags wherever I specify --prefix.  This is not quite what I
> want.
>
> The ugly details are about to come out.
>
> The problem is that I am not setting a --prefix when I build any of the
> prerequisite packages (i.e. opensm, the libraries it depends on, etc.)
> as I want everything to actually have a /usr prefix, however for the
> purposes of building this stack from the downloadable package of what's
> basically SRPMs, I install the prerequisites into a temporary path.
>
> So I have a dir ./tree/ in which I use rpm2cpio < $rpm | cpio -id to
> roll the packages into and then point the various configure scripts to
> using various --with-* options.  This method has worked so far for:
>
> SRPMS/libibcommon-1.0.8-1.ofed1.3
> SRPMS/libibumad-1.1.7-1.ofed1.3
> SRPMS/opensm-3.1.10-1.ofed1.3
> SRPMS/ibutils-1.2-1.ofed1.3
> SRPMS/libibmad-1.1.6-1.ofed1.3
>
> The overall problem is that I cannot taint my pristine build environment
> by going along the normal process of build rpm, install it, build next
> rpm, install it, etc., so I have to install prerequisite RPMs into a
> sandbox and point subsequent users (in the build process) of it into the
> sandbox.
>

So I guess you want something like:

export CPPFLAGS=-Isandbox_dir/include

Before you do the configure and build?

Ira


[ofa-general] Re: [PATCH 17/20] IB/ipath - user mode send DMA

2008-04-04 Thread Roland Dreier

By the way...

 > +int ipath_user_sdma_pkt_sent(const struct ipath_user_sdma_queue *pq,
 > +			      u32 counter)
 > +{
 > +	const u32 scounter = ipath_user_sdma_complete_counter(pq);
 > +	const s32 dcounter = scounter - counter;
 > +
 > +	return dcounter >= 0;
 > +}

I don't see this called anywhere... should I just delete it?
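
For readers wondering why the quoted helper subtracts first and then
compares a signed value against zero: that idiom keeps the comparison
correct when a 32-bit counter wraps around. A minimal user-space sketch
follows; counter_reached() is a hypothetical name, not part of the ipath
driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return nonzero once `completed` has reached or passed `target`,
 * treating both as free-running 32-bit counters that may wrap.
 * Valid as long as the two values are less than 2^31 apart.
 */
int counter_reached(uint32_t completed, uint32_t target)
{
	/* Unsigned subtraction wraps modulo 2^32. */
	const uint32_t diff = completed - target;

	/*
	 * Reinterpreting the difference as signed gives the distance in
	 * [-2^31, 2^31). The conversion is implementation-defined in C,
	 * but behaves as two's complement on common compilers.
	 */
	return (int32_t)diff >= 0;
}
```

A plain `completed >= target` would give the wrong answer just after the
counter wraps, for example with completed = 2 and target = 0xFFFFFFFE.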

Re: [ofa-general] where to report bugs?

2008-04-04 Thread Brian J. Murrell

On Fri, 2008-04-04 at 14:06 -0700, Ira Weiny wrote:
> So I guess you want something like:
>
> export CPPFLAGS=-Isandbox_dir/include

CPPFLAGS or CFLAGS?  I could see it being the former but I used the
latter.

 
> Before you do the configure and build?

That is in effect exactly what I did to deal with this issue.  I just
didn't find it very elegant.  But if that is how the package is meant to
operate, that is fine.  If it were CFLAGS you were promoting the setting
of, I would be a bit more sticky because RPM wants to have the CFLAGS for
its own use:

$ rpm --eval=%configure

  CFLAGS="${CFLAGS:--O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2}" ; export CFLAGS ;
  CXXFLAGS="${CXXFLAGS:--O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2}" ; export CXXFLAGS ;
  FFLAGS="${FFLAGS:--O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2}" ; export FFLAGS ;
  ./configure --host=x86_64-suse-linux --build=x86_64-suse-linux \
--target=x86_64-suse-linux \
--program-prefix= \
...

And while, yes, you can override CFLAGS and the %configure macro will
use it, I'd rather defer the CFLAGS to whatever the vendor has put into
the RPM macros file(s).

b.




[ofa-general] Re: [PATCH 19/20] IB/ipath - add calls to new 7220 code and enable in build

2008-04-04 Thread Roland Dreier

 > +enum ib_rate ipath_mult_to_ib_rate(unsigned mult)
 > +{
 > +	switch (mult) {
 > +	case 8:  return IB_RATE_2_5_GBPS;
 > +	case 4:  return IB_RATE_5_GBPS;
 > +	case 2:  return IB_RATE_10_GBPS;
 > +	case 1:  return IB_RATE_20_GBPS;
 > +	default: return IB_RATE_PORT_CURRENT;
 > +	}
 > +}

Looks suspiciously like a copy of the existing mult_to_ib_rate() except
it handles fewer cases... is there a reason to copy this?

 - R.

[ofa-general] Re: [PATCH 17/20] IB/ipath - user mode send DMA

2008-04-04 Thread Roland Dreier

 > +void ipath_user_sdma_set_complete_counter(struct ipath_user_sdma_queue *pq,
 > +					  u32 c)
 > +{
 > +	pq->sent_counter = c;
 > +}

This is only used in one file... OK to make it static?

[ofa-general] Re: [PATCH 1/1 v1] MLX4: Added resize_cq capability.

2008-04-04 Thread Roland Dreier

Thanks, I applied this with a lot of changes.  Some comments:

   entries  = roundup_pow_of_two(entries + 1);

your patch was corrupted in a very strange way... the context lines had
two spaces instead of one at the beginning.  I just deleted the extra
space by hand.

 > +	err = mlx4_alloc_cq_buf(dev, &cq->resize_buf->buf, entries);
 > +	if (err) {
 > +		spin_lock_irq(&cq->lock);
 > +		kfree(cq->resize_buf);
 > +		cq->resize_buf = NULL;
 > +		spin_unlock_irq(&cq->lock);
 > +		goto out;
 > +	}

 > +err_buf:
 > +	if (cq->resize_buf) {
 > +		if (!ibcq->uobject)
 > +			mlx4_free_cq_buf(dev, &cq->resize_buf->buf,
 > +					 cq->resize_buf->cqe);
 > +
 > +		spin_lock_irq(&cq->lock);
 > +		kfree(cq->resize_buf);
 > +		cq->resize_buf = NULL;
 > +		spin_unlock_irq(&cq->lock);
 > +	}

Why do we need the spinlock in these places?  There's no way for this to
race with mlx4_ib_poll_one() is there, since that should never see the
RESIZE CQE?  (If there is such a race, then we're in trouble even with
the lock, since we're aborting the resize, and the poll code shouldn't
swap the buffers)

Also I got rid of the duplicated code to allocate buffers and get
userspace buffers, so that the allocate and resize paths use the same
code.  And I cleaned up some other stuff.

So please review/test my work to make sure I didn't break your patch...

---
 drivers/infiniband/hw/mlx4/cq.c  |  292 ++
 drivers/infiniband/hw/mlx4/main.c|2 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h |9 +
 drivers/net/mlx4/cq.c|   28 
 include/linux/mlx4/cq.h  |2 +
 5 files changed, 300 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index e4fb64b..3557e7e 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -93,6 +93,74 @@ int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period)
 	return mlx4_cq_modify(dev->dev, &mcq->mcq, cq_count, cq_period);
 }
 
+static int mlx4_ib_alloc_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *buf, int nent)
+{
+	int err;
+
+	err = mlx4_buf_alloc(dev->dev, nent * sizeof(struct mlx4_cqe),
+			     PAGE_SIZE * 2, &buf->buf);
+
+	if (err)
+		goto out;
+
+	err = mlx4_mtt_init(dev->dev, buf->buf.npages, buf->buf.page_shift,
+			    &buf->mtt);
+	if (err)
+		goto err_buf;
+
+	err = mlx4_buf_write_mtt(dev->dev, &buf->mtt, &buf->buf);
+	if (err)
+		goto err_mtt;
+
+	return 0;
+
+err_mtt:
+	mlx4_mtt_cleanup(dev->dev, &buf->mtt);
+
+err_buf:
+	mlx4_buf_free(dev->dev, nent * sizeof(struct mlx4_cqe),
+		      &buf->buf);
+
+out:
+	return err;
+}
+
+static void mlx4_ib_free_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *buf, int cqe)
+{
+	mlx4_buf_free(dev->dev, (cqe + 1) * sizeof(struct mlx4_cqe), &buf->buf);
+}
+}
+
+static int mlx4_ib_get_cq_umem(struct mlx4_ib_dev *dev, struct ib_ucontext *context,
+			       struct mlx4_ib_cq_buf *buf, struct ib_umem **umem,
+			       u64 buf_addr, int cqe)
+{
+	int err;
+
+	*umem = ib_umem_get(context, buf_addr, cqe * sizeof (struct mlx4_cqe),
+			    IB_ACCESS_LOCAL_WRITE);
+	if (IS_ERR(*umem))
+		return PTR_ERR(*umem);
+
+	err = mlx4_mtt_init(dev->dev, ib_umem_page_count(*umem),
+			    ilog2((*umem)->page_size), &buf->mtt);
+	if (err)
+		goto err_buf;
+
+	err = mlx4_ib_umem_write_mtt(dev, &buf->mtt, *umem);
+	if (err)
+		goto err_mtt;
+
+	return 0;
+
+err_mtt:
+	mlx4_mtt_cleanup(dev->dev, &buf->mtt);
+
+err_buf:
+	ib_umem_release(*umem);
+
+	return err;
+}
+
 struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector,
 				struct ib_ucontext *context,
 				struct ib_udata *udata)
@@ -100,7 +168,6 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
 	struct mlx4_ib_dev *dev = to_mdev(ibdev);
 	struct mlx4_ib_cq *cq;
 	struct mlx4_uar *uar;
-	int buf_size;
 	int err;
 
 	if (entries < 1 || entries > dev->dev->caps.max_cqes)
@@ -112,8 +179,10 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
 
 	entries  = roundup_pow_of_two(entries + 1);
 	cq->ibcq.cqe = entries - 1;
-	buf_size = entries * sizeof (struct mlx4_cqe);
+	mutex_init(&cq->resize_mutex);
 	spin_lock_init(&cq->lock);
+	cq->resize_buf = NULL;
+	cq->resize_umem = NULL;
 
if (context) {
struct mlx4_ib_create_cq ucmd;
@@ -123,21 +192,10 @@ struct

Re: [ofa-general] InfiniBand/iWARP/RDMA merge plans for 2.6.26 (what's in infiniband.git)

2008-04-04 Thread Richard Frank


Roland Dreier wrote:

> > We are very interested in these new operations and are moving in the
> > direction of tightly integrating RDMA along with atomics (if
> > available) into Oracle.  We plan on testing some early prototypes of
> > these in the next few months.
>
> And you need the ConnectX-only masked atomics?  Or do the standard IB
> atomic operations work for you?  Of course using atomics at all means
> that things don't work on iWARP.

  

We specifically asked for the masked operations.

Yes, this means Oracle will not get the performance boost of atomics on 
IWARP - but we still get rdma - and that's a real win / benefit for 
Oracle today - and more so over the next few months.



> > Send with invalidate is an exact match for our current RDS V3 rdma
> > driver - and should be more efficient than the current background
> > syncing of the tpt to ensure keys are invalidated.
>
> How does send with invalidate interact with the current IB FMR stuff?
> Seems that you would run into trouble keeping the state of the FMR
> straight if the remote side is invalidating them.

  
The model we implement is based on use-once keys - we issue the key to
the rdma server and want to toss it as soon as the rdma is complete.
Today, we explicitly free the key after the rdma completes and we get a
message from the rdma server - saying rdma is complete. If the key is
auto invalidated by the recv'ing HCA then we do not need to do it in the
driver... which also means we do not need to issue the sync tpts to force
the HCA to update its cache.


At least this is how I think it works - Olaf is the divine source here.


> Also I would think that send-with-invalidate would be much more
> expensive than the current FMR method of batching up the invalidates,
> since you don't get to amortize the cost of syncing up all the internal
> HCA state.

  
This is the one piece we do not know - our plans are to test this and 
see where the trade offs are. We will keep the current design / 
implementation to run over NICs that do not support send-with-invalidate.

 - R.
  


[ofa-general] Re: [PATCH 17/20] IB/ipath - user mode send DMA

2008-04-04 Thread Ralph Campbell

On Fri, 2008-04-04 at 14:12 -0700, Roland Dreier wrote:
> By the way...
>
>  > +int ipath_user_sdma_pkt_sent(const struct ipath_user_sdma_queue *pq,
>  > +			      u32 counter)
>  > +{
>  > +	const u32 scounter = ipath_user_sdma_complete_counter(pq);
>  > +	const s32 dcounter = scounter - counter;
>  > +
>  > +	return dcounter >= 0;
>  > +}
>
> I don't see this called anywhere... should I just delete it?

Yes. You can remove it.


[ofa-general] Re: [PATCH 19/20] IB/ipath - add calls to new 7220 code and enable in build

2008-04-04 Thread Ralph Campbell

On Fri, 2008-04-04 at 14:15 -0700, Roland Dreier wrote:
>  > +enum ib_rate ipath_mult_to_ib_rate(unsigned mult)
>  > +{
>  > +	switch (mult) {
>  > +	case 8:  return IB_RATE_2_5_GBPS;
>  > +	case 4:  return IB_RATE_5_GBPS;
>  > +	case 2:  return IB_RATE_10_GBPS;
>  > +	case 1:  return IB_RATE_20_GBPS;
>  > +	default: return IB_RATE_PORT_CURRENT;
>  > +	}
>  > +}
>
> Looks suspiciously like a copy of the existing mult_to_ib_rate() except
> it handles fewer cases... is there a reason to copy this?
>
>  - R.

It looks similar but the values are reversed. This is converting
the ib_rate enum to a multiplier of the DDR clock rate which is
used as a counter to delay packets. So IB_RATE_2_5_GBPS is 8
times slower than IB_RATE_20_GBPS. The standard functions map
the enum to a multiplier of the slowest rate so
IB_RATE_2_5_GBPS is one. If I used the standard functions, I would
still need a lookup table to map 8->1, 1->8, etc.
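
To make the "reversed" relationship concrete, here is a small user-space
sketch. The enum and both function names are hypothetical stand-ins, not
the kernel's enum ib_rate or its helpers; the sketch assumes only the
four rates that appear in the quoted function.

```c
#include <assert.h>

/* Hypothetical stand-in for the four rates in the quoted function. */
enum rate {
	RATE_UNKNOWN,
	RATE_2_5_GBPS,
	RATE_5_GBPS,
	RATE_10_GBPS,
	RATE_20_GBPS
};

/* Speed convention: multiple of the slowest rate (2.5 Gb/s is 1). */
enum rate speed_mult_to_rate(unsigned mult)
{
	switch (mult) {
	case 1:  return RATE_2_5_GBPS;
	case 2:  return RATE_5_GBPS;
	case 4:  return RATE_10_GBPS;
	case 8:  return RATE_20_GBPS;
	default: return RATE_UNKNOWN;
	}
}

/* Delay convention from the reply: how many times slower than 20 Gb/s. */
enum rate delay_mult_to_rate(unsigned mult)
{
	switch (mult) {
	case 8:  return RATE_2_5_GBPS;
	case 4:  return RATE_5_GBPS;
	case 2:  return RATE_10_GBPS;
	case 1:  return RATE_20_GBPS;
	default: return RATE_UNKNOWN;
	}
}
```

For these four values one multiplier is 8 divided by the other, which is
the 8->1, 1->8 mapping mentioned above.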


[ofa-general] Re: [PATCH 17/20] IB/ipath - user mode send DMA

2008-04-04 Thread Ralph Campbell

On Fri, 2008-04-04 at 14:16 -0700, Roland Dreier wrote:
>  > +void ipath_user_sdma_set_complete_counter(struct ipath_user_sdma_queue *pq,
>  > +					  u32 c)
>  > +{
>  > +	pq->sent_counter = c;
>  > +}
>
> This is only used in one file... OK to make it static?

Yes, thanks.


Re: [ofa-general] ERR 0108: Unknown remote side

2008-04-04 Thread Bernd Schubert

On Fri, Apr 04, 2008 at 10:55:21AM -0700, Hal Rosenstock wrote:
> On Fri, 2008-04-04 at 11:47 +0200, Bernd Schubert wrote:
> > Hello,
> >
> > opensm-3.2.1 logs some error messages like this:
> >
> > Apr 04 00:00:08 325114 [4580A960] 0x01 -> __osm_state_mgr_light_sweep_start:
> > ERR 0108: Unknown remote side for node 0x000b8c002ba2(SW_pfs1_leaf4) port 13.
> > Adding to light sweep sampling list
> > Apr 04 00:00:08 325126 [4580A960] 0x01 -> Directed Path Dump of 3 hop path:
> > Path = 0,1,14,13
> >
> > From ibnetdiscover output I see port13 of this switch is a switch-interconnect
> > (sorry, I don't know what the correct name/identifier for switches within
> > switches):
> >
> > [13] "S-000b8c002bfa"[13]  # "SW_pfs1_inter7" lid 263 4xSDR
> >
> > Apr 04 00:00:08 325219 [4580A960] 0x01 -> __osm_state_mgr_light_sweep_start:
> > ERR 0108: Unknown remote side for node 0x000b8c002bf9(SW_pfs1_inter6) port 9.
> > Adding to light sweep sampling list
> > Apr 04 00:00:08 325234 [4580A960] 0x01 -> Directed Path Dump of 2 hop path:
> > Path = 0,1,18
> >
> > This is again an interconnection:
> >
> > [9] "S-000b8c002b9e"[15]  # "SW_pfs1_leaf1" lid 177 4xDDR
> >
> > Apr 04 00:00:08 325288 [4580A960] 0x01 -> __osm_state_mgr_light_sweep_start:
> > ERR 0108: Unknown remote side for node 0x000b8c002bfa(SW_pfs1_inter7) port 13.
> > Adding to light sweep sampling list
> > Apr 04 00:00:08 325301 [4580A960] 0x01 -> Directed Path Dump of 2 hop path:
> > Path = 0,1,14
> >
> > And again an interconnection:
> >
> > [13] "S-000b8c002ba2"[13]  # "SW_pfs1_leaf4" lid 182 4xDDR
> >
> > All the other interconnections seem to be fine.
>
> Any idea if OpenSM 3.1.10 has the same issue as 3.2.1 ?

Yes, from the log file I see these messages also did happen with opensm-3.1.10.

 
> Is this some large Flextronics switch ?

Again you are right, this is a Flextronics F-X430075, presently with 144 ports.


Thanks,
Bernd



[ofa-general] Re: [PATCH 19/20] IB/ipath - add calls to new 7220 code and enable in build

2008-04-04 Thread Roland Dreier

 > It looks similar but the values are reversed. This is converting
 > the ib_rate enum to a multiplier of the DDR clock rate which is
 > used as a counter to delay packets. So IB_RATE_2_5_GBPS is 8
 > times slower than IB_RATE_20_GBPS. The standard functions map
 > the enum to a multiplier of the slowest rate so
 > IB_RATE_2_5_GBPS is one. If I used the standard functions, I would
 > still need a lookup table to map 8->1, 1->8, etc.

OK, got it thanks

Re: [ofa-general] linux-next: infiniband build failure

2008-04-04 Thread Stephen Rothwell

Hi Roland,

On Fri, 04 Apr 2008 08:47:29 -0700 Roland Dreier [EMAIL PROTECTED] wrote:

>  > drivers/infiniband/hw/ehca/ehca_reqs.c: In function 'ehca_write_swqe':
>  > drivers/infiniband/hw/ehca/ehca_reqs.c:191: error: 'const struct ib_send_wr' has no member named 'imm_data'
>
> Oops, thanks, I forgot to run my cross-compile (and ehca is ppc only).
>
> Anyway, your fix is correct and I rolled it into my patch.

Thanks.
-- 
Cheers,
Stephen Rothwell[EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/



[ofa-general] Re: [PATCH] mthca: update QP state after query QP

2008-04-04 Thread Roland Dreier

thanks, applied

[ofa-general] Re: [PATCH] mlx4: update QP state after query QP

2008-04-04 Thread Roland Dreier

thanks, applied

[ofa-general] Re: [PATCH] mmu notifier #v11

2008-04-04 Thread Christoph Lameter

I am always the guy doing the cleanup after Andrea it seems. Sigh.

Here is the mm_lock/mm_unlock logic separated out for easier review.
Adds some comments. Still objectionable is the multiple ways of
invalidating pages in #v11. Callout now has similar locking to emm.

From: Christoph Lameter [EMAIL PROTECTED]
Subject: mm_lock: Lock a process against reclaim

Provide a way to lock an mm_struct against reclaim (try_to_unmap
etc). This is necessary for the invalidate notifier approaches so
that they can reliably add and remove a notifier.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm.h |   10 
 mm/mmap.c  |   66 +
 2 files changed, 76 insertions(+)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2008-04-02 11:41:47.741678873 -0700
+++ linux-2.6/include/linux/mm.h2008-04-04 15:02:17.660504756 -0700
@@ -1050,6 +1050,16 @@ extern int install_special_mapping(struc
   unsigned long addr, unsigned long len,
   unsigned long flags, struct page **pages);
 
+/*
+ * Locking and unlocking an mm against reclaim.
+ *
+ * mm_lock will take mmap_sem writably (to prevent additional vmas from being
+ * added) and then take all mapping locks of the existing vmas. With that
+ * reclaim is effectively stopped.
+ */
+extern void mm_lock(struct mm_struct *mm);
+extern void mm_unlock(struct mm_struct *mm);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
Index: linux-2.6/mm/mmap.c
===
--- linux-2.6.orig/mm/mmap.c2008-04-04 14:55:03.477593980 -0700
+++ linux-2.6/mm/mmap.c 2008-04-04 14:59:05.505395402 -0700
@@ -2242,3 +2242,69 @@ int install_special_mapping(struct mm_st
 
return 0;
 }
+
+static void mm_lock_unlock(struct mm_struct *mm, int lock)
+{
+	struct vm_area_struct *vma;
+	spinlock_t *i_mmap_lock_last, *anon_vma_lock_last;
+
+	i_mmap_lock_last = NULL;
+	for (;;) {
+		spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->vm_file && vma->vm_file->f_mapping &&
+			    (unsigned long) i_mmap_lock >
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock &&
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock >
+			    (unsigned long) i_mmap_lock_last)
+				i_mmap_lock =
+					&vma->vm_file->f_mapping->i_mmap_lock;
+		if (i_mmap_lock == (spinlock_t *) -1UL)
+			break;
+		i_mmap_lock_last = i_mmap_lock;
+		if (lock)
+			spin_lock(i_mmap_lock);
+		else
+			spin_unlock(i_mmap_lock);
+	}
+
+	anon_vma_lock_last = NULL;
+	for (;;) {
+		spinlock_t *anon_vma_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->anon_vma &&
+			    (unsigned long) anon_vma_lock >
+			    (unsigned long) &vma->anon_vma->lock &&
+			    (unsigned long) &vma->anon_vma->lock >
+			    (unsigned long) anon_vma_lock_last)
+				anon_vma_lock = &vma->anon_vma->lock;
+		if (anon_vma_lock == (spinlock_t *) -1UL)
+			break;
+		anon_vma_lock_last = anon_vma_lock;
+		if (lock)
+			spin_lock(anon_vma_lock);
+		else
+			spin_unlock(anon_vma_lock);
+	}
+}
+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults. The holder
+ * must not hold any mm related lock. A single task can't take more
+ * than one mm lock in a row or it would deadlock.
+ */
+void mm_lock(struct mm_struct * mm)
+{
+	down_write(&mm->mmap_sem);
+	mm_lock_unlock(mm, 1);
+}
+
+void mm_unlock(struct mm_struct *mm)
+{
+	mm_lock_unlock(mm, 0);
+	up_write(&mm->mmap_sem);
+}
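
The two loops in mm_lock_unlock() above take every distinct lock exactly
once, in ascending address order, by repeatedly scanning for the smallest
address that is still greater than the last one handled. A user-space
sketch of just that selection logic, with plain integers standing in for
lock addresses (ordered_unique() is a hypothetical name, and no real
locking is done):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Visit each distinct value in `locks` once, in ascending order.
 * The visiting order is written to `out` (which must have room for n
 * entries) and the number of distinct values is returned.
 * Values 0 and UINTPTR_MAX are reserved as sentinels, as in the patch.
 */
size_t ordered_unique(const uintptr_t *locks, size_t n, uintptr_t *out)
{
	uintptr_t last = 0;	/* plays the role of the NULL "last" pointer */
	size_t count = 0;

	for (;;) {
		/* plays the role of (spinlock_t *) -1UL */
		uintptr_t next = UINTPTR_MAX;

		for (size_t i = 0; i < n; i++)
			if (locks[i] < next && locks[i] > last)
				next = locks[i];
		if (next == UINTPTR_MAX)
			break;		/* nothing left above `last` */
		last = next;
		out[count++] = next;	/* the patch locks or unlocks here */
	}
	return count;
}
```

Duplicates (several vmas sharing one mapping or anon_vma) are skipped
because a repeated address is never greater than `last`. The cost is one
full scan per distinct lock, which trades speed for needing no extra
memory.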


Re: [ofa-general] [PATCH 2 of 2] mlx4: update module version and release date (for 2.6.25)

2008-04-04 Thread Roland Dreier

thanks, applied both this and mthca equivalent

[ofa-general] XmtDiscards

2008-04-04 Thread Bernd Schubert

Hello,

after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten
much better there, at least no further RcvSwRelayErrors, even when the
cluster is in idle state and so far also no SymbolErrors, which we also have
seen before.

However, after I just started a lustre stress test on 50 clients (to a lustre
storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about
9000 XmtDiscards within 30 minutes.

Searching for this error I find "This is a symptom of congestion and may
require tweaking either HOQ or switch lifetime values."
Well, I have to admit I neither know what HOQ is, nor do I know how to tweak
it. I also have no idea how to set switch lifetime values.  I guess this
isn't related to the opensm timeout option, is it?

Hmm, I just found a Cisco PDF describing how to set the lifetime on these
switches, but is this also possible on Flextronics switches?


Thanks for any help,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH

[ofa-general] Re: [PATCH] mlx4: make firmware diagnostic counters available via sysfs

2008-04-04 Thread Roland Dreier

 > +int mlx4_query_diag_counters(struct mlx4_dev *dev, int array_length,
 > +			     int in_modifier, unsigned int in_offset[],
 > +			     u32 counter_out[])
 > +{
 > +	struct mlx4_cmd_mailbox *mailbox;
 > +	u32 *outbox;
 > +	u32 op_modifer = (u32)in_modifier;

This coding style looks strange to me... you have an int parameter
in_modifier that is not used for anything except to assign it to a u32
op_modifer [sic] variable with a (u32) cast that doesn't do anything.

Why not just have op_modifier be the parameter in the first place?

Also the array_length stuff looks kind of funny since you only ever pass
in a value of 1... why not just pass in int offset and u32 *counter?

 > +	/* clear counters file, can't read it */
 > +	if(offset < 0)
 > +		return sprintf(buf,"This file is write only\n");

Why not just set the permissions on the file so it can't be opened for
reading?  This just looks like a recipe for making userspace code go
crazy on unexpected input.

Also watch out for the space in if (

And if I'm understanding correctly, you use a magic offset of -1 for the
clear_diag attribute that makes mlx4_query_diag_counters() read before
the beginning of the output mailbox.

 > +err_diag:
 > +	ib_unregister_device(&ibdev->ib_dev);
 > +
 >  err_reg:
 >  	ib_unregister_device(&ibdev->ib_dev);

This doesn't look like a good idea.
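
The concern with the quoted labels is that err_diag falls through into
err_reg, so the same device would be unregistered twice. A minimal
user-space sketch of the usual unwinding rule, where each label undoes
exactly one setup step, follows. The struct and setup_device() are
hypothetical, and the sketch assumes the diag step is the one that runs
right after registration.

```c
#include <assert.h>

/* Hypothetical bookkeeping so the unwinding can be observed. */
struct trace {
	int registered;		/* 1 while the device is registered */
	int unregister_calls;	/* how often the undo step ran */
};

/*
 * Two-step setup with goto-based unwinding. If `fail_diag` is nonzero
 * the second step fails and only the first step is undone, once.
 */
int setup_device(struct trace *t, int fail_diag)
{
	int err;

	t->registered = 1;		/* step 1: stands in for registration */

	err = fail_diag ? -1 : 0;	/* step 2: stands in for the diag files */
	if (err)
		goto err_reg;		/* step 2 left nothing to undo */

	return 0;

err_reg:
	t->registered = 0;		/* undo step 1 exactly once */
	t->unregister_calls++;
	return err;
}
```

With this shape a failure in the diag step can jump straight to the
existing err_reg label, and no separate err_diag label that repeats the
unregister call is needed.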

 - R.

RE: [ofa-general] XmtDiscards

2008-04-04 Thread Boris Shpolyansky

Hi Bernd,

You can configure the HOQ (Head-Of-Queue-Lifetime) value programmed in
any switch in the fabric managed by OpenSM following these simple steps:

1. Stop the SM
/etc/init.d/opensmd stop

2. Run the SM manually with the -c option (to dump its default
configuration to a file)
opensm -c

3. Kill the SM with ^C

4. The configuration is saved in /var/cache/opensm/opensm.opts. Open the
file and look for head_of_queue_lifetime. Change the value and save the
file.

5. Restart the SM
/etc/init.d/opensmd start

P.S. You might find 'opensm -h' and 'man opensm' useful.



Hope this helps,

Boris Shpolyansky
Sr. Member of Technical Staff
Applications
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Bernd
Schubert
Sent: Friday, April 04, 2008 3:13 PM
To: OpenIB
Subject: [ofa-general] XmtDiscards

Hello,

after I upgraded one of our clusters to opensm-3.2.1 it seems to have
gotten much better there, at least no further RcvSwRelayErrors, even
when the cluster is in idle state and so far also no SymbolErrors, which
we also have seen before.

However, after I just started a lustre stress test on 50 clients (to a
lustre storage system with 20 OSS servers and 60 OSTs), ibcheckerrors
reports about 9000 XmtDiscards within 30 minutes.

Searching for this error I find "This is a symptom of congestion and may
require tweaking either HOQ or switch lifetime values."
Well, I have to admit I neither know what HOQ is, nor do I know how to
tweak it. I also have no idea how to set switch lifetime values.  I
guess this isn't related to the opensm timeout option, is it?

Hmm, I just found a Cisco PDF describing how to set the lifetime on
these switches, but is this also possible on Flextronics switches?


Thanks for any help,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH

Re: [ofa-general] XmtDiscards

2008-04-04 Thread Ira Weiny

On Sat, 5 Apr 2008 00:12:39 +0200
Bernd Schubert [EMAIL PROTECTED] wrote:

> Hello,
>
> after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten
> much better there, at least no further RcvSwRelayErrors, even when the
> cluster is in idle state and so far also no SymbolErrors, which we also have
> seen before.
>
> However, after I just started a lustre stress test on 50 clients (to a lustre
> storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about
> 9000 XmtDiscards within 30 minutes.

Yea, those are bad.

 
> Searching for this error I find "This is a symptom of congestion and may
> require tweaking either HOQ or switch lifetime values."
> Well, I have to admit I neither know what HOQ is, nor do I know how to tweak
> it. I also have no idea how to set switch lifetime values.  I guess this
> isn't related to the opensm timeout option, is it?

Yes you should adjust these values.

 
> Hmm, I just found a Cisco PDF describing how to set the lifetime on these
> switches, but is this also possible on Flextronics switches?
 

I don't know about the Vendor SMs but in opensm look for the following options
in the opensm.opts file (Default path is: /var/cache/opensm):

   # The code of maximal time a packet can wait at the head of
   # transmission queue.
   # The actual time is 4.096usec * 2^head_of_queue_lifetime
   # The value 0x14 disables this mechanism
   head_of_queue_lifetime 0x12
   
   # The maximal time a packet can wait at the head of queue on
   # switch port connected to a CA or router port
   leaf_head_of_queue_lifetime 0x0c
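
As a worked example of the formula in the comment above (4.096 usec times
2 to the power of the code), here is a small sketch that converts a
lifetime code to nanoseconds. The function name is hypothetical, and
treating every code from 0x14 upward as "disabled" is an assumption; the
comment above only states it for 0x14 itself.

```c
#include <assert.h>
#include <stdint.h>

#define HOQ_DISABLED_CODE 0x14u	/* per the opensm.opts comment */

/*
 * Convert a head_of_queue_lifetime code to nanoseconds.
 * 4.096 usec is 4096 ns, so the result is 4096 * 2^code.
 * Returns 0 when the code disables the mechanism (assumed for all
 * codes >= 0x14).
 */
uint64_t hoq_lifetime_ns(unsigned code)
{
	if (code >= HOQ_DISABLED_CODE)
		return 0;
	return (uint64_t)4096 << code;
}
```

With the defaults shown, 0x12 gives 1073741824 ns (about 1.07 s) and the
leaf value 0x0c gives 16777216 ns (about 16.8 ms). Each step down in the
code halves the time a packet may wait at the head of the queue.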

Ira

[ofa-general] [patch 00/10] [RFC] EMM Notifier V3

2008-04-04 Thread Christoph Lameter

V2->V3:
- Fix rcu issues
- Fix emm_referenced handling
- Use Andrea's mm_lock/unlock to prevent registration races.
- Keep simple API since there does not seem to be a need to add additional
  callbacks (mm_lock does not require callbacks like emm_start/stop that
  I envisioned).
- Reduce CC list (the volume we are producing here must be annoying...).

V1->V2:
- Additional optimizations in the VM
- Convert vm spinlocks to rw sems.
- Add XPMEM driver (requires sleeping in callbacks)
- Add XPMEM example

This patch implements a simple callback for device drivers that establish
their own references to pages (KVM, GRU, XPmem, RDMA/Infiniband, DMA engines
etc). These references are unknown to the VM (therefore external).

With these callbacks it is possible for the device driver to release external
references when the VM requests it. This enables swapping, page migration and
allows support of remapping, permission changes etc etc for the externally
mapped memory.

With this functionality it also becomes possible to avoid pinning or mlocking
pages (commonly done to stop the VM from unmapping device mapped pages).

A device driver must subscribe to a process using

emm_register_notifier(struct emm_notifier *, struct mm_struct *)


The VM will then perform callbacks for operations that unmap or change
permissions of pages in that address space. When the process terminates
the callback function is called with emm_release.

Callbacks are performed before and after the unmapping action of the VM.

emm_invalidate_start    before

emm_invalidate_end      after

The device driver must hold off establishing new references to pages
in the range specified between a callback with emm_invalidate_start and
the subsequent call with emm_invalidate_end set. This allows the VM to
ensure that no concurrent driver actions are performed on an address
range while performing remapping or unmapping operations.
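
To make the start/end rule concrete, here is a minimal userspace sketch of
what a driver callback could look like. The enum and struct emm_notifier
mirror the declarations that patch 02 adds to include/linux/rmap.h; everything
prefixed demo_ is hypothetical. The sketch holds off all new mappings while
any invalidation is in flight instead of tracking the affected range, does no
locking, and assumes that returning 0 is an acceptable result for every
operation.

```c
#include <stddef.h>

/* Stand-ins so the sketch compiles outside the kernel. */
struct mm_struct;

enum emm_operation {
	emm_release,
	emm_invalidate_start,
	emm_invalidate_end,
	emm_referenced
};

struct emm_notifier {
	int (*callback)(struct emm_notifier *e, struct mm_struct *mm,
			enum emm_operation op,
			unsigned long start, unsigned long end);
	struct emm_notifier *next;
};

/* Hypothetical driver state: the notifier plus a count of
 * invalidations that have started but not yet ended. */
struct demo_driver {
	struct emm_notifier notifier;	/* must stay the first member */
	int invalidate_in_progress;
	int released;
};

/* Hypothetical callback. A real driver would also drop its external
 * references to [start, end) on emm_invalidate_start. */
static int demo_callback(struct emm_notifier *e, struct mm_struct *mm,
			 enum emm_operation op,
			 unsigned long start, unsigned long end)
{
	struct demo_driver *drv = (struct demo_driver *)e;

	(void)mm;
	(void)start;
	(void)end;

	switch (op) {
	case emm_invalidate_start:
		drv->invalidate_in_progress++;
		break;
	case emm_invalidate_end:
		drv->invalidate_in_progress--;
		break;
	case emm_release:
		drv->released = 1;
		break;
	case emm_referenced:
		break;	/* this sketch always reports "not referenced" */
	}
	return 0;
}

/* New external references are allowed only while no invalidation
 * is in flight and the process has not exited. */
static int demo_may_map(const struct demo_driver *drv)
{
	return drv->invalidate_in_progress == 0 && !drv->released;
}
```

In a real driver the notifier would be registered with
emm_register_notifier() and the counter would need protection against
concurrent callbacks.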


This patchset contains additional modifications needed to ensure
that the callbacks can sleep. For that purpose two key locks in the vm
need to be converted to rw_sems. These patches are brand new, invasive
and need extensive discussion and evaluation.

The first patch alone may be applied if callbacks in atomic context are
sufficient for a device driver (likely the case for KVM and GRU and simple
DMA drivers).

Following the VM modifications is the XPMEM device driver that allows sharing
of memory between processes running on different instances of Linux. This is
also a prototype. It is known to run trivial sample programs included as the
last patch.



[ofa-general] [patch 01/10] emm: mm_lock: Lock a process against reclaim

2008-04-04 Thread Christoph Lameter

Provide a way to lock an mm_struct against reclaim (try_to_unmap
etc). This is necessary for the invalidate notifier approaches so
that they can reliably add and remove a notifier.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm.h |   10 
 mm/mmap.c  |   66 +
 2 files changed, 76 insertions(+)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2008-04-02 11:41:47.741678873 -0700
+++ linux-2.6/include/linux/mm.h2008-04-04 15:02:17.660504756 -0700
@@ -1050,6 +1050,16 @@ extern int install_special_mapping(struc
   unsigned long addr, unsigned long len,
   unsigned long flags, struct page **pages);
 
+/*
+ * Locking and unlocking an mm against reclaim.
+ *
+ * mm_lock will take mmap_sem writably (to prevent additional vmas from being
+ * added) and then take all mapping locks of the existing vmas. With that
+ * reclaim is effectively stopped.
+ */
+extern void mm_lock(struct mm_struct *mm);
+extern void mm_unlock(struct mm_struct *mm);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
Index: linux-2.6/mm/mmap.c
===
--- linux-2.6.orig/mm/mmap.c2008-04-04 14:55:03.477593980 -0700
+++ linux-2.6/mm/mmap.c 2008-04-04 14:59:05.505395402 -0700
@@ -2242,3 +2242,69 @@ int install_special_mapping(struct mm_st
 
return 0;
 }
+
+static void mm_lock_unlock(struct mm_struct *mm, int lock)
+{
+	struct vm_area_struct *vma;
+	spinlock_t *i_mmap_lock_last, *anon_vma_lock_last;
+
+	i_mmap_lock_last = NULL;
+	for (;;) {
+		spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->vm_file && vma->vm_file->f_mapping &&
+			    (unsigned long) i_mmap_lock >
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock &&
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock >
+			    (unsigned long) i_mmap_lock_last)
+				i_mmap_lock =
+					&vma->vm_file->f_mapping->i_mmap_lock;
+		if (i_mmap_lock == (spinlock_t *) -1UL)
+			break;
+		i_mmap_lock_last = i_mmap_lock;
+		if (lock)
+			spin_lock(i_mmap_lock);
+		else
+			spin_unlock(i_mmap_lock);
+	}
+
+	anon_vma_lock_last = NULL;
+	for (;;) {
+		spinlock_t *anon_vma_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->anon_vma &&
+			    (unsigned long) anon_vma_lock >
+			    (unsigned long) &vma->anon_vma->lock &&
+			    (unsigned long) &vma->anon_vma->lock >
+			    (unsigned long) anon_vma_lock_last)
+				anon_vma_lock = &vma->anon_vma->lock;
+		if (anon_vma_lock == (spinlock_t *) -1UL)
+			break;
+		anon_vma_lock_last = anon_vma_lock;
+		if (lock)
+			spin_lock(anon_vma_lock);
+		else
+			spin_unlock(anon_vma_lock);
+	}
+}
+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults. The holder
+ * must not hold any mm related lock. A single task can't take more
+ * than one mm lock in a row or it would deadlock.
+ */
+void mm_lock(struct mm_struct * mm)
+{
+	down_write(&mm->mmap_sem);
+	mm_lock_unlock(mm, 1);
+}
+
+void mm_unlock(struct mm_struct *mm)
+{
+	mm_lock_unlock(mm, 0);
+	up_write(&mm->mmap_sem);
+}


[ofa-general] [patch 06/10] emm: Convert anon_vma lock to rw_sem and refcount

2008-04-04 Thread Christoph Lameter

Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap and page_mkclean. It also
allows the calling of sleeping functions from reverse map traversal.

An additional complication is that rcu is used in some contexts to guarantee
the presence of the anon_vma while we acquire the lock. We cannot take a
semaphore within an rcu critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.
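
The general idea of taking a speculative reference can be sketched in
userspace with C11 atomics. This is not the code from the patch (the patch
uses atomic_inc and atomic_dec_and_test through get_anon_vma/put_anon_vma,
and the body of grab_anon_vma is not shown here); the demo_ names and the
"take a reference only if the count is still non-zero" loop are assumptions
made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object kept alive by a refcount. */
struct demo_obj {
	atomic_int refcount;	/* 0 means the object is going away */
};

/* Take a reference only if the count is still non-zero. A caller
 * that gets true may leave its read-side critical section and keep
 * using the object until it calls demo_put(). */
static bool demo_try_get(struct demo_obj *obj)
{
	int old = atomic_load(&obj->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&obj->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference. Returns true when the caller dropped the last
 * one and is therefore responsible for freeing the object. */
static bool demo_put(struct demo_obj *obj)
{
	return atomic_fetch_sub(&obj->refcount, 1) == 1;
}
```

Once the reference is held, the existence of the object no longer depends on
staying inside the RCU section, which is what allows the section to be
shortened.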

Issues:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration). There is also the more frequent processor
  change due to up_xxx letting waiting tasks run first.
  This results in, f.e., the Aim9 brk performance test going down by 10-15%.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/rmap.h |   20 ---
 mm/migrate.c |   26 ++---
 mm/mmap.c|   28 +-
 mm/rmap.c|   53 +--
 4 files changed, 73 insertions(+), 54 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===
--- linux-2.6.orig/include/linux/rmap.h 2008-04-04 15:09:45.403759876 -0700
+++ linux-2.6/include/linux/rmap.h  2008-04-04 15:16:54.318714568 -0700
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;/* Serialize access to vma list */
 	struct list_head head;	/* List of private related vmas */
 };
 
@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct 
kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+	atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->refcount))
+		anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 }
 
 /*
Index: linux-2.6/mm/migrate.c
===
--- linux-2.6.orig/mm/migrate.c 2008-04-04 15:09:45.443760619 -0700
+++ linux-2.6/mm/migrate.c  2008-04-04 15:16:54.318714568 -0700
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
return;
 
/*
-* We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+* We hold either the mmap_sem lock or a reference on the
+* anon_vma. So no need to call page_lock_anon_vma.
 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	down_read(&anon_vma->sem);
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&anon_vma->lock);
+	up_read(&anon_vma->sem);
 }
 
 /*
@@ -623,7 +624,7 @@ static int unmap_and_move(new_page_t get
int rc = 0;
int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
-   int rcu_locked = 0;
+   struct anon_vma *anon_vma = NULL;
int charge = 0;
 
if (!newpage)
@@ -647,16 +648,14 @@ static int unmap_and_move(new_page_t get
}
/*
 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-* we cannot notice that anon_vma is freed while we migrates a page.
+* we cannot notice that anon_vma is freed while we migrate a page.
 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 * of migration. File cache pages are no

[ofa-general] [patch 04/10] emm: Convert i_mmap_lock to i_mmap_sem

2008-04-04 Thread Christoph Lameter

The conversion to a rwsem allows callbacks during rmap traversal
for files in a non atomic context. A rw style lock also allows concurrent
walking of the reverse map. This is fairly straightforward if one removes
pieces of the resched checking.

[Restarting unmapping is an issue to be discussed].

This slightly increases Aim9 performance results on an 8p.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 arch/x86/mm/hugetlbpage.c |4 ++--
 fs/hugetlbfs/inode.c  |4 ++--
 fs/inode.c|2 +-
 include/linux/fs.h|2 +-
 include/linux/mm.h|2 +-
 kernel/fork.c |4 ++--
 mm/filemap.c  |8 
 mm/filemap_xip.c  |4 ++--
 mm/fremap.c   |4 ++--
 mm/hugetlb.c  |   10 +-
 mm/memory.c   |   29 +
 mm/migrate.c  |4 ++--
 mm/mmap.c |   43 ++-
 mm/mremap.c   |4 ++--
 mm/rmap.c |   20 +---
 15 files changed, 66 insertions(+), 78 deletions(-)

Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c	2008-04-02 11:41:47.601676490 -0700
+++ linux-2.6/arch/x86/mm/hugetlbpage.c 2008-04-04 15:09:11.715211829 -0700
@@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str
if (!vma_shareable(vma, addr))
return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str
put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
Index: linux-2.6/fs/hugetlbfs/inode.c
===
--- linux-2.6.orig/fs/hugetlbfs/inode.c 2008-04-02 11:41:47.605676583 -0700
+++ linux-2.6/fs/hugetlbfs/inode.c  2008-04-04 15:09:11.743212273 -0700
@@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
truncate_hugepages(inode, offset);
return 0;
 }
Index: linux-2.6/fs/inode.c
===
--- linux-2.6.orig/fs/inode.c   2008-04-02 11:41:47.613676625 -0700
+++ linux-2.6/fs/inode.c2008-04-04 15:09:11.755212477 -0700
@@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+	init_rwsem(&inode->i_data.i_mmap_sem);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h   2008-04-02 11:41:47.621676899 -0700
+++ linux-2.6/include/linux/fs.h2008-04-04 15:09:11.755212477 -0700
@@ -503,7 +503,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	struct rw_semaphore	i_mmap_sem;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2008-04-04 15:09:11.687211361 -0700
+++ linux-2.6/include/linux/mm.h2008-04-04 15:09:45.883767696 -0700
@@ -716,7 +716,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;

[ofa-general] [patch 10/10] xpmem: Simple example

2008-04-04 Thread Christoph Lameter

A simple test program (well actually a pair).  They are fairly easy to use.

NOTE: the xpmem.h is copied from the kernel/drivers/misc/xp/xpmem.h
file.

Type make.  Then from one session, type ./A1.  Grab the first
line of output which should begin with ./A2 and paste the whole line
into a second session.  Paste as many times as you like.  Each pass will
increment the value one additional time.  When you are tired, hit enter
in the first window.  You should see the same value printed from A1 as
you most recently received from A2.
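
The ./A2 command line that A1 prints carries the 64-bit segment id as four
decimal numbers. Assuming the shifts by 48, 32 and 16 in A1.c and A2.c
delimit 16-bit fields, the split and the reassembly can be sketched as
follows. The function names are made up for this example, and int64_t stands
in for the __s64 type used by the test programs.

```c
#include <stdint.h>

/* Split a 64-bit segment id into four 16-bit fields, most
 * significant first, the way A1 prints them. */
static void segid_split(int64_t segid, int field[4])
{
	uint64_t u = (uint64_t)segid;

	field[0] = (int)(u >> 48 & 0xffff);
	field[1] = (int)(u >> 32 & 0xffff);
	field[2] = (int)(u >> 16 & 0xffff);
	field[3] = (int)(u & 0xffff);
}

/* Rebuild the segment id from the four fields, the way A2 does
 * with its command-line arguments. The arithmetic is done on an
 * unsigned type so a set top bit cannot overflow a signed shift. */
static int64_t segid_join(const int field[4])
{
	uint64_t u = (uint64_t)(field[0] & 0xffff) << 48 |
		     (uint64_t)(field[1] & 0xffff) << 32 |
		     (uint64_t)(field[2] & 0xffff) << 16 |
		     (uint64_t)(field[3] & 0xffff);

	return (int64_t)u;
}
```

Splitting and joining are inverses, so the id that A2 passes to
XPMEM_CMD_GET is the one A1 received from XPMEM_CMD_MAKE.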

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 xpmem_test/A1.c |   64 +
 xpmem_test/A2.c |   70 
 xpmem_test/Makefile |   14 +
 xpmem_test/xpmem.h  |  130 
 4 files changed, 278 insertions(+)

Index: linux-2.6/xpmem_test/A1.c
===
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/xpmem_test/A1.c   2008-04-04 15:09:11.955215737 -0700
@@ -0,0 +1,64 @@
+/*
+ *  Simple test program.  Makes a segment then waits for an input line
+ * and finally prints the value of the first integer of that segment.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stropts.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "xpmem.h"
+
+int xpmem_fd;
+
+int
+main(int argc, char **argv)
+{
+	char input[32];
+	struct xpmem_cmd_make make_info;
+	int *data_block;
+	int ret;
+	__s64 segid;
+
+	xpmem_fd = open("/dev/xpmem", O_RDWR);
+	if (xpmem_fd == -1) {
+		perror("Opening /dev/xpmem");
+		return -1;
+	}
+
+	data_block = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+			  MAP_SHARED | MAP_ANONYMOUS, 0, 0);
+	if (data_block == MAP_FAILED) {
+		perror("Creating mapping.");
+		return -1;
+	}
+	data_block[0] = 1;
+
+	make_info.vaddr = (__u64) data_block;
+	make_info.size = getpagesize();
+	make_info.permit_type = XPMEM_PERMIT_MODE;
+	make_info.permit_value = (__u64) 0600;
+	ret = ioctl(xpmem_fd, XPMEM_CMD_MAKE, &make_info);
+	if (ret != 0) {
+		perror("xpmem_make");
+		return -1;
+	}
+
+	segid = make_info.segid;
+	printf("./A2 %d %d %d %d\ndata_block[0] = %d\n",
+	       (int)(segid >> 48 & 0xffff), (int)(segid >> 32 & 0xffff),
+	       (int)(segid >> 16 & 0xffff), (int)(segid & 0xffff),
+	       data_block[0]);
+	printf("Waiting for input before exiting.\n");
+	fscanf(stdin, "%s", input);
+
+	printf("data_block[0] = %d\n", data_block[0]);
+
+	return 0;
+}
Index: linux-2.6/xpmem_test/A2.c
===
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/xpmem_test/A2.c   2008-04-04 15:09:11.955215737 -0700
@@ -0,0 +1,70 @@
+/*
+ * Simple test program that gets then attaches an xpmem segment identified
+ * on the command line then increments the first integer of that buffer by
+ * one and exits.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stropts.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "xpmem.h"
+
+int xpmem_fd;
+
+int
+main(int argc, char **argv)
+{
+	int ret;
+	__s64 segid;
+	__s64 apid;
+	struct xpmem_cmd_get get_info;
+	struct xpmem_cmd_attach attach_info;
+	int *attached_buffer;
+
+	xpmem_fd = open("/dev/xpmem", O_RDWR);
+	if (xpmem_fd == -1) {
+		perror("Opening /dev/xpmem");
+		return -1;
+	}
+
+	segid = (__s64) atoi(argv[1]) << 48;
+	segid |= (__s64) atoi(argv[2]) << 32;
+	segid |= (__s64) atoi(argv[3]) << 16;
+	segid |= (__s64) atoi(argv[4]);
+	get_info.segid = segid;
+	get_info.flags = XPMEM_RDWR;
+	get_info.permit_type = XPMEM_PERMIT_MODE;
+	get_info.permit_value = (__u64) NULL;
+	ret = ioctl(xpmem_fd, XPMEM_CMD_GET, &get_info);
+	if (ret != 0) {
+		perror("xpmem_get");
+		return -1;
+	}
+	apid = get_info.apid;
+
+	attach_info.apid = get_info.apid;
+	attach_info.offset = 0;
+	attach_info.size = getpagesize();
+	attach_info.vaddr = (__u64) NULL;
+	attach_info.fd = xpmem_fd;
+	attach_info.flags = 0;
+
+	ret = ioctl(xpmem_fd, XPMEM_CMD_ATTACH, &attach_info);
+	if (ret != 0) {
+		perror("xpmem_attach");
+		return -1;
+	}
+
+	attached_buffer = (int *)attach_info.vaddr;
+	attached_buffer[0]++;
+
+	printf("Just incremented the value to %d\n", attached_buffer[0]);
+	return 0;
+}
Index: linux-2.6/xpmem_test/Makefile

[ofa-general] [patch 02/10] emm: notifier logic

2008-04-04 Thread Christoph Lameter

This patch implements a simple callback for device drivers that establish
their own references to pages (KVM, GRU, XPmem, RDMA/Infiniband, DMA engines
etc). These references are unknown to the VM (therefore external).

With these callbacks it is possible for the device driver to release external
references when the VM requests it. This enables swapping, page migration and
allows support of remapping, permission changes etc etc for externally
mapped memory.

With this functionality it also becomes possible to avoid pinning or mlocking
pages (commonly done to stop the VM from unmapping device mapped pages).

A device driver must subscribe to a process using

emm_register_notifier(struct emm_notifier *, struct mm_struct *)


The VM will then perform callbacks for operations that unmap or change
permissions of pages in that address space. When the process terminates
the callback function is called with emm_release.

Callbacks are performed before and after the unmapping action of the VM.

emm_invalidate_start    before

emm_invalidate_end      after

The device driver must hold off establishing new references to pages
in the range specified between a callback with emm_invalidate_start and
the subsequent call with emm_invalidate_end set. This allows the VM to
ensure that no concurrent driver actions are performed on an address
range while performing remapping or unmapping operations.

Callbacks are mostly performed in a non atomic context. However, in
various places spinlocks are held to traverse rmaps. So this patch here
is only useful for those devices that can remove mappings in an atomic
context (f.e. KVM/GRU).

If the rmap spinlocks are converted to semaphores then all callbacks will
be performed in a nonatomic context. No additional changes will be necessary
to this patchset.

V1->V2:
- page_referenced_one: Do not increment reference count if it is already
  != 0.
- Use rcu_assign_pointer and rcu_dereference instead of putting in our
  own barriers.

V2->V3:
- Fix rcu (thanks Paul)
- Fix exit code handling to come up with the right semantics for emm_referenced
  (thanks Andrea)
- Call mm_lock/mm_unlock to protect against registration races.

Acked-by: Paul E. McKenney [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm_types.h |3 +
 include/linux/rmap.h |   50 +++
 kernel/fork.c|3 +
 mm/Kconfig   |5 ++
 mm/filemap_xip.c |4 +
 mm/fremap.c  |2 
 mm/hugetlb.c |3 +
 mm/memory.c  |   42 +++
 mm/mmap.c|3 +
 mm/mprotect.c|3 +
 mm/mremap.c  |4 +
 mm/rmap.c|  100 ++-
 12 files changed, 212 insertions(+), 10 deletions(-)

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h	2008-04-04 14:55:03.441593394 -0700
+++ linux-2.6/include/linux/mm_types.h  2008-04-04 15:07:38.857699751 -0700
@@ -225,6 +225,9 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_EMM_NOTIFIER
+   struct emm_notifier *emm_notifier;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/mm/Kconfig
===
--- linux-2.6.orig/mm/Kconfig   2008-04-04 14:55:03.457593678 -0700
+++ linux-2.6/mm/Kconfig2008-04-04 15:07:38.857699751 -0700
@@ -193,3 +193,8 @@ config NR_QUICK
 config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config EMM_NOTIFIER
+   def_bool n
+	bool "External Mapped Memory Notifier for drivers directly mapping memory"
+
Index: linux-2.6/include/linux/rmap.h
===
--- linux-2.6.orig/include/linux/rmap.h 2008-04-04 14:55:03.449593554 -0700
+++ linux-2.6/include/linux/rmap.h  2008-04-04 15:08:51.522883171 -0700
@@ -85,6 +85,56 @@ static inline void page_dup_rmap(struct 
 #endif
 
 /*
+ * Notifier for devices establishing their own references to Linux
+ * kernel pages in addition to the regular mapping via page
+ * table and rmap. The notifier allows the device to drop the mapping
+ * when the VM removes references to pages.
+ */
+enum emm_operation {
+	emm_release,		/* Process exiting, */
+	emm_invalidate_start,	/* Before the VM unmaps pages */
+	emm_invalidate_end,	/* After the VM unmapped pages */
+	emm_referenced		/* Check if a range was referenced */
+};
+
+struct emm_notifier {
+   int (*callback)(struct emm_notifier *e, struct mm_struct *mm,
+   enum emm_operation op,
+   unsigned long start, unsigned long end);
+   struct emm_notifier *next;
+};
+
+extern int __emm_notify(struct mm_struct

[ofa-general] [patch 08/10] xpmem: Locking rules for taking multiple mmap_sem locks.

2008-04-04 Thread Christoph Lameter

This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.
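
The rule can be illustrated in userspace with two pthread mutexes standing in
for two mmap_sems. This is a sketch of the general "lowest address first"
technique, not code from the patch, and the demo_ helpers are hypothetical.
Because every caller ends up taking the locks in the same order regardless of
argument order, two tasks locking the same pair cannot deadlock against each
other.

```c
#include <pthread.h>
#include <stdint.h>

/* Lock two mutexes in ascending address order. */
static void demo_lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if (a == b) {
		pthread_mutex_lock(a);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		pthread_mutex_t *tmp = a;

		a = b;
		b = tmp;
	}
	pthread_mutex_lock(a);	/* lowest-addressed lock first */
	pthread_mutex_lock(b);	/* then the higher-addressed one */
}

/* Release both; the unlock order does not matter for deadlock
 * avoidance. */
static void demo_unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	if (a != b)
		pthread_mutex_unlock(b);
}
```

Build with -pthread. demo_lock_pair(&x, &y) and demo_lock_pair(&y, &x)
acquire the locks in the same order.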

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

---
 mm/filemap.c |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c 2008-04-01 13:02:41.374608387 -0700
+++ linux-2.6/mm/filemap.c  2008-04-01 13:05:02.777015782 -0700
@@ -80,6 +80,9 @@ generic_file_direct_IO(int rw, struct ki
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
  *
+ *    When taking multiple mmap_sems, one should lock the lowest-addressed
+ *    one first proceeding on up to the highest-addressed one.
+ *
  *  ->i_mutex
  *    ->i_alloc_sem		(various)
  *


[ofa-general] [patch 03/10] emm: Move tlb flushing into free_pgtables

2008-04-04 Thread Christoph Lameter

Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables(). Moving the tlb flushing into free_pgtables allows
sleeping in parts of free_pgtables().

This means that we do a tlb_finish_mmu() before freeing the page tables.
Strictly speaking there may not be the need to do another tlb flush after
freeing the tables. But it's the only way to free a series of page table
pages from the tlb list. And we do not want to call into the page allocator
for performance reasons. Aim9 numbers look okay after this patch.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm.h |4 ++--
 mm/memory.c|   14 ++
 mm/mmap.c  |6 +++---
 3 files changed, 15 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2008-03-19 13:30:51.460856986 -0700
+++ linux-2.6/include/linux/mm.h2008-03-19 13:31:20.809377398 -0700
@@ -751,8 +751,8 @@ int walk_page_range(const struct mm_stru
void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-   unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+   unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2008-03-19 13:29:06.007351495 -0700
+++ linux-2.6/mm/memory.c   2008-03-19 13:46:31.352774359 -0700
@@ -271,9 +271,11 @@ void free_pgd_range(struct mmu_gather **
} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-   unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+   unsigned long ceiling)
 {
+   struct mmu_gather *tlb;
+
 	while (vma) {
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long addr = vma->vm_start;
@@ -285,8 +287,10 @@ void free_pgtables(struct mmu_gather **t
unlink_file_vma(vma);
 
 		if (is_vm_hugetlb_page(vma)) {
-			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+			tlb = tlb_gather_mmu(vma->vm_mm, 0);
+			hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
+			tlb_finish_mmu(tlb, addr, vma->vm_end);
 		} else {
/*
 * Optimization: gather nearby vmas into one call down
@@ -298,8 +302,10 @@ void free_pgtables(struct mmu_gather **t
anon_vma_unlink(vma);
unlink_file_vma(vma);
}
-			free_pgd_range(tlb, addr, vma->vm_end,
+			tlb = tlb_gather_mmu(vma->vm_mm, 0);
+			free_pgd_range(&tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
+			tlb_finish_mmu(tlb, addr, vma->vm_end);
}
vma = next;
}
Index: linux-2.6/mm/mmap.c
===
--- linux-2.6.orig/mm/mmap.c2008-03-19 13:29:48.659889667 -0700
+++ linux-2.6/mm/mmap.c 2008-03-19 13:30:36.296604891 -0700
@@ -1750,9 +1750,9 @@ static void unmap_region(struct mm_struc
update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
-				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+				 next? next->vm_start: 0);
emm_notify(mm, emm_invalidate_end, start, end);
 }
 
@@ -2049,8 +2049,8 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
/*
 * Walk the list again, actually closing and freeing it,

[ofa-general] [patch 05/10] emm: Remove tlb pointer from the parameters of unmap vmas

2008-04-04 Thread Christoph Lameter

We no longer abort unmapping in unmap vmas because we can reschedule while
unmapping since we are holding a semaphore. This would allow moving more
of the tlb flushing into unmap_vmas reducing code in various places.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm.h |3 +--
 mm/memory.c|   43 +--
 mm/mmap.c  |   18 +++---
 3 files changed, 21 insertions(+), 43 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2008-04-01 13:02:41.374608387 -0700
+++ linux-2.6/include/linux/mm.h2008-04-01 13:02:43.898651546 -0700
@@ -723,8 +723,7 @@ struct zap_details {
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-   struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
 
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2008-04-01 13:02:41.378608315 -0700
+++ linux-2.6/mm/memory.c   2008-04-01 13:02:43.902651345 -0700
@@ -806,7 +806,6 @@ static unsigned long unmap_page_range(st
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -818,20 +817,13 @@ static unsigned long unmap_page_range(st
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts.  This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
  *
  * The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns.  So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
-   struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
 {
@@ -839,7 +831,15 @@ unsigned long unmap_vmas(struct mmu_gath
unsigned long tlb_start = 0;/* For tlb_finish_mmu */
int tlb_start_valid = 0;
unsigned long start = start_addr;
-	int fullmm = (*tlbp)->fullmm;
+	int fullmm;
+	struct mmu_gather *tlb;
+	struct mm_struct *mm = vma->vm_mm;
+
+	emm_notify(mm, emm_invalidate_start, start_addr, end_addr);
+	lru_add_drain();
+	tlb = tlb_gather_mmu(mm, 0);
+	update_hiwater_rss(mm);
+	fullmm = tlb->fullmm;
 
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -866,7 +866,7 @@ unsigned long unmap_vmas(struct mmu_gath
(HPAGE_SIZE / PAGE_SIZE);
start = end;
} else
-				start = unmap_page_range(*tlbp, vma,
+				start = unmap_page_range(tlb, vma,
 						start, end, &zap_work, details);
 
 			if (zap_work > 0) {
@@ -874,13 +874,15 @@ unsigned long unmap_vmas(struct mmu_gath
break;
}
 
-			tlb_finish_mmu(*tlbp, tlb_start, start);
+			tlb_finish_mmu(tlb, tlb_start, start);
 			cond_resched();
-			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+			tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
+   tlb_finish_mmu(tlb, start_addr, end_addr);
+   emm_notify(mm, emm_invalidate_end, start_addr, end_addr);
return start;   /* which is now the end (or restart) address */
 }
 
@@ -894,21 +896,10 @@ unsigned long unmap_vmas(struct mmu_gath
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct

[ofa-general] [patch 07/10] xpmem: This patch exports zap_page_range as it is needed by XPMEM.

2008-04-04 Thread Christoph Lameter

XPMEM would have used sys_madvise() except that madvise_dontneed()
returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator.  XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

---
 mm/memory.c |1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2008-04-01 13:02:43.902651345 -0700
+++ linux-2.6/mm/memory.c   2008-04-01 13:04:43.720691616 -0700
@@ -901,6 +901,7 @@ unsigned long zap_page_range(struct vm_a
 
return unmap_vmas(vma, address, end, nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.


[ofa-general] Re: [patch 01/10] emm: mm_lock: Lock a process against reclaim

2008-04-04 Thread Jeremy Fitzhardinge


Christoph Lameter wrote:

Provide a way to lock an mm_struct against reclaim (try_to_unmap
etc). This is necessary for the invalidate notifier approaches so
that they can reliably add and remove a notifier.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/mm.h |   10 
 mm/mmap.c  |   66 +
 2 files changed, 76 insertions(+)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2008-04-02 11:41:47.741678873 -0700
+++ linux-2.6/include/linux/mm.h2008-04-04 15:02:17.660504756 -0700
@@ -1050,6 +1050,16 @@ extern int install_special_mapping(struc
   unsigned long addr, unsigned long len,
   unsigned long flags, struct page **pages);
 
+/*

+ * Locking and unlocking am mm against reclaim.
+ *
+ * mm_lock will take mmap_sem writably (to prevent additional vmas from being
+ * added) and then take all mapping locks of the existing vmas. With that
+ * reclaim is effectively stopped.
+ */
+extern void mm_lock(struct mm_struct *mm);
+extern void mm_unlock(struct mm_struct *mm);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned 
long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,

Index: linux-2.6/mm/mmap.c
===
--- linux-2.6.orig/mm/mmap.c2008-04-04 14:55:03.477593980 -0700
+++ linux-2.6/mm/mmap.c 2008-04-04 14:59:05.505395402 -0700
@@ -2242,3 +2242,69 @@ int install_special_mapping(struct mm_st
 
 	return 0;

 }
+
+static void mm_lock_unlock(struct mm_struct *mm, int lock)
+{
+   struct vm_area_struct *vma;
+   spinlock_t *i_mmap_lock_last, *anon_vma_lock_last;
+
+   i_mmap_lock_last = NULL;
+   for (;;) {
+   spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
+   for (vma = mm->mmap; vma; vma = vma->vm_next)
+   if (vma->vm_file && vma->vm_file->f_mapping &&
  

I think you can break this if() down a bit:

if (!(vma->vm_file && vma->vm_file->f_mapping))
continue;



+   (unsigned long) i_mmap_lock >
+   (unsigned long)
+   &vma->vm_file->f_mapping->i_mmap_lock &&
+   (unsigned long)
+   &vma->vm_file->f_mapping->i_mmap_lock >
+   (unsigned long) i_mmap_lock_last)
+   i_mmap_lock =
+   &vma->vm_file->f_mapping->i_mmap_lock;
  


So this is an O(n^2) algorithm to take the i_mmap_locks from low to high 
order?  A comment would be nice.  And O(n^2)?  Ouch.  How often is it 
called?


And is it necessary to mush lock and unlock together?  Unlock ordering 
doesn't matter, so you should just be able to have a much simpler loop, no?




+   if (i_mmap_lock == (spinlock_t *) -1UL)
+   break;
+   i_mmap_lock_last = i_mmap_lock;
+   if (lock)
+   spin_lock(i_mmap_lock);
+   else
+   spin_unlock(i_mmap_lock);
+   }
+
+   anon_vma_lock_last = NULL;
+   for (;;) {
+   spinlock_t *anon_vma_lock = (spinlock_t *) -1UL;
+   for (vma = mm->mmap; vma; vma = vma->vm_next)
+   if (vma->anon_vma &&
+   (unsigned long) anon_vma_lock >
+   (unsigned long) &vma->anon_vma->lock &&
+   (unsigned long) &vma->anon_vma->lock >
+   (unsigned long) anon_vma_lock_last)
+   anon_vma_lock = &vma->anon_vma->lock;
+   if (anon_vma_lock == (spinlock_t *) -1UL)
+   break;
+   anon_vma_lock_last = anon_vma_lock;
+   if (lock)
+   spin_lock(anon_vma_lock);
+   else
+   spin_unlock(anon_vma_lock);
+   }
+}
  




+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults. The holder
+ * must not hold any mm related lock. A single task can't take more
+ * than one mm lock in a row or it would deadlock.
+ */
+void mm_lock(struct mm_struct * mm)
+{
+   down_write(&mm->mmap_sem);
+   mm_lock_unlock(mm, 1);
+}
+
+void mm_unlock(struct mm_struct *mm)
+{
+   mm_lock_unlock(mm, 0);
+   up_write(&mm->mmap_sem);
+}

  



Re: [ofa-general] XmtDiscards

2008-04-04 Thread Bernd Schubert

Hello Boris,


On Fri, Apr 04, 2008 at 03:28:46PM -0700, Boris Shpolyansky wrote:
> Hi Bernd,
>
> You can configure the HOQ (Head-Of-Queue-Lifetime) value programmed in
> any switch in the fabric managed by OpenSM following these simple steps:
>
> 1. Stop the SM
> /etc/init.d/opensmd stop
>
> 2. Run the SM manually with the -c option (to dump its default
> configuration to a file)
> opensm -c
>
> 3. Kill the SM with ^C
>
> 4. The configuration is saved in /var/cache/opensm/opensm.opts. Open the
> file and look for head_of_queue_lifetime. Change the value and save the
> file.
>
> 5. Restart the SM
> /etc/init.d/opensmd start

thanks a lot for your help. This did help quite a lot.

 
> P.S. You might find 'opensm -h' and 'man opensm' useful.

Sorry about my dumb question, I did read the man page of opensm quite often 
already, but --cache-options and OSM_CACHE_DIR did activate my 
brain-internal filter to entirely skip this part of the man page ;)
Somehow I associated cache with opensm-performance, but not at all with 
options...


Thanks again,
Bernd

[ofa-general] [PATCH 2/4][v2] dapl: add support for logging errors in non-debug build.

2008-04-04 Thread Davis, Arlin R

Add debug logging (stdout, syslog) for error cases during
device open, cm, async, and dto operations. Default settings
are ERR for DAPL_DBG_TYPE, and stdout for DAPL_DBG_DEST.

Change default configuration to build non-debug.

Signed-off by: Arlin Davis [EMAIL PROTECTED]
---
 configure.in   |4 +-
 dapl/common/dapl_debug.c   |2 -
 dapl/common/dapl_evd_util.c|8 +-
 dapl/include/dapl_debug.h  |   10 ++-
 dapl/openib_cma/dapl_ib_cm.c   |  196 +++-
 dapl/openib_cma/dapl_ib_util.c |   87 +-
 dapl/udapl/dapl_init.c |   16 +++-
 dapl/udapl/linux/dapl_osd.h|2 +-
 8 files changed, 179 insertions(+), 146 deletions(-)

diff --git a/configure.in b/configure.in
index eaf597b..d1c2664 100644
--- a/configure.in
+++ b/configure.in
@@ -42,12 +42,12 @@ AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test $ac_cv_version_script = yes)
 
 dnl Support debug mode build - if enable-debug provided the DEBUG variable is set 
 AC_ARG_ENABLE(debug,
-[  --enable-debug Turn on debug mode, default=on],
+[  --enable-debug Turn on debug mode, default=off],
 [case ${enableval} in
   yes) debug=true ;;
   no)  debug=false ;;
   *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;;
-esac],[debug=true])
+esac],[debug=false])
 AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
 
 dnl Support ib_extension build - if enable-ext-type == ib 
diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c
index 7ddce52..cbc356c 100644
--- a/dapl/common/dapl_debug.c
+++ b/dapl/common/dapl_debug.c
@@ -32,7 +32,6 @@
 #include <stdlib.h>
 #endif /* __KDAPL__ */
 
-#ifdef DAPL_DBG
 DAPL_DBG_TYPE g_dapl_dbg_type; /* initialized in dapl_init.c */
 DAPL_DBG_DEST g_dapl_dbg_dest; /* initialized in dapl_init.c */
 
@@ -117,5 +116,4 @@ void dapl_dump_cntr( int cntr )
 }
 
 #endif /* DAPL_COUNTERS */
-#endif
 
diff --git a/dapl/common/dapl_evd_util.c b/dapl/common/dapl_evd_util.c
index a993b02..2ae1b59 100755
--- a/dapl/common/dapl_evd_util.c
+++ b/dapl/common/dapl_evd_util.c
@@ -1209,10 +1209,10 @@ dapli_evd_cqe_to_event (
dapl_os_unlock ( &ep_ptr->header.lock );
}
 
-   dapl_dbg_log (DAPL_DBG_TYPE_DTO_COMP_ERR,
-  " DTO completion ERROR: %d: op %#x (ep disconnected)\n",
- DAPL_GET_CQE_STATUS (cqe_ptr),
- DAPL_GET_CQE_OPTYPE (cqe_ptr));
+   dapl_log(DAPL_DBG_TYPE_ERR,
+"DTO completion ERR: status %d, opcode %s \n",
+DAPL_GET_CQE_STATUS(cqe_ptr),
+DAPL_GET_CQE_OP_STR(cqe_ptr));
 }
 }
 
diff --git a/dapl/include/dapl_debug.h b/dapl/include/dapl_debug.h
index 76db8fd..f0de7c8 100644
--- a/dapl/include/dapl_debug.h
+++ b/dapl/include/dapl_debug.h
@@ -75,14 +75,16 @@ typedef enum
 DAPL_DBG_DEST_SYSLOG   = 0x0002,
 } DAPL_DBG_DEST;
 
-
-#if defined(DAPL_DBG)
-
 extern DAPL_DBG_TYPE   g_dapl_dbg_type;
 extern DAPL_DBG_DEST   g_dapl_dbg_dest;
 
+extern void dapl_internal_dbg_log(DAPL_DBG_TYPE type,  const char *fmt, ...);
+
+#define dapl_log g_dapl_dbg_type==0 ? (void) 1 : dapl_internal_dbg_log
+
+#if defined(DAPL_DBG)
+
 #define dapl_dbg_log g_dapl_dbg_type==0 ? (void) 1 : dapl_internal_dbg_log
-extern void dapl_internal_dbg_log ( DAPL_DBG_TYPE type,  const char *fmt,  ...);
 
 #else  /* !DAPL_DBG */
 
diff --git a/dapl/openib_cma/dapl_ib_cm.c b/dapl/openib_cma/dapl_ib_cm.c
index a040ffb..33f299d 100755
--- a/dapl/openib_cma/dapl_ib_cm.c
+++ b/dapl/openib_cma/dapl_ib_cm.c
@@ -95,9 +95,9 @@ static void dapli_addr_resolve(struct dapl_cm_id *conn)

ret =  rdma_resolve_route(conn->cm_id, conn->route_timeout);
if (ret) {
-   dapl_dbg_log(DAPL_DBG_TYPE_ERR, 
- " rdma_connect failed: %s\n",strerror(errno));
-
+   dapl_log(DAPL_DBG_TYPE_ERR, 
+ " dapl_cma_connect: rdma_resolve_route ERR %d %s\n",
+ret, strerror(errno));
dapl_evd_connection_callback(conn, 
 IB_CME_LOCAL_FAILURE, 
 NULL, conn->ep);
@@ -146,8 +146,9 @@ static void dapli_route_resolve(struct dapl_cm_id *conn)
 
ret = rdma_connect(conn->cm_id, &conn->params);
if (ret) {
-   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_connect failed: %s\n",
-strerror(errno));
+   dapl_log(DAPL_DBG_TYPE_ERR, 
+ " dapl_cma_connect: rdma_connect ERR %d %s\n",
+ret, strerror(errno));
goto bail;
}
return;
@@ -310,12 +311,15 @@ static void dapli_cm_active_cb(struct dapl_cm_id
*conn,
case RDMA_CM_EVENT_UNREACHABLE:
case RDMA_CM_EVENT_CONNECT_ERROR:
{
-   dapl_dbg_log(
-DAPL_DBG_TYPE_WARN,
- dapli_cm_active_handler: CONN_ERR 
-

[ofa-general] [PATCH 4/4][v2] dapl: update vendor information for OFA v2 provider

2008-04-04 Thread Davis, Arlin R


Signed-off by: Arlin Davis [EMAIL PROTECTED]
---
 dapl/include/dapl_vendor.h |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/dapl/include/dapl_vendor.h b/dapl/include/dapl_vendor.h
index e87467a..f6d3cc0 100644
--- a/dapl/include/dapl_vendor.h
+++ b/dapl/include/dapl_vendor.h
@@ -52,14 +52,14 @@
  * Product name of the adapter.
  * Returned in DAT_IA_ATTR.adapter_name
  */
-#define VN_ADAPTER_NAME "Generic InfiniBand HCA"
+#define VN_ADAPTER_NAME "Generic OpenFabrics HCA"
 
 
 /*
  * Vendor name
  * Returned in DAT_IA_ATTR.vendor_name
  */
-#define VN_VENDOR_NAME "DAPL Reference Implementation"
+#define VN_VENDOR_NAME "DAPL OpenFabrics Implementation"
 
 
 /**
@@ -78,7 +78,7 @@
  * DAT_PROVIDER_ATTR.provider_version_minor
  */
 
-#define VN_PROVIDER_MAJOR  1
+#define VN_PROVIDER_MAJOR  2
 #define VN_PROVIDER_MINOR  0
 
 /*
-- 
1.5.2.5


[ofa-general] [PATCH 3/4][v2] dapl: add provider vendor revision data in private data with reject

2008-04-04 Thread Davis, Arlin R

Add 1 byte header containing provider/vendor major revision
to distinguish between consumer and non-consumer rejects.
Validate size of consumer reject private data.

Signed-off by: Arlin Davis [EMAIL PROTECTED]
---
 dapl/openib_cma/dapl_ib_cm.c   |   39 ---
 dapl/openib_cma/dapl_ib_util.h |2 +-
 2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_cm.c b/dapl/openib_cma/dapl_ib_cm.c
index 33f299d..dcdcc5b 100755
--- a/dapl/openib_cma/dapl_ib_cm.c
+++ b/dapl/openib_cma/dapl_ib_cm.c
@@ -45,6 +45,7 @@
 #include "dapl_cr_util.h"
 #include "dapl_name_service.h"
 #include "dapl_ib_util.h"
+#include "dapl_vendor.h"
 #include <sys/poll.h>
 #include <signal.h>
 #include <sys/socket.h>
@@ -79,6 +80,14 @@ static inline uint64_t cpu_to_be64(uint64_t x) { return x; }
 
 #define PORT_TO_SID(p) ntohs(p)
 
+/* private data header to validate consumer rejects versus abnormal events */
+struct dapl_pdata_hdr {
+   uint8_t  version;
+};
+static struct dapl_pdata_hdr pdata_hdr = {
+   .version = VN_PROVIDER_MAJOR 
+};
+
 static void dapli_addr_resolve(struct dapl_cm_id *conn)
 {
int ret;
@@ -900,6 +909,7 @@ dapls_ib_reject_connection(
IN const DAT_PVOID private_data)
 {
int ret;
+   int offset = sizeof(struct dapl_pdata_hdr);
 
dapl_dbg_log(DAPL_DBG_TYPE_CM,
 " reject(cm_handle %p reason %x)\n",
@@ -909,14 +919,29 @@ dapls_ib_reject_connection(
dapl_dbg_log(DAPL_DBG_TYPE_ERR,
 " reject: invalid handle: reason %d\n",
 reason);
-   return DAT_SUCCESS;
+   return DAT_ERROR(DAT_INVALID_HANDLE,DAT_INVALID_HANDLE_CR);
}
-
+
+if (private_data_size > 
+   dapls_ib_private_data_size(
+   NULL, IB_MAX_REJ_PDATA_SIZE, cm_handle->hca))
+   return DAT_ERROR(DAT_INVALID_PARAMETER, DAT_INVALID_ARG3);
+   
+   /* setup pdata_hdr and users data, in CR pdata buffer */
+   dapl_os_memcpy(cm_handle->p_data, &pdata_hdr, offset);
+   if (private_data_size)
+   dapl_os_memcpy(cm_handle->p_data+offset,
+  private_data, 
+  private_data_size);
+   
/*
- * Private data is needed so peer can determine real application 
-* reject from an abnormal application termination
+* Always some private data with reject so active peer can
+ * determine real application reject from an abnormal 
+* application termination
 */
-   ret = rdma_reject(cm_handle->cm_id, NULL, 0);
+   ret = rdma_reject(cm_handle->cm_id, 
+ cm_handle->p_data, 
+ offset+private_data_size);
 
dapli_destroy_conn(cm_handle);
return dapl_convert_errno(ret, "reject");
@@ -1005,7 +1030,7 @@ int dapls_ib_private_data_size(   IN DAPL_PRIVATE *prd_ptr,
 
 if (hca_ptr->ib_hca_handle->device->transport_type 
== IBV_TRANSPORT_IWARP)
-   return(IWARP_MAX_PDATA_SIZE);
+   return(IWARP_MAX_PDATA_SIZE-sizeof(struct dapl_pdata_hdr));
 
switch(conn_op) {
 
@@ -1016,7 +1041,7 @@ int dapls_ib_private_data_size(   IN DAPL_PRIVATE *prd_ptr,
size = IB_MAX_REP_PDATA_SIZE;
break;
case DAPL_PDATA_CONN_REJ:
-   size = IB_MAX_REJ_PDATA_SIZE;
+   size = IB_MAX_REJ_PDATA_SIZE-sizeof(struct dapl_pdata_hdr);
break;
case DAPL_PDATA_CONN_DREQ:
size = IB_MAX_DREQ_PDATA_SIZE;
diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h
index f35cb9d..370f3b1 100755
--- a/dapl/openib_cma/dapl_ib_util.h
+++ b/dapl/openib_cma/dapl_ib_util.h
@@ -181,7 +181,7 @@ struct dapl_cm_id {
struct rdma_conn_param  params;
DAT_SOCK_ADDR6  r_addr;
int p_len;
-   unsigned char   p_data[IB_MAX_DREP_PDATA_SIZE];
+   unsigned char   p_data[256]; /* dapl max private data size */
 };
 
 typedef struct dapl_cm_id  *dp_ib_cm_handle_t;
-- 
1.5.2.5


Re: [ofa-general] XmtDiscards

2008-04-04 Thread Bernd Schubert

On Fri, Apr 04, 2008 at 03:29:32PM -0700, Ira Weiny wrote:
> On Sat, 5 Apr 2008 00:12:39 +0200
> Bernd Schubert [EMAIL PROTECTED] wrote:
>
> > Hello,
> >
> > after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten
> > much better there, at least no further RcvSwRelayErrors, even when the
> > cluster is in idle state and so far also no SymbolErrors, which we also have
> > seen before.
> >
> > However, after I just started a lustre stress test on 50 clients (to a lustre
> > storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about
> > 9000 XmtDiscards within 30 minutes.
>
> Yea, those are bad.
>
> > Searching for this error I find "This is a symptom of congestion and may
> > require tweaking either HOQ or switch lifetime values."
> > Well, I have to admit I neither know what HOQ is, nor do I know how to tweak
> > it. I also have no idea how to set switch lifetime values.  I guess this
> > isn't related to the opensm timeout option, is it?
>
> Yes you should adjust these values.
>
> > Hmm, I just found a Cisco pdf describing how to set the lifetime on these
> > switches, but is this also possible on Flextronics switches?
>
> I don't know about the Vendor SMs but in opensm look for the following options
> in the opensm.opts file (Default path is: /var/cache/opensm):
>
> # The code of maximal time a packet can wait at the head of
> # transmission queue.
> # The actual time is 4.096usec * 2^head_of_queue_lifetime
> # The value 0x14 disables this mechanism
> head_of_queue_lifetime 0x12
>
> # The maximal time a packet can wait at the head of queue on
> # switch port connected to a CA or router port
> leaf_head_of_queue_lifetime 0x0c

Hmm, I first increased head_of_queue_lifetime to 0x13 and 
leaf_head_of_queue_lifetime to 0x20, but this didn't make the error 
go away. So I increased head_of_queue_lifetime to 0x15 and 
leaf_head_of_queue_lifetime to 0x50, but this made the fabric crash
entirely. On the node running the master opensm I got an endless stream
of messages like these:

Apr  5 01:35:03 pfs1n2 kernel: [705448.344542] NETDEV WATCHDOG: ib0: transmit timed out
Apr  5 01:35:03 pfs1n2 kernel: [705448.349814] ib0: transmit timeout: latency 411908 msecs
Apr  5 01:35:03 pfs1n2 kernel: [705448.355364] ib0: queue stopped 1, tx_head 441, tx_tail 377
Apr  5 01:35:04 pfs1n2 kernel: [705449.343495] NETDEV WATCHDOG: ib0: transmit timed out

The slave opensm also went into D-state and is not killable anymore :(

Seems I have to be very careful with these settings...


Thanks for your help,
Bernd

RE: [ofa-general] [PATCH 3/4][v2] dapl: add provider vendor revisiondata in private data with reject

2008-04-04 Thread Sean Hefty

> Add 1 byte header containing provider/vendor major revision
> to distinguish between consumer and non-consumer rejects.
> Validate size of consumer reject private data.

Not saying this is a bad idea, but doesn't it break the protocol with existing
DAPL?  It also shifts all of the existing private data off by a byte, which
could result in odd data alignment.
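
To make the byte shift concrete, here is a minimal sketch of the framing the patch describes, written as standalone C rather than against the DAPL sources. The names `pdata_hdr`, `build_reject_pdata` and `parse_reject_pdata` are hypothetical and exist only for this illustration; the buffer capacity is passed in by the caller instead of assuming any particular IB or iWARP limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One-byte header placed in front of the consumer's reject data,
 * modelled on struct dapl_pdata_hdr from the patch. */
struct pdata_hdr {
    uint8_t version;
};

/*
 * Hypothetical helper: pack header + consumer data into dst.
 * Returns the number of bytes to send, or -1 if the consumer data does
 * not fit once the header has taken its byte.
 */
int build_reject_pdata(uint8_t *dst, size_t dst_cap, uint8_t version,
                       const void *user, size_t user_len)
{
    const struct pdata_hdr hdr = { .version = version };
    const size_t offset = sizeof(hdr);

    assert(dst != NULL);
    if (dst_cap < offset || user_len > dst_cap - offset)
        return -1;

    memcpy(dst, &hdr, offset);
    if (user_len)
        memcpy(dst + offset, user, user_len);
    return (int)(offset + user_len);
}

/*
 * Hypothetical receive side: the reject counts as a consumer reject only
 * if the header is present and carries the expected version. On success
 * *user points just past the header, which is generally not 4- or 8-byte
 * aligned, so multi-byte fields should be copied out with memcpy rather
 * than read through a cast.
 */
int parse_reject_pdata(const uint8_t *src, size_t len, uint8_t expected,
                       const uint8_t **user, size_t *user_len)
{
    if (len < sizeof(struct pdata_hdr) || src[0] != expected)
        return -1;
    *user = src + sizeof(struct pdata_hdr);
    *user_len = len - sizeof(struct pdata_hdr);
    return 0;
}
```

A peer that does not know about the header would see the version byte as the first byte of consumer data, which is the compatibility concern raised above.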

- Sean


[ofa-general] Re: [PATCH] mmu notifier #v11

2008-04-04 Thread Andrea Arcangeli

On Fri, Apr 04, 2008 at 03:06:18PM -0700, Christoph Lameter wrote:
> Adds some comments. Still objectionable is the multiple ways of
> invalidating pages in #v11. Callout now has similar locking to emm.

range_begin exists because range_end is called after the page has
already been freed. invalidate_page is called _before_ the page is
freed but _after_ the pte has been zapped.

In short, when working with single pages it's a waste to block the
secondary-mmu page fault, because it costs nothing to call invalidate_page
before put_page. Not even GRU needs to do that.

For multiple-pte zapping, by contrast, we have to call range_end _after_
the pages are already freed, so that a single range_end call covers a
huge amount of address space. That is why we need a range_begin, for
example for the subsystems that do not use page pinning. When working with
single pages (try_to_unmap_one, do_wp_page), invalidate_page avoids
blocking the secondary mmu page fault, and is in turn faster.

Besides avoiding the need to serialize the secondary mmu page fault,
invalidate_page also reduces the overhead when the mmu notifiers are
disarmed (i.e. kvm not running).

[ofa-general] Re: [patch 01/10] emm: mm_lock: Lock a process against reclaim

2008-04-04 Thread Andrea Arcangeli

On Fri, Apr 04, 2008 at 04:12:42PM -0700, Jeremy Fitzhardinge wrote:
> I think you can break this if() down a bit:
>
>   if (!(vma->vm_file && vma->vm_file->f_mapping))
>   continue;

It makes no difference at runtime, coding style preferences are quite
subjective.

> So this is an O(n^2) algorithm to take the i_mmap_locks from low to high 
> order?  A comment would be nice.  And O(n^2)?  Ouch.  How often is it 
> called?

It's called a single time when the mmu notifier is registered. It's a
very slow path of course. Any other approach to reduce the complexity
would require memory allocations and it would require
mmu_notifier_register to return -ENOMEM failure. It didn't seem worth
it.

> And is it necessary to mush lock and unlock together?  Unlock ordering 
> doesn't matter, so you should just be able to have a much simpler loop, no?

That avoids duplicating .text. Originally they were separated. unlock
can't be a simpler loop because I didn't reserve vm_flags bitflags to
do a single O(N) loop for unlock. If you do malloc+fork+munmap two
vmas will point to the same anon-vma lock, that's why the unlock isn't
simpler unless I mark what I locked with a vm_flags bitflag.
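
As a rough user-space illustration of the trade-off described here (not the kernel code itself): each pass scans every entry for the smallest lock address strictly greater than the one handled last, so no memory is allocated, duplicates are handled once, and the cost is O(n^2) comparisons. The names `for_each_lock_ordered`, `lock_trace` and `trace_visit` are made up for this sketch, and a plain pointer array stands in for the vma list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Visit each distinct lock exactly once, in ascending address order,
 * without allocating memory.
 *
 * locks[] - per-vma lock pointers; may contain NULLs and duplicates
 * visit   - callback applied to each distinct lock (lock or unlock)
 * Returns the number of distinct locks visited.
 */
size_t for_each_lock_ordered(void *const locks[], size_t n,
                             void (*visit)(void *lock, void *ctx),
                             void *ctx)
{
    uintptr_t last = 0;                 /* nothing handled yet */
    size_t visited = 0;

    assert(locks != NULL || n == 0);
    for (;;) {
        uintptr_t next = UINTPTR_MAX;   /* sentinel: no candidate found */

        for (size_t i = 0; i < n; i++) {
            uintptr_t cur = (uintptr_t)locks[i];

            /* keep the smallest address above the previous one */
            if (cur != 0 && cur > last && cur < next)
                next = cur;
        }
        if (next == UINTPTR_MAX)
            break;
        last = next;
        visit((void *)next, ctx);
        visited++;
    }
    return visited;
}

/* Small recorder used to observe the visiting order. */
struct lock_trace {
    void *seen[8];
    size_t count;
};

void trace_visit(void *lock, void *ctx)
{
    struct lock_trace *t = ctx;

    if (t->count < 8)
        t->seen[t->count++] = lock;
}
```

Because two entries may point at the same lock, a single linear pass cannot unlock safely without some record of which locks were taken; repeating the ordered scan for unlock sidesteps that bookkeeping.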

RE: [ofa-general] XmtDiscards

2008-04-04 Thread Boris Shpolyansky

Bernd,

0x14 is the maximal value for HOQ lifetime, which effectively disables
the mechanism. I think you shouldn't exceed this value. 
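
For a feel of the numbers, the formula from the opensm.opts comment quoted in this thread (4.096 usec * 2^code) can be worked out with a few lines of C. This is only a sketch: `hoq_lifetime_ns` and `HOQ_CODE_DISABLED` are names invented here, and treating every code at or above 0x14 as "no finite timeout" is an assumption based on the comment that 0x14 disables the mechanism.

```c
#include <assert.h>
#include <stdint.h>

/* Per the opensm.opts comment, 0x14 disables the HOQ timeout. */
#define HOQ_CODE_DISABLED 0x14u

/*
 * Convert a head_of_queue_lifetime code to nanoseconds using
 * 4.096 usec * 2^code. Returns 0 when the code does not encode a finite
 * lifetime (assumed here for every code >= 0x14).
 */
uint64_t hoq_lifetime_ns(unsigned int code)
{
    if (code >= HOQ_CODE_DISABLED)
        return 0;
    assert(code < 20u);             /* keeps the shift far below 64 bits */
    return (uint64_t)4096 << code;  /* 4.096 usec is 4096 ns */
}
```

With these assumptions the default 0x12 comes out at 2^30 ns, roughly 1.07 s, the leaf default 0x0c at 2^24 ns, roughly 16.8 ms, and 0x13 at about 2.15 s.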


Boris


[ofa-general] Re: [patch 02/10] emm: notifier logic

2008-04-04 Thread Andrea Arcangeli

On Fri, Apr 04, 2008 at 03:30:50PM -0700, Christoph Lameter wrote:
> + mm_lock(mm);
> + e->next = mm->emm_notifier;
> + /*
> +  * The update to emm_notifier (e->next) must be visible
> +  * before the pointer becomes visible.
> +  * rcu_assign_pointer() does exactly what we need.
> +  */
> + rcu_assign_pointer(mm->emm_notifier, e);
> + mm_unlock(mm);

My mm_lock solution makes all rcu serialization an unnecessary
overhead so you should remove it like I already did in #v11. If it
wasn't the case, then mm_lock wouldn't be a definitive fix for the
race.

> + e = rcu_dereference(e->next);

Same here.
