Re: [ofa-general] Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
Jay Vosburgh wrote:
> Jeff Garzik <[EMAIL PROTECTED]> wrote:
>> Moni Shoua wrote:
>>> Jay Vosburgh wrote:
>>>> ACK patches 3 - 9. Roland, are you comfortable with the IB changes
>>>> in patches 1 and 2? Jeff, when Roland acks patches 1 and 2, please
>>>> apply all 9.
>>>> 	-J
>>> Hi Jeff,
>>> Roland acked the IPoIB patches. If you haven't done so already, can
>>> you please apply them? I'm not sure when 2.6.24 is going to open and
>>> I'm afraid of missing it.
>> hrm, I don't see them in my inbox for some reason. Can someone bounce
>> them to me? Or give me a git tree to pull from?
> Moni, can you repost the patch series to Jeff, and put the appropriate
> Acked-by lines in for myself (patches 3 - 8) and Roland (patches 1 and
> 2)? You can probably leave off the netdev and openfabrics lists, but
> cc me.
> 	-J
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]

Hi Jeff,
I don't see commits of the patches in
http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=summary
(I hope that I'm looking in the right place). Did you get them?

thanks
MoniS

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] question regarding umad_recv
Hi,

This is regarding the *umad_recv* function in the libibumad/src/umad.c
file. Is it not possible to receive only MADs of a specific type, GSI or
SMI? My impression is that if I have two separate threads to send and
receive, then I could send MADs to either QP 0 or QP 1 depending on
whether the MAD is SMI or GSI, but the receive side has no control over
this. Please suggest a workaround if there is one.

Thanks and Regards
sumit
[ofa-general] Re: [PATCHES] TX batching
J Hadi Salim <[EMAIL PROTECTED]> wrote on 10/08/2007 07:35:20 PM:

> I dont see something from Krishna's approach that i can take and reuse.
> This maybe because my old approaches have evolved from the same path.
> There is a long list, but as a sample: i used to do a lot more work
> while holding the queue lock, which i have now moved post queue lock;
> i dont have any special interfaces/tricks just for batching, i provide
> hints to the core of how much the driver can take, etc etc.
> I have offered Krishna co-authorship if he makes the IPOIB driver work
> on my patches; that offer still stands if he chooses to take it.

My feeling is that since the approaches are very different, it would be a
good idea to test the two for performance. Do you mind me doing that? Of
course, others and/or you are more than welcome to do the same. I had
sent a note to you yesterday about this, please let me know either way.

*** Previous mail ***

Hi Jamal,

If you don't mind, I am trying to run your approach vs mine to get some
results for comparison.

For starters, I am having issues with iperf when using your
infrastructure code with my IPoIB driver - about 100MB is sent and then
everything stops for some reason. The changes I made in the IPoIB driver
to support batching are to set BTX, set xmit_win, and dynamically reduce
xmit_win on every xmit and increase xmit_win on every xmit completion.
Is there anything else that is required from the driver?

thanks,

- KK
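KK's driver-side changes above boil down to simple window accounting: shrink the advertised window on each posted skb, grow it on each TX completion. A minimal standalone sketch of that bookkeeping (the struct and function names here are illustrative, not the actual batching patch API):

```c
#include <assert.h>

/* Hypothetical names: a sketch of the xmit_win accounting described
 * above, not the real batching patch interface. */
struct batch_dev {
	int xmit_win;	/* how many skbs the core may hand us in one batch */
};

/* Called once per posted skb: shrink the advertised window. */
void xmit_win_consume(struct batch_dev *dev)
{
	if (dev->xmit_win > 0)
		dev->xmit_win--;
}

/* Called per TX completion: a descriptor was reclaimed, grow the
 * window back toward the ring size. */
void xmit_win_replenish(struct batch_dev *dev, int max_win)
{
	if (dev->xmit_win < max_win)
		dev->xmit_win++;
}
```

The core would then read xmit_win before dequeueing a batch, which is the "hint of how much the driver can take" Jamal mentions.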
Re: [ofa-general] [PATCH 13 of 17]: add LRO support
Eli Cohen wrote:
>> Since you have posted the patch, I am asking you if it has any
>> negative influence on packet forwarding. I am not asking you to test
>> it or whether you tested it with forwarding.
>
> The answer is yes: since I do not recalculate the TCP checksum as I
> aggregate the SKBs, the kernel might forward the TCP segment as
> multiple IP packets but with a wrong TCP checksum (which is that of
> the first aggregated packet, not of the overall aggregated segment).

OK, thanks for this clarification.

Can you clarify if/how this patch is related to the "lro: Generic Large
Receive Offload for TCP traffic" RFC sent to netdev in August this year
(e.g. see http://lwn.net/Articles/244206)?

Assuming LRO is a --pure software-- optimization, what's the rationale
for putting its whole implementation in the ipoib driver rather than
dividing it into a general part implemented in the net core and a
per-driver part implemented in each device driver that wants to support
LRO (if such a second part is needed at all)?

If I am wrong and there is some LRO assistance from the ConnectX HW,
what is it doing?

Or.
[ofa-general] [PATCH 1/3] osm: QoS- bug in opening policy file
Fixing bug in opening QoS policy file

Signed-off-by: Yevgeny Kliteynik <[EMAIL PROTECTED]>
---
 opensm/opensm/osm_qos_parser.y |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y
index e0faaaf..8e9f282 100644
--- a/opensm/opensm/osm_qos_parser.y
+++ b/opensm/opensm/osm_qos_parser.y
@@ -50,6 +50,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <ctype.h>
+#include <errno.h>
 #include <sys/stat.h>
 #include <opensm/osm_opensm.h>
 #include <opensm/osm_qos_policy.h>
@@ -129,6 +130,7 @@ extern char * __qos_parser_text;
 extern void __qos_parser_error (char *s);
 extern int __qos_parser_lex (void);
 extern FILE * __qos_parser_in;
+extern int errno;
 
 #define RESET_BUFFER  __parser_tmp_struct_reset()
 
@@ -1750,13 +1752,13 @@ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn)
     osm_qos_policy_destroy(p_subn->p_qos_policy);
     p_subn->p_qos_policy = NULL;
 
-    if (!stat(p_subn->opt.qos_policy_file, &statbuf)) {
+    if (stat(p_subn->opt.qos_policy_file, &statbuf)) {
 
         if (strcmp(p_subn->opt.qos_policy_file, OSM_DEFAULT_QOS_POLICY_FILE)) {
             osm_log(p_qos_parser_osm_log, OSM_LOG_ERROR,
                     "osm_qos_parse_policy_file: ERR AC01: "
-                    "QoS policy file not found (%s)\n",
-                    p_subn->opt.qos_policy_file);
+                    "Failed opening QoS policy file %s - %s\n",
+                    p_subn->opt.qos_policy_file, strerror(errno));
             res = 1;
         }
     else
-- 
1.5.1.4
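The one-line inversion above is the whole bug: stat() returns 0 on success and non-zero on failure (setting errno), so the original `!stat(...)` test ran the error path when the file *was* found. A small standalone illustration of the corrected check (the helper name is hypothetical, not OpenSM code):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* stat() returns 0 on success and -1 on failure (setting errno), so
 * the error path must run when stat() is non-zero -- the opposite of
 * the "!stat(...)" test the patch removes. */
int qos_file_exists(const char *path)
{
	struct stat statbuf;

	if (stat(path, &statbuf)) {
		/* same message shape as the patched osm_log() call */
		fprintf(stderr, "Failed opening QoS policy file %s - %s\n",
			path, strerror(errno));
		return 0;
	}
	return 1;
}
```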
[ofa-general] [PATCH 2/3] osm: QoS - fixing memory leaks
Fixing a bunch of memory leaks and pointer mismatches in QoS.

Signed-off-by: Yevgeny Kliteynik <[EMAIL PROTECTED]>
---
 opensm/opensm/osm_qos_parser.l |   16 ++++++++++------
 opensm/opensm/osm_qos_parser.y |   15 ++++++++-------
 opensm/opensm/osm_qos_policy.c |   21 +++++++++++---------
 3 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/opensm/opensm/osm_qos_parser.l b/opensm/opensm/osm_qos_parser.l
index 0b096f8..60b2d1c 100644
--- a/opensm/opensm/osm_qos_parser.l
+++ b/opensm/opensm/osm_qos_parser.l
@@ -260,33 +260,41 @@ WHITE_DOTDOT_WHITE [ \t]*:[ \t]*
 
 -       { SAVE_POS;
-          __qos_parser_lval = strdup(__qos_parser_text);
           if (in_description || in_list_of_strings || in_single_string)
+          {
+              __qos_parser_lval = strdup(__qos_parser_text);
               return TK_TEXT;
+          }
           return TK_DASH;
         }
 
 :       { SAVE_POS;
-          __qos_parser_lval = strdup(__qos_parser_text);
           if (in_description || in_list_of_strings || in_single_string)
+          {
+              __qos_parser_lval = strdup(__qos_parser_text);
               return TK_TEXT;
+          }
           return TK_DOTDOT;
         }
 
 ,       { SAVE_POS;
-          __qos_parser_lval = strdup(__qos_parser_text);
           if (in_description)
+          {
+              __qos_parser_lval = strdup(__qos_parser_text);
               return TK_TEXT;
+          }
           return TK_COMMA;
         }
 
 \*      { SAVE_POS;
-          __qos_parser_lval = strdup(__qos_parser_text);
           if (in_description || in_list_of_strings || in_single_string)
+          {
+              __qos_parser_lval = strdup(__qos_parser_text);
               return TK_TEXT;
+          }
           return TK_ASTERISK;
         }

diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y
index 8e9f282..2405519 100644
--- a/opensm/opensm/osm_qos_parser.y
+++ b/opensm/opensm/osm_qos_parser.y
@@ -2105,15 +2105,15 @@ static void __sort_reduce_rangearr(
     unsigned last_valid_ind = 0;
     unsigned valid_cnt = 0;
     uint64_t ** res_arr;
-    boolean_t * is_valir_arr;
+    boolean_t * is_valid_arr;
 
     *p_res_arr = NULL;
     *p_res_arr_len = 0;
 
     qsort(arr, arr_len, sizeof(uint64_t*), __cmp_num_range);
 
-    is_valir_arr = (boolean_t *)malloc(arr_len * sizeof(boolean_t));
-    is_valir_arr[last_valid_ind] = TRUE;
+    is_valid_arr = (boolean_t *)malloc(arr_len * sizeof(boolean_t));
+    is_valid_arr[last_valid_ind] = TRUE;
     valid_cnt++;
 
     for (i = 1; i < arr_len; i++)
     {
@@ -2123,18 +2123,18 @@ static void __sort_reduce_rangearr(
             arr[last_valid_ind][1] = arr[i][1];
             free(arr[i]);
             arr[i] = NULL;
-            is_valir_arr[i] = FALSE;
+            is_valid_arr[i] = FALSE;
         }
         else if ((arr[i][0] - 1) == arr[last_valid_ind][1])
         {
             arr[last_valid_ind][1] = arr[i][1];
             free(arr[i]);
             arr[i] = NULL;
-            is_valir_arr[i] = FALSE;
+            is_valid_arr[i] = FALSE;
         }
         else
         {
-            is_valir_arr[i] = TRUE;
+            is_valid_arr[i] = TRUE;
             last_valid_ind = i;
             valid_cnt++;
         }
@@ -2143,9 +2143,10 @@ static void __sort_reduce_rangearr(
     res_arr = (uint64_t **)malloc(valid_cnt * sizeof(uint64_t *));
     for (i = 0; i < arr_len; i++)
     {
-        if (is_valir_arr[i])
+        if (is_valid_arr[i])
             res_arr[j++] = arr[i];
     }
+    free(is_valid_arr);
     free(arr);
 
     *p_res_arr = res_arr;

diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
index c84fb8b..51dd7b9 100644
--- a/opensm/opensm/osm_qos_policy.c
+++ b/opensm/opensm/osm_qos_policy.c
@@ -101,12 +101,6 @@ static void __free_single_element(void *p_element, void *context)
 	free(p_element);
 }
 
-static void __free_port_map_element(cl_map_item_t *p_element, void *context)
-{
-	if (p_element)
-		free(p_element);
-}
-
 /*** ***/
 
@@ -145,6 +139,9 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create()
 void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p)
 {
+	osm_qos_port_t * p_port;
+	osm_qos_port_t * p_old_port;
+
 	if (!p)
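The reduce step that the renamed is_valid_arr bookkeeping supports can be illustrated standalone: given ranges already sorted by lower bound, __sort_reduce_rangearr() merges each range that overlaps (or directly abuts, the `arr[i][0] - 1 == arr[last_valid_ind][1]` case) the previous surviving range. This sketch does the same without the OpenSM types (function name and in-place layout are illustrative):

```c
#include <stdint.h>

/* Merge presorted [low, high] ranges in place; ranges that overlap or
 * abut the previous surviving range extend it instead of surviving on
 * their own. Returns the number of surviving ranges. */
unsigned reduce_sorted_ranges(uint64_t arr[][2], unsigned n)
{
	unsigned i, last = 0;

	if (n == 0)
		return 0;
	for (i = 1; i < n; i++) {
		if (arr[i][0] <= arr[last][1] + 1) {
			/* overlapping or adjacent: extend previous range */
			if (arr[i][1] > arr[last][1])
				arr[last][1] = arr[i][1];
		} else {
			/* disjoint: start a new surviving range */
			arr[++last][0] = arr[i][0];
			arr[last][1] = arr[i][1];
		}
	}
	return last + 1;
}
```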
[ofa-general] [PATCH 3/3] osm: QoS - parsing port names
Added a CA-by-name hash to the QoS policy object; as port names are
parsed, this hash is used to locate the actual port that each name
refers to.
For now I prefer to keep this hash local, so it's part of the QoS policy
object. When the same parser is used for partitions too, this hash will
be moved to be part of the subnet object.

Signed-off-by: Yevgeny Kliteynik <[EMAIL PROTECTED]>
---
 opensm/include/opensm/osm_qos_policy.h |    3 +-
 opensm/opensm/osm_qos_parser.y         |   73 +++++++++++++++++++++----
 opensm/opensm/osm_qos_policy.c         |   36 +++++++++++----
 3 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h
index 30c2e6d..5c32896 100644
--- a/opensm/include/opensm/osm_qos_policy.h
+++ b/opensm/include/opensm/osm_qos_policy.h
@@ -49,6 +49,7 @@
 #include <iba/ib_types.h>
 #include <complib/cl_list.h>
+#include <opensm/st.h>
 #include <opensm/osm_port.h>
 #include <opensm/osm_partition.h>
 #include <opensm/osm_sa_path_record.h>
@@ -72,7 +73,6 @@ typedef struct _osm_qos_port_t {
 typedef struct _osm_qos_port_group_t {
 	char *name;			/* single string (this port group name) */
 	char *use;			/* single string (description) */
-	cl_list_t port_name_list;	/* list of port names (.../.../...) */
 	uint8_t node_types;		/* node types bitmask */
 	cl_qmap_t port_map;
 } osm_qos_port_group_t;
@@ -148,6 +148,7 @@ typedef struct _osm_qos_policy_t {
 	cl_list_t qos_match_rules;		/* list of osm_qos_match_rule_t */
 	osm_qos_level_t *p_default_qos_level;	/* default QoS level */
 	osm_subn_t *p_subn;			/* osm subnet object */
+	st_table * p_ca_hash;			/* hash of CAs by node description */
 } osm_qos_policy_t;
 
 /***/

diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y
index 2405519..cf342d3 100644
--- a/opensm/opensm/osm_qos_parser.y
+++ b/opensm/opensm/osm_qos_parser.y
@@ -603,23 +603,74 @@ port_group_use_start: TK_USE {
 port_group_port_name: port_group_port_name_start string_list {
                         /* 'port-name' in 'port-group' - any num of instances */
-                        cl_list_iterator_t list_iterator;
-                        char * tmp_str;
-
-                        list_iterator = cl_list_head(tmp_parser_struct.str_list);
-                        while( list_iterator != cl_list_end(tmp_parser_struct.str_list) )
+                        cl_list_iterator_t list_iterator;
+                        osm_node_t * p_node;
+                        osm_physp_t * p_physp;
+                        unsigned port_num;
+                        char * name_str;
+                        char * tmp_str;
+                        char * host_str;
+                        char * ca_str;
+                        char * port_str;
+                        char * node_desc = (char*)malloc(IB_NODE_DESCRIPTION_SIZE + 1);
+
+                        /* parsing port name strings */
+                        for (list_iterator = cl_list_head(tmp_parser_struct.str_list);
+                             list_iterator != cl_list_end(tmp_parser_struct.str_list);
+                             list_iterator = cl_list_next(list_iterator))
                         {
                             tmp_str = (char*)cl_list_obj(list_iterator);
+                            if (tmp_str && *tmp_str)
+                            {
+                                name_str = tmp_str;
+                                host_str = strtok(name_str, "/");
+                                ca_str = strtok(NULL, "/");
+                                port_str = strtok(NULL, "/");
 
-                            /*
-                             * TODO: parse port name strings
-                             */
+                                if (!host_str || !(*host_str) ||
+                                    !ca_str || !(*ca_str) ||
+                                    !port_str || !(*port_str) ||
+                                    (port_str[0] != 'p' && port_str[0] != 'P')) {
+                                    yyerror("illegal port name");
+                                    free(tmp_str);
+                                    free(node_desc);
+                                    cl_list_remove_all(&tmp_parser_struct.str_list);
+                                    return 1;
+                                }
+
+                                if (!(port_num = strtoul(&port_str[1], NULL, 0))) {
+                                    yyerror("illegal port number in port name");
+                                    free(tmp_str);
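The strtok()-based splitting the patch introduces can be shown standalone; this sketch (a hypothetical helper, no OpenSM types) parses the same "host/ca/P<num>" form and applies the same validity checks as the grammar action above:

```c
#include <stdlib.h>
#include <string.h>

/* Split "host/ca/P<num>" on '/' and read the port number after the
 * leading 'p'/'P'. Returns 0 on success; strtok() modifies buf, and
 * the out-pointers reference pieces of it. Illustrative only. */
int parse_port_name(char *buf, char **host, char **ca, unsigned *port)
{
	char *host_str = strtok(buf, "/");
	char *ca_str   = strtok(NULL, "/");
	char *port_str = strtok(NULL, "/");

	if (!host_str || !ca_str || !port_str ||
	    (port_str[0] != 'p' && port_str[0] != 'P'))
		return -1;
	*port = strtoul(&port_str[1], NULL, 0);
	if (!*port)
		return -1;	/* ports are numbered from 1 */
	*host = host_str;
	*ca = ca_str;
	return 0;
}
```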
[ofa-general] ofa_1_3_kernel 20071009-0200 daily build status
This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.22
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5

Failed:
Build failed on x86_64 with linux-2.6.9-22.ELsmp
Log:
Applying patch libiscsi_no_flush_to_2_6_9.patch
patching file drivers/scsi/libiscsi.c
Hunk #1 FAILED at 1225.
Hunk #2 succeeded at 1640 (offset 32 lines).
Hunk #3 FAILED at 1784.
2 out of 3 hunks FAILED -- rejects in file drivers/scsi/libiscsi.c
Patch libiscsi_no_flush_to_2_6_9.patch does not apply (enforce with -f)
Failed executing /usr/bin/quilt
--
Build failed on powerpc with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19'
make: *** [kernel] Error 2
--
Build failed on powerpc with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux
Re: [ofa-general] [PATCH 13 of 17]: add LRO support
> Can you clarify if/how this patch is related to the "lro: Generic
> Large Receive Offload for TCP traffic" RFC sent to netdev in August
> this year (e.g. see http://lwn.net/Articles/244206)?

I referred to the mtnic driver when I made this patch, which itself
referred to other code examples, possibly this one too.

> Assuming LRO is a --pure software-- optimization, what's the rationale
> for putting its whole implementation in the ipoib driver rather than
> dividing it into a general part implemented in the net core and a
> per-driver part implemented in each device driver that wants to
> support LRO (if such a second part is needed at all)?

It is a pure software optimization, but it relies on the HW to report
whether the checksum of the packet is valid in order for the packet to
be eligible for aggregation. I think it would be good, however, if the
kernel supported this generically and took it out of the specific
drivers.

> If I am wrong and there is some LRO assistance from the ConnectX HW,
> what is it doing?
>
> Or.
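Eli's earlier point about the stale checksum of an aggregated segment is a property of one's-complement arithmetic: the Internet checksum of concatenated payloads can be recomputed cheaply from the partial sums of the pieces (when the first piece has even length) rather than by re-walking every byte. A standalone sketch, not from the patch (helper names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit accumulator of 16-bit big-endian words (Internet checksum
 * style); an odd trailing byte is padded with zero. */
uint32_t csum_partial16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	return sum;
}

/* Fold carries back into 16 bits and complement. */
uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Folding the sum of the two partial sums gives the same checksum as summing the concatenation, which is what lets software LRO patch up the merged segment's TCP checksum instead of leaving the first packet's value in place.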
[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching
Hi Peter,

"Waskiewicz Jr, Peter P" <[EMAIL PROTECTED]> wrote on 10/09/2007 04:03:42 AM:

> > true, that needs some resolution. Heres a hand-waving thought:
> > Assuming all packets of a specific map end up in the same qdisc
> > queue, it seems feasible to ask the qdisc scheduler to give us enough
> > "packages" (ive seen people use that term to refer to packets) for
> > each hardware ring's available space. With the patches i posted, i do
> > that via dev->xmit_win that assumes only one view of the driver;
> > essentially a single ring. If that is doable, then it is up to the
> > driver to say "i have space for 5 in ring[0], 10 in ring[1], 0 in
> > ring[2]" based on what scheduling scheme the driver implements - the
> > dev->blist can stay the same. Its a handwave, so there may be issues
> > there and there could be better ways to handle this.
> > Note: The other issue that needs resolving that i raised earlier was
> > in regards to multiqueue running on multiple cpus servicing different
> > rings concurrently.
>
> I can see the qdisc being modified to send batches per queue_mapping.
> This shouldn't be too difficult, and if we had the xmit_win per queue
> (in the subqueue struct like Dave pointed out).

I hope my understanding of multiqueue is correct for this mail to make
sense :-)

Isn't it enough that the multiqueue+batching drivers handle skbs
belonging to different queues themselves, instead of the qdisc having to
figure that out? This will reduce costs for most skbs that are neither
batched nor sent to multiqueue devices.

Eg, the driver can keep processing skbs and put them on the correct
tx_queue as long as the mapping remains the same. If the mapping changes,
it posts the earlier skbs (with the correct lock) and then iterates over
the other skbs that have the next different mapping, and so on. (This is
required only if the driver is supposed to transmit >1 skb in one call,
otherwise it is not an issue.)

Alternatively, supporting drivers could return a different code on a
mapping change, like NETDEV_TX_MAPPING_CHANGED (for batching only), so
that qdisc_run() could retry. Would that work?

Secondly, having xmit_win per queue: would it help in the multiple-skb
case? Currently there is no way to tell the qdisc to dequeue skbs from a
particular band - it returns the skb from the highest priority band.

thanks,

- KK
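The per-driver grouping described above - post a run of same-mapping skbs, flush when queue_mapping changes - can be sketched outside the kernel like this (all names here are illustrative, not kernel API; flush is represented by a counter so the run boundaries are visible):

```c
#include <stddef.h>

/* Stand-in for sk_buff: only the field this sketch needs. */
struct fake_skb {
	int queue_mapping;
};

/* Walk the batch; each time queue_mapping changes (or the batch ends),
 * "flush" the run skbs[start..i-1] to ring skbs[start].queue_mapping
 * under that ring's lock. Returns the number of flushes, i.e. how many
 * lock-acquire/post cycles the driver would do for this batch. */
int batch_by_mapping(const struct fake_skb *skbs, size_t n)
{
	int flushes = 0;
	size_t start = 0, i;

	for (i = 1; i <= n; i++) {
		if (i == n ||
		    skbs[i].queue_mapping != skbs[start].queue_mapping) {
			/* post skbs[start..i-1] to its ring here */
			flushes++;
			start = i;
		}
	}
	return flushes;
}
```

A batch whose skbs all share one mapping costs a single flush, which is the "reduce costs for most skbs" argument in the mail.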
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Krishna Kumar2 <[EMAIL PROTECTED]>
Date: Tue, 9 Oct 2007 16:28:27 +0530

> Isn't it enough that the multiqueue+batching drivers handle skbs
> belonging to different queues themselves, instead of the qdisc having
> to figure that out? This will reduce costs for most skbs that are
> neither batched nor sent to multiqueue devices.
>
> Eg, the driver can keep processing skbs and put them on the correct
> tx_queue as long as the mapping remains the same. If the mapping
> changes, it posts the earlier skbs (with the correct lock) and then
> iterates over the other skbs that have the next different mapping, and
> so on.

The complexity in most of these suggestions is beginning to drive me a
bit crazy :-)

This should be the simplest thing in the world: when a TX queue has
space, give it packets. Period. When I hear suggestions like "have the
driver pick the queue in ->hard_start_xmit() and return some special
status if the queue becomes different", you know, I really begin to
wonder :-)

If we have to go back, get into the queueing layer locks, have these
special cases, and whatnot, what's the point?

This code should eventually be able to run lockless all the way to the
TX queue handling code of the driver. The queueing code should know what
TX queue the packet will be bound for, and always precisely invoke the
driver in a state where the driver can accept the packet.

Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it.
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Hi Dave,

David Miller <[EMAIL PROTECTED]> wrote on 10/09/2007 04:32:55 PM:

> > Isn't it enough that the multiqueue+batching drivers handle skbs
> > belonging to different queues themselves, instead of the qdisc having
> > to figure that out? This will reduce costs for most skbs that are
> > neither batched nor sent to multiqueue devices.
> >
> > Eg, the driver can keep processing skbs and put them on the correct
> > tx_queue as long as the mapping remains the same. If the mapping
> > changes, it posts the earlier skbs (with the correct lock) and then
> > iterates over the other skbs that have the next different mapping,
> > and so on.
>
> The complexity in most of these suggestions is beginning to drive me a
> bit crazy :-)
>
> This should be the simplest thing in the world, when TX queue has
> space, give it packets. Period. When I hear suggestions like "have the
> driver pick the queue in ->hard_start_xmit() and return some special
> status if the queue becomes different", you know, I really begin to
> wonder :-)
>
> If we have to go back, get into the queueing layer locks, have these
> special cases, and whatnot, what's the point?

I understand your point, but the qdisc code itself needs almost no
change, as small as:

qdisc_restart()
{
	...
	case NETDEV_TX_MAPPING_CHANGED:
		/*
		 * Driver sent some skbs from one mapping, and found others
		 * are for different queue_mapping. Try again.
		 */
		ret = 1;	/* guaranteed to have at least 1 skb in batch list */
		break;
	...
}

Alternatively, if the driver does all the dirty work, the qdisc needs no
change at all. However, I am not sure if this addresses all the concerns
raised by you, Peter, Jamal and others.

> This code should eventually be able to run lockless all the way to the
> TX queue handling code of the driver. The queueing code should know
> what TX queue the packet will be bound for, and always precisely invoke
> the driver in a state where the driver can accept the packet.

This sounds like a good idea :) I need to think more on this, esp. as my
batching sends multiple skbs of possibly different mappings to the
device, and those skbs stay in the batch list if the driver couldn't
send them out.

thanks,

- KK
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller <[EMAIL PROTECTED]> wrote on 10/09/2007 04:32:55 PM:

> Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it.

Great, this will make life easy. Any idea how long that would take? It
seems simple enough to do.

thanks,

- KK
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Krishna Kumar2 <[EMAIL PROTECTED]>
Date: Tue, 9 Oct 2007 16:51:14 +0530

> David Miller <[EMAIL PROTECTED]> wrote on 10/09/2007 04:32:55 PM:
> > Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
> > it.
>
> Great, this will make life easy. Any idea how long that would take?
> It seems simple enough to do.

I'd say we can probably try to get rid of it in 2.6.25, this is assuming
we get driver authors to cooperate and do the conversions, or
alternatively some other motivated person.

I can just threaten to do them all and that should get the driver
maintainers going :-)
Re: [ofa-general] question regarding umad_recv
On Tue, 2007-10-09 at 13:01 +0530, Sumit Gaur - Sun Microsystem wrote:
> Hi,
> This is regarding the *umad_recv* function of the libibumad/src/umad.c
> file. Is it not possible to receive only MADs of a specific type, GSI
> or SMI? My impression is that if I have two separate threads to send
> and receive, then I could send MADs to QP 0 or 1 depending on whether
> the MAD is SMI or GSI, but the receive side has no control over it.
> Please suggest a workaround if there is one.

See umad_register().

-- Hal

> Thanks and Regards
> sumit
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller wrote:
> From: Krishna Kumar2 <[EMAIL PROTECTED]>
> Date: Tue, 9 Oct 2007 16:51:14 +0530
>
> > David Miller <[EMAIL PROTECTED]> wrote on 10/09/2007 04:32:55 PM:
> > > Ignore LLTX, it sucks, it was a big mistake, and we will get rid
> > > of it.
> >
> > Great, this will make life easy. Any idea how long that would take?
> > It seems simple enough to do.
>
> I'd say we can probably try to get rid of it in 2.6.25, this is
> assuming we get driver authors to cooperate and do the conversions or
> alternatively some other motivated person.
>
> I can just threaten to do them all and that should get the driver
> maintainers going :-)

What, like this?  :)

	Jeff

 drivers/net/atl1/atl1_main.c   |   16 +---
 drivers/net/chelsio/cxgb2.c    |    1 -
 drivers/net/chelsio/sge.c      |   20 +---
 drivers/net/e1000/e1000_main.c |    6 +-
 drivers/net/ixgb/ixgb_main.c   |   24 --------
 drivers/net/pasemi_mac.c       |    2 +-
 drivers/net/rionet.c           |   19 +++
 drivers/net/spider_net.c       |    2 +-
 drivers/net/sungem.c           |   17 ++---
 drivers/net/tehuti.c           |   12 +---
 drivers/net/tehuti.h           |    3 +--
 11 files changed, 32 insertions(+), 90 deletions(-)

diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
index 4c728f1..03e94fe 100644
--- a/drivers/net/atl1/atl1_main.c
+++ b/drivers/net/atl1/atl1_main.c
@@ -1665,10 +1665,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	len -= skb->data_len;
 
-	if (unlikely(skb->len == 0)) {
-		dev_kfree_skb_any(skb);
-		return NETDEV_TX_OK;
-	}
+	WARN_ON(skb->len == 0);
 
 	param.data = 0;
 	param.tso.tsopu = 0;
@@ -1703,11 +1700,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		}
 	}
 
-	if (!spin_trylock_irqsave(&adapter->lock, flags)) {
-		/* Can't get lock - tell upper layer to requeue */
-		dev_printk(KERN_DEBUG, &adapter->pdev->dev, "tx locked\n");
-		return NETDEV_TX_LOCKED;
-	}
+	spin_lock_irqsave(&adapter->lock, flags);
 
 	if (atl1_tpd_avail(&adapter->tpd_ring) < count) {
 		/* not enough descriptors */
@@ -1749,8 +1742,11 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	atl1_tx_map(adapter, skb, 1 == val);
 	atl1_tx_queue(adapter, count, &param);
 	netdev->trans_start = jiffies;
+
 	spin_unlock_irqrestore(&adapter->lock, flags);
+
 	atl1_update_mailbox(adapter);
+
 	return NETDEV_TX_OK;
 }
@@ -2301,8 +2297,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 	 */
 	/* netdev->features |= NETIF_F_TSO; */
 
-	netdev->features |= NETIF_F_LLTX;
-
 	/*
 	 * patch for some L1 of old version,
 	 * the final version of L1 may not need these

diff --git a/drivers/net/chelsio/cxgb2.c b/drivers/net/chelsio/cxgb2.c
index 2dbf8dc..0aba7e7 100644
--- a/drivers/net/chelsio/cxgb2.c
+++ b/drivers/net/chelsio/cxgb2.c
@@ -1084,7 +1084,6 @@ static int __devinit init_one(struct pci_dev *pdev,
 		netdev->mem_end = mmio_start + mmio_len - 1;
 		netdev->priv = adapter;
 		netdev->features |= NETIF_F_SG | NETIF_F_IP_CSUM;
-		netdev->features |= NETIF_F_LLTX;
 
 		adapter->flags |= RX_CSUM_ENABLED | TCP_CSUM_CAPABLE;
 		if (pci_using_dac)

diff --git a/drivers/net/chelsio/sge.c b/drivers/net/chelsio/sge.c
index ffa7e64..84f5869 100644
--- a/drivers/net/chelsio/sge.c
+++ b/drivers/net/chelsio/sge.c
@@ -1739,8 +1739,7 @@ static int t1_sge_tx(struct sk_buff *skb, struct adapter *adapter,
 	struct cmdQ *q = &sge->cmdQ[qid];
 	unsigned int credits, pidx, genbit, count, use_sched_skb = 0;
 
-	if (!spin_trylock(&q->lock))
-		return NETDEV_TX_LOCKED;
+	spin_lock(&q->lock);
 
 	reclaim_completed_tx(sge, q);
@@ -1817,12 +1816,12 @@ use_sched:
 	}
 
 	if (use_sched_skb) {
-		if (spin_trylock(&q->lock)) {
-			credits = q->size - q->in_use;
-			skb = NULL;
-			goto use_sched;
-		}
+		spin_lock(&q->lock);
+		credits = q->size - q->in_use;
+		skb = NULL;
+		goto use_sched;
 	}
+
 	return NETDEV_TX_OK;
 }
@@ -1977,13 +1976,12 @@ static void sge_tx_reclaim_cb(unsigned long data)
 	for (i = 0; i < SGE_CMDQ_N; ++i) {
 		struct cmdQ *q = &sge->cmdQ[i];
 
-		if (!spin_trylock(&q->lock))
-			continue;
+		spin_lock(&q->lock);
 
 		reclaim_completed_tx(sge, q);
-		if (i == 0 && q->in_use) {	/* flush pending credits */
+		if (i == 0 && q->in_use)	/* flush pending credits */
 			writel(F_CMDQ0_ENABLE,
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote: David Miller wrote: I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :)

Awesome :)

--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Herbert Xu wrote: On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote: David Miller wrote: I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :) Awesome :)

Note my patch is just to get the maintainers going. :) I'm not going to commit that, since I don't have any way to test any of the drivers I touched (but I wouldn't scream if it appeared in net-2.6.24 either).

	Jeff
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, 2007-09-10 at 08:39 +0530, Krishna Kumar2 wrote: Driver might ask for 10 and we send 10, but LLTX driver might fail to get lock and return TX_LOCKED. I haven't seen your code in greater detail, but don't you requeue in that case too?

For other drivers that are non-batching and LLTX, it is possible - at the moment in my patch i whine that the driver is buggy. I will fix this up so it checks for NETIF_F_BTX. Thanks for pointing out the above use case.

cheers, jamal
[ofa-general] Re: [PATCHES] TX batching
On Tue, 2007-09-10 at 13:44 +0530, Krishna Kumar2 wrote: My feeling is that since the approaches are very different,

My concern is the approaches are different only for short periods of time. For example, I do requeueing, have xmit_win, have ->end_xmit, do batching from core etc; if you see value in any of these concepts, they will appear in your patches and this goes on in a loop. Perhaps what we need is a referee and to use our energies in something more positive.

it would be a good idea to test the two for performance.

Which i dont mind as long as it has an analysis that goes with it. If all you post is "here's what netperf showed", it is not useful at all. There are also a lot of affecting variables. For example, is the receiver a bottleneck? To make it worse, I could demonstrate to you that if i slowed down the driver and allowed more packets to queue up on the qdisc, batching will do well. In the past my feeling is you glossed over such details and i am a sucker for things like that - hence the conflict.

Do you mind me doing that? Of course others and/or you are more than welcome to do the same. I had sent a note to you yesterday about this, please let me know either way.

I responded to you - but it may have been lost in the noise; here's a copy: http://marc.info/?l=linux-netdev&m=119185137124008&w=2

cheers, jamal
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
On Mon, 8 Oct 2007, Steve Wise wrote: The correct solution, IMO, is to enhance the core low level 4-tuple allocation services to be more generic (eg: not be tied to a struct sock). Then the host tcp stack and the host rdma stack can allocate TCP/iWARP ports/4-tuples from this common exported service and share the port space. This allocation service could also be used by other deep adapters like iscsi adapters if needed.

As a developer of an RDMA ULP, NFS-RDMA, I like this approach because it will simplify the configuration of an RDMA device and the services that use it.
Re: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24
Roland Dreier wrote: No mention about the iwarp port space issue? I don't think we're at a stage where I'm prepared to merge something -- we all agree the latest patch has serious drawbacks, and it commits us to a suboptimal interface that is userspace-visible.

Fair enough. I'm at a loss as to how to proceed. Could we try to do some cleanups to the net core to make the alias stuff less painful? eg is there any sane way to make it possible for a device that creates 'eth0' to also create an 'iw0' alias without assigning an address?

Well, alias interfaces really don't exist. ethX:iw is really just adding an address record (struct in_ifaddr) to ethX. So in the current core design, adding an alias without an address is really adding the alias with address 0.0.0.0. And I think the core net code assumes that if an in_ifaddr struct exists for a device, then its IP address is indeed valid. So I think the changes wouldn't be small to enhance the design to add a concept of an alias interface. I'll look into this more though.

Steve.
Re: [ofa-general] Re: parallel networking
At 06:53 PM 10/8/2007, Jeff Garzik wrote: David Miller wrote: From: Jeff Garzik [EMAIL PROTECTED] Date: Mon, 08 Oct 2007 10:22:28 -0400 In terms of overall parallelization, both for TX as well as RX, my gut feeling is that we want to move towards an MSI-X, multi-core friendly model where packets are LIKELY to be sent and received by the same set of [cpus | cores | packages | nodes] that the [userland] processes dealing with the data.

The problem is that the packet schedulers want global guarantees on packet ordering, not flow centric ones. That is the issue Jamal is concerned about.

Oh, absolutely. I think, fundamentally, any amount of cross-flow resource management done in software is an obstacle to concurrency. That's not a value judgement, just a statement of fact.

Correct. Traffic cops are intentional bottlenecks we add to the process, to enable features like priority flows, filtering, or even simple socket fairness guarantees. Each of those bottlenecks serves a valid purpose, but at the end of the day, it's still a bottleneck. So, improving concurrency may require turning off useful features that nonetheless hurt concurrency.

Software needs to get out of the main data path - another fact of life.

The more I think about it, the more inevitable it seems that we really might need multiple qdiscs, one for each TX queue, to pull this full parallelization off. But the semantics of that don't smell so nice either. If the user attaches a new qdisc to ethN, does it go to all the TX queues, or what? All of the traffic shaping technology deals with the device as a unary object. It doesn't fit multi-queue at all.

Well, the easy solutions to networking concurrency are
* use virtualization to carve up the machine into chunks
* use multiple net devices
Since new NIC hardware is actively trying to be friendly to multi-channel/virt scenarios, either of these is reasonably straightforward given the current state of the Linux net stack. Using multiple net devices is especially attractive because it works very well with the existing packet scheduling. Both unfortunately impose a burden on the developer and admin, to force their apps to distribute flows across multiple [VMs | net devs]. Not the most optimal approach.

The third alternative is to use a single net device, with SMP-friendly packet scheduling. Here you run into the problems you described ("device as a unary object", etc.) with the current infrastructure. With multiple TX rings, consider that we are pushing the packet scheduling from software to hardware... which implies
* hardware-specific packet scheduling
* some TC/shaping features not available, because hardware doesn't support it

For a number of years now, we have designed interconnects to support a reasonable range of arbitration capabilities among hardware resource sets. With reasonable classification by software to identify hardware resource sets (ideally interpretation of the application's view of its priority combined with policy management software that determines how that should map among competing application views), one can eliminate most of the CPU cycles spent in today's implementations. I and others presented a number of these concepts many years ago during the development which eventually led to IB and iWARP.
- Each resource set can be assigned to a unique PCIe function or a function group to enable function / group arbitration to the PCIe link.
- Each resource set can be assigned to a unique PCIe TC and with improved ordering hints (coming soon) can be used to eliminate false ordering dependencies.
- Each resource set can be assigned to a unique IB TC / SL or iWARP 802.1p to signal priority. These can then be used to program respective link arbitration as well as path selection to enable multi-path load balancing.
- Many IHVs have picked up on the arbitration capabilities and extended them as shown years ago by a number of us to enable resource set arbitration and a variety of QoS based policies.

If software defines a reasonable (i.e. small) number of management and control knobs, then these can be easily mapped to most h/w implementations. Some of us are working on how to do this for virtualized environments and I expect these to be applicable to all environments in the end.

One other key item to keep in mind is that unless there is contention in the system, the majority of the QoS mechanisms are meaningless and in a very large percentage of customer environments, they simply don't scale with device and interconnect performance. Many applications in fact remain processor / memory constrained and therefore do not stress the I/O subsystem or the external interconnects, making most of the software mechanisms rather moot in real customer environments. Simple truth is it is nearly always cheaper to over-provision the I/O / interconnects than to use the software approach which while quite
[ofa-general] SDP ?
Hi all, I'm working on porting SDP to OpenSolaris and am looking at a compile error that I get. Essentially, I have a conflict of types on the compile:

bash-3.00$ /opt/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I.. -g -D_POSIX_PTHREAD_SEMANTICS -DSYSCONFDIR=\"/usr/local/etc\" -g -D_POSIX_PTHREAD_SEMANTICS -c port.c -KPIC -DPIC -o .libs/port.o
"port.c", line 1896: identifier redeclared: getsockname
	current : function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to unsigned int) returning int
	previous: function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to void) returning int : /usr/include/sys/socket.h, line 436

Line 436 in /usr/include/sys/socket.h:

	extern int getsockname(int, struct sockaddr *_RESTRICT_KYWD, Psocklen_t);

and Psocklen_t:

	#if defined(_XPG4_2) || defined(_BOOT)
	typedef socklen_t	*_RESTRICT_KYWD Psocklen_t;
	#else
	typedef void		*_RESTRICT_KYWD Psocklen_t;
	#endif	/* defined(_XPG4_2) || defined(_BOOT) */

Do I need to change port.c getsockname to type void * ?

Thanks, Jim
Re: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24
Roland, I submitted an updated patch incorporating some of Sean's comments within a day or two. The rest of the comments pertained to restructuring the code and adding some additional module parameters. This would require more discussion, since some of these had already been discussed previously. We had decided upon this code structure after a lot of discussions, and incorporating these would be undoing some of that.

Can you give a link to your current final version of the patch?

Sean, what's your opinion of where we stand? Since module parameters create a userspace-visible interface that we are stuck with for a long time, we definitely have to get at least that much right before merging.

- R.
Re: [ofa-general] librdmacm feature request
It shouldn't be too hard. Assuming you handle the modify channel as a synchronous action, the thread calling modify channel can't also be in rdma_get_cm_event at the same time. So, if you get there and someone is blocking on that channel and just hasn't been scheduled to run yet, then leave the event where it is while you switch the channel and send new events to the new channel. If they aren't, then move any pending events to the new channel as you do the change.

Hmm, how do you move events? Keep in mind that there may be an arbitrary number of pending events that belong to other cm_ids that are queued before the events you want to move. And you can't really do anything too funky with the event channel fd, because you don't want to mess up some other thread that might be waiting for events in poll() or whatever.

- R.
[ofa-general] [PATCH] core: Check that the function reg_phys_mr is not NULL before executing it
Check that the function reg_phys_mr is not NULL before executing it. There are devices (for example, mlx4) whose low-level driver doesn't support this verb, so this patch will prevent a kernel oops on them.

Signed-off-by: Dotan Barak [EMAIL PROTECTED]
---

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 86ed8af..e2d54cb 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -672,6 +672,9 @@ struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
 {
 	struct ib_mr *mr;

+	if (!pd->device->reg_phys_mr)
+		return ERR_PTR(-ENOSYS);
+
 	mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf,
 				     mr_access_flags, iova_start);
[ofa-general] RLIMIT_MEMLOCK
We have run into this problem with using mpiexec. SLES 10 is on the cluster, and we have set the limits under /etc/security/limits.conf and they work there; even mpirun commands work fine, but when tying them all together using mpiexec it still comes back with the 32K memory limit. Any and all users can log in, and in bash `ulimit -a` and in tcsh `limit` both report the correct full memory limits, but when using mpiexec under both shells they get the 32K limit.

Any suggestions?

thanks
--
Adam Miller
The College of William and Mary
Virginia Institute of Marine Science
-Infrastructure Services Architect-
-Information Technology and Networking Services-
Watermens Hall
Mail: P.O. Box 1346
Deliveries: Route 1208, Greate Road
Gloucester Point, VA 23062-1346, USA
p(804)684-7077 f(804)684-7097
email: [EMAIL PROTECTED] email cell: [EMAIL PROTECTED]
[ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3
Steve Wise wrote: Vlad/Tziporet, Can you please pull version 1.0.3 of libcxgb3 for inclusion in ofed-1.2.5 and ofed-1.3? It contains a bug fix for older kernels like RHEL4U4. You can use the master branch for both releases: git://git.openfabrics.org/~swise/libcxgb3.git master Also, please update the spec file you're using to reflect the release (1.0.3). The spec file in the libcxgb3 git tree should be correct. Thanks, Steve.

Done,

Regards, Vladimir
[ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3
Thanks Vlad, Can you crank out an ofed-1.2.5 development build too?

Thanks, Steve.

Vladimir Sokolovsky wrote: Steve Wise wrote: Vlad/Tziporet, Can you please pull version 1.0.3 of libcxgb3 for inclusion in ofed-1.2.5 and ofed-1.3? It contains a bug fix for older kernels like RHEL4U4. You can use the master branch for both releases: git://git.openfabrics.org/~swise/libcxgb3.git master Also, please update the spec file you're using to reflect the release (1.0.3). The spec file in the libcxgb3 git tree should be correct. Thanks, Steve. Done, Regards, Vladimir
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller [EMAIL PROTECTED] writes: 2) Switch the default qdisc away from pfifo_fast to a new DRR fifo with load balancing using the code in #1. I think this is kind of in the territory of what Peter said he is working on.

Hopefully that new qdisc will just use the TX rings of the hardware directly. They are typically large enough these days. That might avoid some locking in this critical path.

I know this is controversial, but realistically I doubt users benefit at all from the prioritization that pfifo provides.

I agree. For most interfaces the priority is probably dubious. Even for DSL the prioritization will likely be done in a router these days. Also, for the fast interfaces where we do TSO, priority doesn't work very well anyways -- with large packets there is not too much to prioritize.

3) Work on discovering a way to make the locking on transmit as localized to the current thread of execution as possible. Things like RCU and statistic replication, techniques we use widely elsewhere in the stack, begin to come to mind.

If the data is just passed on to the hardware queue, why is any locking needed at all? (except for the driver locking of course)

-Andi
Re: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24
Can you give a link to your current final version of the patch? Sean, what's your opinion of where we stand? Let me look back over the last version that was sent and reply back later today or tomorrow. Several of my initial comments were on code structure. Since module parameters create a userspace-visible interface that we are stuck with for a long time, we definitely have to get at least that much right before merging. I was taking a slightly different view of the design. It would be nice to agree on whether SRQ should be separated from the QP type before merging upstream, even if the implementation doesn't immediately support all available options. - Sean
[ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3
Steve Wise wrote: Thanks Vlad, Can you crank out an ofed-1.2.5 development build too? Thanks, Steve.

Done: http://www.openfabrics.org/builds/connectx/OFED-1.2.5-20071009-0955.tgz

Regards, Vladimir
[ofa-general] [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests
Deadlock condition reported by Kanoj Sarcar [EMAIL PROTECTED].

The deadlock occurs when a connection request arrives at the same time that a wildcard listen is being destroyed.

A wildcard listen maintains per device listen requests for each RDMA device in the system. The per device listens are automatically added and removed when RDMA devices are inserted or removed from the system. When a wildcard listen is destroyed, rdma_destroy_id() acquires the rdma_cm's device mutex ('lock') to protect against hot-plug events adding or removing per device listens. It then tries to destroy the per device listens by calling ib_destroy_cm_id() or iw_destroy_cm_id(). It does this while holding the device mutex.

However, if the underlying iw/ib CM reports a connection request while this is occurring, the rdma_cm callback function will try to acquire the same device mutex. Since we're in a callback, the ib_destroy_cm_id() or iw_destroy_cm_id() calls will block until their callback thread returns, but the callback is blocked waiting for the device mutex.

Fix this by re-working how per device listens are destroyed. Use rdma_destroy_id(), which avoids the deadlock, in place of cma_destroy_listen(). Additional synchronization is added to handle device hot-plug events and ensure that the id is not destroyed twice.

Signed-off-by: Sean Hefty [EMAIL PROTECTED]
---
Fix from discussion started at:
http://lists.openfabrics.org/pipermail/general/2007-October/041456.html

Kanoj, please verify that this fix looks correct and works for you, and I will queue it for 2.6.24.

 drivers/infiniband/core/cma.c |   70 +
 1 files changed, 23 insertions(+), 47 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9ffb998..21ea92c 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -113,11 +113,12 @@ struct rdma_id_private {
 	struct rdma_bind_list	*bind_list;
 	struct hlist_node	node;
-	struct list_head	list;
-	struct list_head	listen_list;
+	struct list_head	list; /* listen_any_list or cma_device.list */
+	struct list_head	listen_list; /* per device listens */
 	struct cma_device	*cma_dev;
 	struct list_head	mc_list;

+	int			internal_id;
 	enum cma_state		state;
 	spinlock_t		lock;
 	struct completion	comp;
@@ -715,50 +716,27 @@ static void cma_cancel_route(struct rdma_id_private *id_priv)
 	}
 }

-static inline int cma_internal_listen(struct rdma_id_private *id_priv)
-{
-	return (id_priv->state == CMA_LISTEN) && id_priv->cma_dev &&
-	       cma_any_addr(&id_priv->id.route.addr.src_addr);
-}
-
-static void cma_destroy_listen(struct rdma_id_private *id_priv)
-{
-	cma_exch(id_priv, CMA_DESTROYING);
-
-	if (id_priv->cma_dev) {
-		switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
-		case RDMA_TRANSPORT_IB:
-			if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
-				ib_destroy_cm_id(id_priv->cm_id.ib);
-			break;
-		case RDMA_TRANSPORT_IWARP:
-			if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw))
-				iw_destroy_cm_id(id_priv->cm_id.iw);
-			break;
-		default:
-			break;
-		}
-		cma_detach_from_dev(id_priv);
-	}
-	list_del(&id_priv->listen_list);
-
-	cma_deref_id(id_priv);
-	wait_for_completion(&id_priv->comp);
-
-	kfree(id_priv);
-}
-
 static void cma_cancel_listens(struct rdma_id_private *id_priv)
 {
 	struct rdma_id_private *dev_id_priv;

+	/*
+	 * Remove from listen_any_list to prevent added devices from spawning
+	 * additional listen requests.
+	 */
 	mutex_lock(&lock);
 	list_del(&id_priv->list);

 	while (!list_empty(&id_priv->listen_list)) {
 		dev_id_priv = list_entry(id_priv->listen_list.next,
 					 struct rdma_id_private, listen_list);
-		cma_destroy_listen(dev_id_priv);
+		/* sync with device removal to avoid duplicate destruction */
+		list_del_init(&dev_id_priv->list);
+		list_del(&dev_id_priv->listen_list);
+		mutex_unlock(&lock);
+
+		rdma_destroy_id(&dev_id_priv->id);
+		mutex_lock(&lock);
 	}
 	mutex_unlock(&lock);
 }
@@ -846,6 +824,9 @@ void rdma_destroy_id(struct rdma_cm_id *id)
 	cma_deref_id(id_priv);
 	wait_for_completion(&id_priv->comp);

+	if (id_priv->internal_id)
+		cma_deref_id(id_priv->id.context);
+
 	kfree(id_priv->id.route.path_rec);
 	kfree(id_priv);
 }
@@ -1401,14 +1382,13 @@ static void cma_listen_on_dev(struct rdma_id_private
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On 09 Oct 2007 18:51:51 +0200 Andi Kleen [EMAIL PROTECTED] wrote: David Miller [EMAIL PROTECTED] writes: 2) Switch the default qdisc away from pfifo_fast to a new DRR fifo with load balancing using the code in #1. I think this is kind of in the territory of what Peter said he is working on. Hopefully that new qdisc will just use the TX rings of the hardware directly. They are typically large enough these days. That might avoid some locking in this critical path. I know this is controversial, but realistically I doubt users benefit at all from the prioritization that pfifo provides. I agree. For most interfaces the priority is probably dubious. Even for DSL the prioritization will likely be done in a router these days. Also, for the fast interfaces where we do TSO, priority doesn't work very well anyways -- with large packets there is not too much to prioritize. 3) Work on discovering a way to make the locking on transmit as localized to the current thread of execution as possible. Things like RCU and statistic replication, techniques we use widely elsewhere in the stack, begin to come to mind. If the data is just passed on to the hardware queue, why is any locking needed at all? (except for the driver locking of course) -Andi

I wonder about the whole idea of queueing in general at such high speeds. Given the normal bi-modal distribution of packets, and the predominance of 1500 byte MTU, does it make sense to even have any queueing in software at all?

--
Stephen Hemminger [EMAIL PROTECTED]
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
I wonder about the whole idea of queueing in general at such high speeds. Given the normal bi-modal distribution of packets, and the predominance of 1500 byte MTU, does it make sense to even have any queueing in software at all?

Yes, that is my point -- it should just pass it through directly, and the driver can then put it into the different per-CPU (or per whatever) queues managed by the hardware. The only thing the qdisc needs to do is to set some bit that says it is ok to put this into different queues; it doesn't need strict ordering. Otherwise, if the drivers did that unconditionally, they might cause problems with other qdiscs.

This would also require that the driver exports some hint to the upper layer on how large its internal queues are. A device with a short queue would still require pfifo_fast. Long queue devices could just pass through. That again could be a single flag.

-Andi
[ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests
Sean, I will take a look at your code changes and comment, and hopefully be able to run a quick test on your patch within this week.

Just so I understand, did you discover problems (maybe preexisting race conditions) with my previously posted patch? If yes, please point them out, so it's easier to review yours; if not, I will assume your patch implements a better locking scheme and review it as such.

Thanks.

Kanoj

Sean Hefty wrote: Deadlock condition reported by Kanoj Sarcar [EMAIL PROTECTED]. The deadlock occurs when a connection request arrives at the same time that a wildcard listen is being destroyed.
Re: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24
Roland Dreier wrote: Roland, I submitted an updated patch incorporating some of Sean's comments within a day or two. Rest of comments pertained to restructuring the code and adding some additional module parameters. This would require more discussions since some of these had been already discussed previously. We had decided upon this code structure after a lot of discussions and incorporating these would be undoing some of that. Can you give a link to your current final version of the patch?

Roland, This is the link to the last one that I submitted on 09/18.
http://lists.openfabrics.org/pipermail/general/2007-September/040917.html

Pradeep
[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching
> IMO the net driver really should provide a hint as to what it wants.
>
> 8139cp and tg3 would probably prefer multiple TX queue behavior to
> match silicon behavior -- strict prio.

If I understand what you just said, I disagree.  If your hardware is
running strict prio, you don't want to enforce strict prio in the qdisc
layer; performing two layers of QoS is excessive, and may lead to
results you don't want.

The reason I added the DRR qdisc is for the Si that has its own queueing
strategy that is not RR.  For Si that implements RR (like e1000), you
can either use the DRR qdisc, or, if you want to prioritize your flows,
use PRIO.

-PJ Waskiewicz [EMAIL PROTECTED]
Re: [ofa-general] Has libmlx4 been released?
> Looking at git://git.kernel.org/pub/scm/libs/infiniband/libmlx4.git
> I don't see any tags or branches.

That's right, I haven't made any real release yet.

> If not, when is the initial release planned?

Soon I guess.  I don't know of any outstanding issues so it's just a
matter of doing a release.

 - R.
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Waskiewicz Jr, Peter P wrote:
> > IMO the net driver really should provide a hint as to what it wants.
> >
> > 8139cp and tg3 would probably prefer multiple TX queue behavior to
> > match silicon behavior -- strict prio.
>
> If I understand what you just said, I disagree.  If your hardware is
> running strict prio, you don't want to enforce strict prio in the
> qdisc layer; performing two layers of QoS is excessive, and may lead
> to results you don't want.
>
> The reason I added the DRR qdisc is for the Si that has its own
> queueing strategy that is not RR.  For Si that implements RR (like
> e1000), you can either use the DRR qdisc, or if you want to
> prioritize your flows, use PRIO.

A misunderstanding, I think.  To my brain, DaveM's item #2 seemed to
assume/require the NIC hardware to balance fairly across hw TX rings,
which seemed to preclude the 8139cp/tg3 style of strict-prio hardware.
That's what I was responding to.

As long as there is some modular way to fit 8139cp/tg3 style multi-TX
into our universe, I'm happy :)

	Jeff
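For readers following the thread: the deficit round robin (DRR) idea mentioned above can be sketched in a few lines of C. This is a toy model with made-up types and an arbitrary quantum, not the kernel's actual DRR qdisc implementation; each "flow" is just an array of queued packet lengths.

```c
#include <stddef.h>

/*
 * Toy deficit round robin: each round adds a quantum of byte credit to
 * every backlogged flow, then dequeues packets while credit lasts.  A
 * flow that goes idle forfeits its accumulated credit.
 */
#define QUANTUM 500

struct flow {
	const int *pkt_len;	/* lengths of queued packets */
	int count;		/* total packets queued */
	int next;		/* index of the packet at the head */
	int deficit;		/* accumulated credit in bytes */
};

/* Serve one DRR round; record which flow sent each packet in order[]. */
static int drr_round(struct flow *flows, int nflows, int *order, int max)
{
	int sent = 0;

	for (int i = 0; i < nflows; i++) {
		struct flow *f = &flows[i];

		if (f->next >= f->count) {
			f->deficit = 0;	/* idle flows keep no credit */
			continue;
		}
		f->deficit += QUANTUM;
		while (f->next < f->count &&
		       f->pkt_len[f->next] <= f->deficit) {
			f->deficit -= f->pkt_len[f->next++];
			if (sent < max)
				order[sent] = i;
			sent++;
		}
	}
	return sent;
}
```

With 300-byte packets on one flow and 400-byte packets on another, each round interleaves the two flows in rough proportion to the quantum, which is the fairness property the qdisc discussion above is about.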
Re: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests
> Just so I understand, did you discover problems (maybe preexisting
> race conditions) with my previously posted patch?  If yes, please
> point it out, so its easier to review yours; if not, I will assume
> your patch implements a better locking scheme and review it as such.

I tried to explain the issue somewhat in my commit message and code
comments.  The issue is synchronizing cleanup of the listen_list with
device removal.

When an RDMA device is added to the system, a new listen request is
added for all wildcard listens.  Since the original locking held the
mutex throughout the cleanup of the listen list, it prevented adding
another listen request during that same time.  Similar protection was
there for handling device removal.  When a device is removed from the
system, all internal listen requests associated with that device are
destroyed.  If the associated wildcard listen is also being destroyed,
we need to ensure that we don't try to destroy the same listen twice.

My patch, like yours, ends up releasing the mutex while cleaning up the
listen_list.  I chose to eliminate the cma_destroy_listen() call and
use rdma_destroy_id() as the single destruction path instead.  This
keeps the locking contained to a single function.  (I don't like
acquiring a lock in one call and releasing it in another; it puts too
much assumption on the caller.)

What was missing was ensuring that a device removal didn't try to
destroy the same listen request.  This is handled by adding the
list_del*() calls to cma_cancel_listens().  Whichever thread removes
the listening id from the device list is responsible for its
destruction.  And because that thread could be the device removal
thread, I added a reference from the per device listen to the wildcard
listen.

Hopefully this makes sense.

- Sean
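The pattern Sean describes -- unlink an entry under the mutex, let the unlinking thread own its destruction, and drop the mutex around the destruction itself -- can be sketched with toy types. Nothing below is the actual cma.c code; the struct, list, and destroy function are stand-ins for illustration only.

```c
#include <pthread.h>

/*
 * Toy version of the "whoever unlinks it destroys it" cleanup loop.
 * Destruction may sleep (rdma_destroy_id() can), so the mutex is
 * released around it and re-taken before looking at the list again.
 */
struct listen_entry {
	struct listen_entry *next;
	int destroyed;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void destroy_entry(struct listen_entry *e)
{
	e->destroyed = 1;	/* stands in for rdma_destroy_id() */
}

static int cancel_listens(struct listen_entry **head)
{
	int n = 0;

	pthread_mutex_lock(&lock);
	while (*head) {
		struct listen_entry *e = *head;

		*head = e->next;	/* unlink: we now own destruction */
		pthread_mutex_unlock(&lock);

		destroy_entry(e);	/* may sleep; must not hold lock */
		n++;

		pthread_mutex_lock(&lock);
	}
	pthread_mutex_unlock(&lock);
	return n;
}
```

Because the entry is off the list before the mutex is dropped, a concurrent remover that re-takes the mutex can never find (and double-destroy) the same entry -- the property the patch is after.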
[ofa-general] [PATCH] IB/ipath -- patches for 2.6.24
hi roland, here is our current batch of patches. i realize that they are a bit later than you would probably like, i'm sorry about that -- i hope they are straightforward enough to make it into your for-2.6.24 branch. these patches can be git pulled from: git://git.qlogic.com/ipath-linux-2.6 for-roland arthur ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH 01/23] IB/ipath -- iba6110 rev4 GPIO counters support
On iba6110 rev4, support for three more IB counters was added: the
LocalLinkIntegrityError counter, the ExcessiveBufferOverrunErrors
counter, and error counting of flow control packets on an invalid VL.
These counters trigger GPIO interrupts, and the sw keeps track of the
counts.  Since we also use GPIO interrupts to signal packet reception,
we need to turn off the fast interrupts, or we risk losing a GPIO
interrupt.

Signed-off-by: Arthur Jones [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_iba6110.c |    8 ++++++++
 drivers/infiniband/hw/ipath/ipath_intr.c    |    4 ++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c
index 650745d..e1c5998 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6110.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c
@@ -1559,6 +1559,14 @@ static int ipath_ht_early_init(struct ipath_devdata *dd)
 		ipath_dev_err(dd, "Unsupported InfiniPath serial number %.16s!\n",
 			      dd->ipath_serial);

+	if (dd->ipath_minrev >= 4) {
+		/* Rev4+ reports extra errors via internal GPIO pins */
+		dd->ipath_flags |= IPATH_GPIO_ERRINTRS;
+		dd->ipath_gpio_mask |= IPATH_GPIO_ERRINTR_MASK;
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask,
+				 dd->ipath_gpio_mask);
+	}
+
 	return 0;
 }

diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index b29fe7e..11b3614 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -1085,8 +1085,8 @@ irqreturn_t ipath_intr(int irq, void *data)
 		 * GPIO_2 indicates (on some HT4xx boards) that a packet
 		 *        has arrived for Port 0.  Checking for this
 		 *        is controlled by flag IPATH_GPIO_INTR.
-		 * GPIO_3..5 on IBA6120 Rev2 chips indicate errors
-		 *        that we need to count.  Checking for this
+		 * GPIO_3..5 on IBA6120 Rev2 and IBA6110 Rev4 chips indicate
+		 *        errors that we need to count.  Checking for this
 		 *        is controlled by flag IPATH_GPIO_ERRINTRS.
 		 */
 		u32 gpiostatus;
[ofa-general] [PATCH 02/23] IB/ipath - performance optimization for CPU differences
From: Ralph Campbell [EMAIL PROTECTED] Different processors have different ordering restrictions for write combining. By taking advantage of this, we can eliminate some write barriers when writing to the send buffers. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_diag.c| 22 +- drivers/infiniband/hw/ipath/ipath_iba6120.c |2 + drivers/infiniband/hw/ipath/ipath_kernel.h |2 + drivers/infiniband/hw/ipath/ipath_verbs.c | 62 --- 4 files changed, 53 insertions(+), 35 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_diag.c b/drivers/infiniband/hw/ipath/ipath_diag.c index cf25cda..4137c77 100644 --- a/drivers/infiniband/hw/ipath/ipath_diag.c +++ b/drivers/infiniband/hw/ipath/ipath_diag.c @@ -446,19 +446,21 @@ static ssize_t ipath_diagpkt_write(struct file *fp, dd-ipath_unit, plen - 1, pbufn); if (dp.pbc_wd == 0) - /* Legacy operation, use computed pbc_wd */ dp.pbc_wd = plen; - - /* we have to flush after the PBC for correctness on some cpus -* or WC buffer can be written out of order */ writeq(dp.pbc_wd, piobuf); - ipath_flush_wc(); - /* copy all by the trigger word, then flush, so it's written + /* +* Copy all by the trigger word, then flush, so it's written * to chip before trigger word, then write trigger word, then -* flush again, so packet is sent. */ - __iowrite32_copy(piobuf + 2, tmpbuf, clen - 1); - ipath_flush_wc(); - __raw_writel(tmpbuf[clen - 1], piobuf + clen + 1); +* flush again, so packet is sent. 
+*/ + if (dd-ipath_flags IPATH_PIO_FLUSH_WC) { + ipath_flush_wc(); + __iowrite32_copy(piobuf + 2, tmpbuf, clen - 1); + ipath_flush_wc(); + __raw_writel(tmpbuf[clen - 1], piobuf + clen + 1); + } else + __iowrite32_copy(piobuf + 2, tmpbuf, clen); + ipath_flush_wc(); ret = sizeof(dp); diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index 5b6ac9a..a324c6f 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -1273,6 +1273,8 @@ static void ipath_pe_tidtemplate(struct ipath_devdata *dd) static int ipath_pe_early_init(struct ipath_devdata *dd) { dd-ipath_flags |= IPATH_4BYTE_TID; + if (ipath_unordered_wc()) + dd-ipath_flags |= IPATH_PIO_FLUSH_WC; /* * For openfabrics, we need to be able to handle an IB header of diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 7a7966f..d983f92 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -724,6 +724,8 @@ int ipath_set_rx_pol_inv(struct ipath_devdata *dd, u8 new_pol_inv); #define IPATH_LINKACTIVE0x200 /* link current state is unknown */ #define IPATH_LINKUNK 0x400 + /* Write combining flush needed for PIO */ +#define IPATH_PIO_FLUSH_WC 0x1000 /* no IB cable, or no device on IB cable */ #define IPATH_NOCABLE 0x4000 /* Supports port zero per packet receive interrupts via diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 16aa61f..559d4a6 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -631,7 +631,7 @@ static inline u32 clear_upper_bytes(u32 data, u32 n, u32 off) #endif static void copy_io(u32 __iomem *piobuf, struct ipath_sge_state *ss, - u32 length) + u32 length, unsigned flush_wc) { u32 extra = 0; u32 data = 0; @@ -757,11 +757,14 @@ static void copy_io(u32 __iomem *piobuf, struct ipath_sge_state *ss, } 
/* Update address before sending packet. */ update_sge(ss, length); - /* must flush early everything before trigger word */ - ipath_flush_wc(); - __raw_writel(last, piobuf); - /* be sure trigger word is written */ - ipath_flush_wc(); + if (flush_wc) { + /* must flush early everything before trigger word */ + ipath_flush_wc(); + __raw_writel(last, piobuf); + /* be sure trigger word is written */ + ipath_flush_wc(); + } else + __raw_writel(last, piobuf); } /** @@ -776,6 +779,7 @@ int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, u32 *hdr, u32 len, struct ipath_sge_state *ss) { u32 __iomem *piobuf; + unsigned flush_wc; u32 plen; int ret; @@ -799,47 +803,55 @@ int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, * or WC buffer can be written out of order.
[ofa-general] [PATCH 03/23] IB/ipath - change UD to queue work requests like RC UC
From: Ralph Campbell [EMAIL PROTECTED] The code to post UD sends tried to process work requests at the time ib_post_send() is called without using a WQE queue. This was fine as long as HW resources were available for sending a packet. This patch changes UD to be handled more like RC and UC and shares more code. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_qp.c| 11 - drivers/infiniband/hw/ipath/ipath_rc.c| 61 +++-- drivers/infiniband/hw/ipath/ipath_ruc.c | 308 drivers/infiniband/hw/ipath/ipath_uc.c| 77 ++ drivers/infiniband/hw/ipath/ipath_ud.c| 372 ++--- drivers/infiniband/hw/ipath/ipath_verbs.c | 241 +-- drivers/infiniband/hw/ipath/ipath_verbs.h | 35 ++- 7 files changed, 494 insertions(+), 611 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index 1324b35..a8c4a6b 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -338,6 +338,7 @@ static void ipath_reset_qp(struct ipath_qp *qp) qp-s_busy = 0; qp-s_flags = IPATH_S_SIGNAL_REQ_WR; qp-s_hdrwords = 0; + qp-s_wqe = NULL; qp-s_psn = 0; qp-r_psn = 0; qp-r_msn = 0; @@ -751,6 +752,9 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, switch (init_attr-qp_type) { case IB_QPT_UC: case IB_QPT_RC: + case IB_QPT_UD: + case IB_QPT_SMI: + case IB_QPT_GSI: sz = sizeof(struct ipath_sge) * init_attr-cap.max_send_sge + sizeof(struct ipath_swqe); @@ -759,10 +763,6 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, ret = ERR_PTR(-ENOMEM); goto bail; } - /* FALLTHROUGH */ - case IB_QPT_UD: - case IB_QPT_SMI: - case IB_QPT_GSI: sz = sizeof(*qp); if (init_attr-srq) { struct ipath_srq *srq = to_isrq(init_attr-srq); @@ -805,8 +805,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, spin_lock_init(qp-r_rq.lock); atomic_set(qp-refcount, 0); init_waitqueue_head(qp-wait); - tasklet_init(qp-s_task, ipath_do_ruc_send, -(unsigned long)qp); + tasklet_init(qp-s_task, ipath_do_send, (unsigned long)qp); 
INIT_LIST_HEAD(qp-piowait); INIT_LIST_HEAD(qp-timerwait); qp-state = IB_QPS_RESET; diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 46744ea..53259da 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -81,9 +81,8 @@ static void ipath_init_restart(struct ipath_qp *qp, struct ipath_swqe *wqe) * Note that we are in the responder's side of the QP context. * Note the QP s_lock must be held. */ -static int ipath_make_rc_ack(struct ipath_qp *qp, -struct ipath_other_headers *ohdr, -u32 pmtu, u32 *bth0p, u32 *bth2p) +static int ipath_make_rc_ack(struct ipath_ibdev *dev, struct ipath_qp *qp, +struct ipath_other_headers *ohdr, u32 pmtu) { struct ipath_ack_entry *e; u32 hwords; @@ -192,8 +191,7 @@ static int ipath_make_rc_ack(struct ipath_qp *qp, } qp-s_hdrwords = hwords; qp-s_cur_size = len; - *bth0p = bth0 | (1 22); /* Set M bit */ - *bth2p = bth2; + ipath_make_ruc_header(dev, qp, ohdr, bth0, bth2); return 1; bail: @@ -203,32 +201,39 @@ bail: /** * ipath_make_rc_req - construct a request packet (SEND, RDMA r/w, ATOMIC) * @qp: a pointer to the QP - * @ohdr: a pointer to the IB header being constructed - * @pmtu: the path MTU - * @bth0p: pointer to the BTH opcode word - * @bth2p: pointer to the BTH PSN word * * Return 1 if constructed; otherwise, return 0. - * Note the QP s_lock must be held and interrupts disabled. 
*/ -int ipath_make_rc_req(struct ipath_qp *qp, - struct ipath_other_headers *ohdr, - u32 pmtu, u32 *bth0p, u32 *bth2p) +int ipath_make_rc_req(struct ipath_qp *qp) { struct ipath_ibdev *dev = to_idev(qp-ibqp.device); + struct ipath_other_headers *ohdr; struct ipath_sge_state *ss; struct ipath_swqe *wqe; u32 hwords; u32 len; u32 bth0; u32 bth2; + u32 pmtu = ib_mtu_enum_to_int(qp-path_mtu); char newreq; + unsigned long flags; + int ret = 0; + + ohdr = qp-s_hdr.u.oth; + if (qp-remote_ah_attr.ah_flags IB_AH_GRH) + ohdr = qp-s_hdr.u.l.oth; + + /* +* The lock is needed to synchronize between the sending tasklet, +* the receive interrupt handler,
[ofa-general] [PATCH 04/23] IB/ipath - Verify host bus bandwidth to chip will not limit performance
From: Dave Olson [EMAIL PROTECTED] There have been a number of issues where host bandwidth via HyperTransport or PCIe to the InfiniPath chip has been limited in some fashion (BIOS, configuration, etc.), resulting in user confusion. This check gives a clear warning that something is wrong and needs to be resolved. Signed-off-by: Dave Olson [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_driver.c | 85 1 files changed, 85 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 6ccba36..8fa2bb5 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -34,6 +34,7 @@ #include linux/spinlock.h #include linux/idr.h #include linux/pci.h +#include linux/io.h #include linux/delay.h #include linux/netdevice.h #include linux/vmalloc.h @@ -280,6 +281,88 @@ void __attribute__((weak)) ipath_disable_wc(struct ipath_devdata *dd) { } +/* + * Perform a PIO buffer bandwidth write test, to verify proper system + * configuration. Even when all the setup calls work, occasionally + * BIOS or other issues can prevent write combining from working, or + * can cause other bandwidth problems to the chip. + * + * This test simply writes the same buffer over and over again, and + * measures close to the peak bandwidth to the chip (not testing + * data bandwidth to the wire). On chips that use an address-based + * trigger to send packets to the wire, this is easy. On chips that + * use a count to trigger, we want to make sure that the packet doesn't + * go out on the wire, or trigger flow control checks. 
+ */ +static void ipath_verify_pioperf(struct ipath_devdata *dd) +{ + u32 pbnum, cnt, lcnt; + u32 __iomem *piobuf; + u32 *addr; + u64 msecs, emsecs; + + piobuf = ipath_getpiobuf(dd, pbnum); + if (!piobuf) { + dev_info(dd-pcidev-dev, + No PIObufs for checking perf, skipping\n); + goto done; + + } + + /* +* Enough to give us a reasonable test, less than piobuf size, and +* likely multiple of store buffer length. +*/ + cnt = 1024; + + addr = vmalloc(cnt); + if (!addr) { + dev_info(dd-pcidev-dev, + Couldn't get memory for checking PIO perf, +skipping\n); + goto done; + } + + + preempt_disable(); /* we want reasonably accurate elapsed time */ + msecs = 1 + jiffies_to_msecs(jiffies); + for (lcnt = 0; lcnt 1U; lcnt++) { + /* wait until we cross msec boundary */ + if (jiffies_to_msecs(jiffies) = msecs) + break; + udelay(1); + } + + writeq(0, piobuf); /* length 0, no dwords actually sent */ + ipath_flush_wc(); + + /* +* this is only roughly accurate, since even with preempt we +* still take interrupts that could take a while. 
Running for +* = 5 msec seems to get us close enough to accurate values +*/ + msecs = jiffies_to_msecs(jiffies); + for (emsecs = lcnt = 0; emsecs = 5UL; lcnt++) { + __iowrite32_copy(piobuf + 64, addr, cnt 2); + emsecs = jiffies_to_msecs(jiffies) - msecs; + } + + /* 1 GiB/sec, slightly over IB SDR line rate */ + if (lcnt (emsecs * 1024U)) + ipath_dev_err(dd, + Performance problem: bandwidth to PIO buffers is + only %u MiB/sec\n, + lcnt / (u32) emsecs); + else + ipath_dbg(PIO buffer bandwidth %u MiB/sec is OK\n, + lcnt / (u32) emsecs); + + preempt_enable(); +done: + if (piobuf) /* disarm it, so it's available again */ + ipath_disarm_piobufs(dd, pbnum, 1); +} + static int __devinit ipath_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) { @@ -515,6 +598,8 @@ static int __devinit ipath_init_one(struct pci_dev *pdev, ret = 0; } + ipath_verify_pioperf(dd); + ipath_device_create_group(pdev-dev, dd); ipathfs_add_device(dd); ipath_user_add(dd); ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH 05/23] IB/ipath - Remove unneeded code for ipathfs
From: Ralph Campbell [EMAIL PROTECTED] The ipathfs file system is used to export binary data verses ASCII data such as through /sys. This patch removes some unneeded files since the data is available through other /sys files. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_fs.c | 187 1 files changed, 0 insertions(+), 187 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c index 2e689b9..262c25d 100644 --- a/drivers/infiniband/hw/ipath/ipath_fs.c +++ b/drivers/infiniband/hw/ipath/ipath_fs.c @@ -130,175 +130,6 @@ static const struct file_operations atomic_counters_ops = { .read = atomic_counters_read, }; -static ssize_t atomic_node_info_read(struct file *file, char __user *buf, -size_t count, loff_t *ppos) -{ - u32 nodeinfo[10]; - struct ipath_devdata *dd; - u64 guid; - - dd = file-f_path.dentry-d_inode-i_private; - - guid = be64_to_cpu(dd-ipath_guid); - - nodeinfo[0] = /* BaseVersion is SMA */ - /* ClassVersion is SMA */ - (1 8)/* NodeType */ - | (1 0); /* NumPorts */ - nodeinfo[1] = (u32) (guid 32); - nodeinfo[2] = (u32) (guid 0x); - /* PortGUID == SystemImageGUID for us */ - nodeinfo[3] = nodeinfo[1]; - /* PortGUID == SystemImageGUID for us */ - nodeinfo[4] = nodeinfo[2]; - /* PortGUID == NodeGUID for us */ - nodeinfo[5] = nodeinfo[3]; - /* PortGUID == NodeGUID for us */ - nodeinfo[6] = nodeinfo[4]; - nodeinfo[7] = (4 16) /* we support 4 pkeys */ - | (dd-ipath_deviceid 0); - /* our chip version as 16 bits major, 16 bits minor */ - nodeinfo[8] = dd-ipath_minrev | (dd-ipath_majrev 16); - nodeinfo[9] = (dd-ipath_unit 24) | (dd-ipath_vendorid 0); - - return simple_read_from_buffer(buf, count, ppos, nodeinfo, - sizeof nodeinfo); -} - -static const struct file_operations atomic_node_info_ops = { - .read = atomic_node_info_read, -}; - -static ssize_t atomic_port_info_read(struct file *file, char __user *buf, -size_t count, loff_t *ppos) -{ - u32 portinfo[13]; - u32 tmp, tmp2; - 
struct ipath_devdata *dd; - - dd = file-f_path.dentry-d_inode-i_private; - - /* so we only initialize non-zero fields. */ - memset(portinfo, 0, sizeof portinfo); - - /* -* Notimpl yet M_Key (64) -* Notimpl yet GID (64) -*/ - - portinfo[4] = (dd-ipath_lid 16); - - /* -* Notimpl yet SMLID. -* CapabilityMask is 0, we don't support any of these -* DiagCode is 0; we don't store any diag info for now Notimpl yet -* M_KeyLeasePeriod (we don't support M_Key) -*/ - - /* LocalPortNum is whichever port number they ask for */ - portinfo[7] = (dd-ipath_unit 24) - /* LinkWidthEnabled */ - | (2 16) - /* LinkWidthSupported (really 2, but not IB valid) */ - | (3 8) - /* LinkWidthActive */ - | (2 0); - tmp = dd-ipath_lastibcstat IPATH_IBSTATE_MASK; - tmp2 = 5; - if (tmp == IPATH_IBSTATE_INIT) - tmp = 2; - else if (tmp == IPATH_IBSTATE_ARM) - tmp = 3; - else if (tmp == IPATH_IBSTATE_ACTIVE) - tmp = 4; - else { - tmp = 0;/* down */ - tmp2 = tmp 0xf; - } - - portinfo[8] = (1 28) /* LinkSpeedSupported */ - | (tmp 24) /* PortState */ - | (tmp2 20) /* PortPhysicalState */ - | (2 16) - - /* LinkDownDefaultState */ - /* M_KeyProtectBits == 0 */ - /* NotImpl yet LMC == 0 (we can support all values) */ - | (1 4) /* LinkSpeedActive */ - | (1 0); /* LinkSpeedEnabled */ - switch (dd-ipath_ibmtu) { - case 4096: - tmp = 5; - break; - case 2048: - tmp = 4; - break; - case 1024: - tmp = 3; - break; - case 512: - tmp = 2; - break; - case 256: - tmp = 1; - break; - default:/* oops, something is wrong */ - ipath_dbg(Problem, ipath_ibmtu 0x%x not a valid IB MTU, - treat as 2048\n, dd-ipath_ibmtu); - tmp = 4; - break; - } - portinfo[9] = (tmp 28) - /* NeighborMTU */ - /* Notimpl MasterSMSL */ - | (1 20) - - /* VLCap */ - /* Notimpl InitType (actually, an SMA decision) */
[ofa-general] [PATCH 06/23] IB/ipath - correctly describe workaround for TID write chip bug
From: Dave Olson [EMAIL PROTECTED]

This is a comment change only, correcting the comment to match the
implemented workaround rather than the original workaround, and
clarifying why it's needed.

Signed-off-by: Dave Olson [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_iba6120.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c
index a324c6f..d43f0b3 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c
@@ -1143,11 +1143,14 @@ static void ipath_pe_put_tid(struct ipath_devdata *dd, u64 __iomem *tidptr,
 		pa |= 2 << 29;
 	}

-	/* workaround chip bug 9437 by writing each TID twice
-	 * and holding a spinlock around the writes, so they don't
-	 * intermix with other TID (eager or expected) writes
-	 * Unfortunately, this call can be done from interrupt level
-	 * for the port 0 eager TIDs, so we have to use irqsave
+	/*
+	 * Workaround chip bug 9437 by writing the scratch register
+	 * before and after the TID, and with an io write barrier.
+	 * We use a spinlock around the writes, so they can't intermix
+	 * with other TID (eager or expected) writes (the chip bug
+	 * is triggered by back to back TID writes).  Unfortunately, this
+	 * call can be done from interrupt level for the port 0 eager TIDs,
+	 * so we have to use irqsave locks.
 	 */
 	spin_lock_irqsave(&dd->ipath_tid_lock, flags);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_scratch, 0xfeeddeaf);
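The bracketing discipline the corrected comment describes -- never letting the chip see two back-to-back TID writes, by inserting a scratch-register write before and after each one -- can be modeled with a toy write log. The register names and log here are hypothetical; the real code uses ipath_write_kreg()/writeq() with an I/O barrier under ipath_tid_lock.

```c
/*
 * Toy model of the bug-9437 workaround: record the target of each
 * "register" write and check that no two TID writes are adjacent.
 */
enum reg { REG_SCRATCH, REG_TID };

#define LOG_MAX 16
static enum reg write_log[LOG_MAX];
static int nwrites;

static void chip_write(enum reg r)
{
	if (nwrites < LOG_MAX)
		write_log[nwrites] = r;
	nwrites++;
	/* an I/O write barrier would go here to keep device ordering */
}

/* Write one TID entry, bracketed by scratch writes. */
static void put_tid(void)
{
	chip_write(REG_SCRATCH);
	chip_write(REG_TID);
	chip_write(REG_SCRATCH);
}

/* The invariant: no two adjacent log entries are both TID writes. */
static int log_ok(void)
{
	for (int i = 1; i < nwrites && i < LOG_MAX; i++)
		if (write_log[i] == REG_TID && write_log[i - 1] == REG_TID)
			return 0;
	return 1;
}
```

Even when put_tid() is called repeatedly, the trailing scratch write of one call separates its TID write from the next call's, which is exactly why the spinlock must cover the whole bracket.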
[ofa-general] [PATCH 07/23] IB/ipath - UC RDMA WRITE with IMMEDIATE doesn't send the immediate
From: Ralph Campbell [EMAIL PROTECTED]

This patch fixes a bug in the receive processing for UC RDMA WRITE with
immediate which caused the last packet to be dropped.

Signed-off-by: Ralph Campbell [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_uc.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c
index 767beb9..2dd8de2 100644
--- a/drivers/infiniband/hw/ipath/ipath_uc.c
+++ b/drivers/infiniband/hw/ipath/ipath_uc.c
@@ -464,6 +464,16 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,

 	case OP(RDMA_WRITE_LAST_WITH_IMMEDIATE):
 	rdma_last_imm:
+		if (header_in_data) {
+			wc.imm_data = *(__be32 *) data;
+			data += sizeof(__be32);
+		} else {
+			/* Immediate data comes after BTH */
+			wc.imm_data = ohdr->u.imm_data;
+		}
+		hdrsize += 4;
+		wc.wc_flags = IB_WC_WITH_IMM;
+
 		/* Get the number of bytes the message was padded by. */
 		pad = (be32_to_cpu(ohdr->bth[0]) >> 20) & 3;
 		/* Check for invalid length. */
@@ -484,16 +494,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 			dev->n_pkt_drops++;
 			goto done;
 		}
-		if (header_in_data) {
-			wc.imm_data = *(__be32 *) data;
-			data += sizeof(__be32);
-		} else {
-			/* Immediate data comes after BTH */
-			wc.imm_data = ohdr->u.imm_data;
-		}
-		hdrsize += 4;
-		wc.wc_flags = IB_WC_WITH_IMM;
-		wc.byte_len = 0;
+		wc.byte_len = qp->r_len;
 		goto last_imm;

 	case OP(RDMA_WRITE_LAST):
[ofa-general] [PATCH 08/23] IB/ipath - future proof eeprom checksum code (contents reading)
From: Dave Olson [EMAIL PROTECTED]

In an earlier change, the amount of data read from the flash was
mistakenly limited to the size known to the current driver.  This
causes problems when the length is increased and written with the
new, longer version; the checksum would fail because not enough data
was read.  Always read the full 128 byte length to prevent this.

Signed-off-by: Dave Olson [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_eeprom.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_eeprom.c b/drivers/infiniband/hw/ipath/ipath_eeprom.c
index b4503e9..bcfa3cc 100644
--- a/drivers/infiniband/hw/ipath/ipath_eeprom.c
+++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c
@@ -596,7 +596,11 @@ void ipath_get_eeprom_info(struct ipath_devdata *dd)
 		goto bail;
 	}

-	len = offsetof(struct ipath_flash, if_future);
+	/*
+	 * read full flash, not just currently used part, since it may have
+	 * been written with a newer definition
+	 */
+	len = sizeof(struct ipath_flash);
 	buf = vmalloc(len);
 	if (!buf) {
 		ipath_dev_err(dd, "Couldn't allocate memory to read %u
@@ -737,8 +741,10 @@ int ipath_update_eeprom_log(struct ipath_devdata *dd)
 	/*
 	 * The quick-check above determined that there is something worthy
 	 * of logging, so get current contents and do a more detailed idea.
+	 * read full flash, not just currently used part, since it may have
+	 * been written with a newer definition
 	 */
-	len = offsetof(struct ipath_flash, if_future);
+	len = sizeof(struct ipath_flash);
 	buf = vmalloc(len);
 	ret = 1;
 	if (!buf) {
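Why reading only the driver-known prefix breaks the checksum can be shown with a toy 128-byte image. The checksum scheme below (one byte covering all other bytes) is a simplification for illustration, not necessarily the exact scheme ipath uses: if a newer tool wrote valid data past the fields this driver knows about, a checksum computed over only the prefix no longer matches.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Toy flash image: byte 0 holds a checksum over bytes 1..len-1.
 * Verifying against only a known prefix of a longer image fails even
 * though the image itself is intact -- the bug the patch fixes.
 */
#define FLASH_LEN 128

static uint8_t flash_csum(const uint8_t *buf, size_t len)
{
	uint8_t sum = 0;

	for (size_t i = 1; i < len; i++)	/* skip the checksum byte */
		sum += buf[i];
	return sum;
}
```

Filling the image with a known pattern and setting byte 0 from the full-length sum, the full-length verification succeeds while a 64-byte prefix verification does not -- hence the change to always read sizeof(struct ipath_flash).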
[ofa-general] [PATCH 09/23] IB/ipath - Remove redundant code
From: Ralph Campbell [EMAIL PROTECTED]

This patch removes some redundant initialization code.

Signed-off-by: Ralph Campbell [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_driver.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 8fa2bb5..e5d058a 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -381,8 +381,6 @@ static int __devinit ipath_init_one(struct pci_dev *pdev,

 	ipath_cdbg(VERBOSE, "initializing unit #%u\n", dd->ipath_unit);

-	read_bars(dd, pdev, &bar0, &bar1);
-
 	ret = pci_enable_device(pdev);
 	if (ret) {
 		/* This can happen iff:
@@ -528,9 +526,6 @@ static int __devinit ipath_init_one(struct pci_dev *pdev,
 		goto bail_regions;
 	}

-	dd->ipath_deviceid = ent->device;	/* save for later use */
-	dd->ipath_vendorid = ent->vendor;
-	dd->ipath_pcirev = pdev->revision;

 #if defined(__powerpc__)
[ofa-general] [PATCH 12/23] IB/ipath - optimize completion queue entry insertion and polling
From: Ralph Campbell [EMAIL PROTECTED] The code to add an entry to the completion queue stored the QPN which is needed for the user level verbs view of the completion queue entry but the kernel struct ib_wc contains a pointer to the QP instead of a QPN. When the kernel polled for a completion queue entry, the QPN was lookup up and the QP pointer recovered. This patch stores the CQE differently based on whether the CQ is a kernel CQ or a user CQ thus avoiding the QPN to QP lookup overhead. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_cq.c| 94 +++-- drivers/infiniband/hw/ipath/ipath_verbs.h |6 ++ 2 files changed, 53 insertions(+), 47 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index a6f04d2..645ed71 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -76,22 +76,25 @@ void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited) } return; } - wc-queue[head].wr_id = entry-wr_id; - wc-queue[head].status = entry-status; - wc-queue[head].opcode = entry-opcode; - wc-queue[head].vendor_err = entry-vendor_err; - wc-queue[head].byte_len = entry-byte_len; - wc-queue[head].imm_data = (__u32 __force)entry-imm_data; - wc-queue[head].qp_num = entry-qp-qp_num; - wc-queue[head].src_qp = entry-src_qp; - wc-queue[head].wc_flags = entry-wc_flags; - wc-queue[head].pkey_index = entry-pkey_index; - wc-queue[head].slid = entry-slid; - wc-queue[head].sl = entry-sl; - wc-queue[head].dlid_path_bits = entry-dlid_path_bits; - wc-queue[head].port_num = entry-port_num; - /* Make sure queue entry is written before the head index. 
*/ - smp_wmb(); + if (cq-ip) { + wc-uqueue[head].wr_id = entry-wr_id; + wc-uqueue[head].status = entry-status; + wc-uqueue[head].opcode = entry-opcode; + wc-uqueue[head].vendor_err = entry-vendor_err; + wc-uqueue[head].byte_len = entry-byte_len; + wc-uqueue[head].imm_data = (__u32 __force)entry-imm_data; + wc-uqueue[head].qp_num = entry-qp-qp_num; + wc-uqueue[head].src_qp = entry-src_qp; + wc-uqueue[head].wc_flags = entry-wc_flags; + wc-uqueue[head].pkey_index = entry-pkey_index; + wc-uqueue[head].slid = entry-slid; + wc-uqueue[head].sl = entry-sl; + wc-uqueue[head].dlid_path_bits = entry-dlid_path_bits; + wc-uqueue[head].port_num = entry-port_num; + /* Make sure entry is written before the head index. */ + smp_wmb(); + } else + wc-kqueue[head] = *entry; wc-head = next; if (cq-notify == IB_CQ_NEXT_COMP || @@ -130,6 +133,12 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) int npolled; u32 tail; + /* The kernel can only poll a kernel completion queue */ + if (cq-ip) { + npolled = -EINVAL; + goto bail; + } + spin_lock_irqsave(cq-lock, flags); wc = cq-queue; @@ -137,31 +146,10 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) if (tail (u32) cq-ibcq.cqe) tail = (u32) cq-ibcq.cqe; for (npolled = 0; npolled num_entries; ++npolled, ++entry) { - struct ipath_qp *qp; - if (tail == wc-head) break; - /* Make sure entry is read after head index is read. 
*/ - smp_rmb(); - qp = ipath_lookup_qpn(to_idev(cq-ibcq.device)-qp_table, - wc-queue[tail].qp_num); - entry-qp = qp-ibqp; - if (atomic_dec_and_test(qp-refcount)) - wake_up(qp-wait); - - entry-wr_id = wc-queue[tail].wr_id; - entry-status = wc-queue[tail].status; - entry-opcode = wc-queue[tail].opcode; - entry-vendor_err = wc-queue[tail].vendor_err; - entry-byte_len = wc-queue[tail].byte_len; - entry-imm_data = wc-queue[tail].imm_data; - entry-src_qp = wc-queue[tail].src_qp; - entry-wc_flags = wc-queue[tail].wc_flags; - entry-pkey_index = wc-queue[tail].pkey_index; - entry-slid = wc-queue[tail].slid; - entry-sl = wc-queue[tail].sl; - entry-dlid_path_bits = wc-queue[tail].dlid_path_bits; - entry-port_num = wc-queue[tail].port_num; + /* The kernel doesn't need a RMB since it has the lock. */ + *entry = wc-kqueue[tail]; if (tail = cq-ibcq.cqe) tail = 0; else @@
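The point of the patch above is to avoid a QPN-to-QP hash lookup (plus reference counting) on every kernel poll: when the CQ is not mapped into userspace, the full kernel work-completion entry, QP pointer included, is stored directly. A scaled-down model of the split, with hypothetical types rather than the real ipath structures:

```c
#include <assert.h>

/* Hypothetical scaled-down model of the kernel/user CQE split. */
struct user_cqe { unsigned wr_id; unsigned qp_num; }; /* user view: QPN only */
struct kern_cqe { unsigned wr_id; void *qp; };        /* kernel view: pointer */

struct cq {
	int user_mapped;            /* analogous to cq->ip being non-NULL */
	struct user_cqe uqueue[4];
	struct kern_cqe kqueue[4];
	unsigned head;
};

/* On entry, write whichever representation the consumer can use:
 * userspace cannot dereference a kernel pointer, so it gets the QPN;
 * the kernel keeps the pointer and skips the lookup at poll time. */
static void cq_enter(struct cq *cq, unsigned wr_id, unsigned qp_num, void *qp)
{
	if (cq->user_mapped) {
		cq->uqueue[cq->head].wr_id = wr_id;
		cq->uqueue[cq->head].qp_num = qp_num;
	} else {
		cq->kqueue[cq->head].wr_id = wr_id;
		cq->kqueue[cq->head].qp = qp;
	}
	cq->head = (cq->head + 1) % 4;
}
```

The cost of the extra array is paid back by removing `ipath_lookup_qpn()` and the refcount wake-up from the kernel poll path.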
[ofa-general] [PATCH 13/23] IB/ipath -- Add ability to set the LMC via the sysfs debugging interface
From: Ralph Campbell [EMAIL PROTECTED]

This patch adds the ability to set the LMC via a sysfs file as if the SM sent a SubnSet(PortInfo) MAD. It is useful for debugging when no SM is running.

Signed-off-by: Ralph Campbell [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_sysfs.c |   40 ++++++++++++++++++++++++++-
 1 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_sysfs.c b/drivers/infiniband/hw/ipath/ipath_sysfs.c
index 16238cd..e1ad7cf 100644
--- a/drivers/infiniband/hw/ipath/ipath_sysfs.c
+++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c
@@ -163,6 +163,42 @@ static ssize_t show_boardversion(struct device *dev,
 	return scnprintf(buf, PAGE_SIZE, "%s", dd->ipath_boardversion);
 }
 
+static ssize_t show_lmc(struct device *dev,
+			struct device_attribute *attr,
+			char *buf)
+{
+	struct ipath_devdata *dd = dev_get_drvdata(dev);
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_lmc);
+}
+
+static ssize_t store_lmc(struct device *dev,
+			 struct device_attribute *attr,
+			 const char *buf,
+			 size_t count)
+{
+	struct ipath_devdata *dd = dev_get_drvdata(dev);
+	u16 lmc = 0;
+	int ret;
+
+	ret = ipath_parse_ushort(buf, &lmc);
+	if (ret < 0)
+		goto invalid;
+
+	if (lmc > 7) {
+		ret = -EINVAL;
+		goto invalid;
+	}
+
+	ipath_set_lid(dd, dd->ipath_lid, lmc);
+
+	goto bail;
+invalid:
+	ipath_dev_err(dd, "attempt to set invalid LMC %u\n", lmc);
+bail:
+	return ret;
+}
+
 static ssize_t show_lid(struct device *dev,
 			struct device_attribute *attr,
 			char *buf)
@@ -190,7 +226,7 @@ static ssize_t store_lid(struct device *dev,
 		goto invalid;
 	}
 
-	ipath_set_lid(dd, lid, 0);
+	ipath_set_lid(dd, lid, dd->ipath_lmc);
 
 	goto bail;
 invalid:
@@ -648,6 +684,7 @@ static struct attribute_group driver_attr_group = {
 };
 
 static DEVICE_ATTR(guid, S_IWUSR | S_IRUGO, show_guid, store_guid);
+static DEVICE_ATTR(lmc, S_IWUSR | S_IRUGO, show_lmc, store_lmc);
 static DEVICE_ATTR(lid, S_IWUSR | S_IRUGO, show_lid, store_lid);
 static DEVICE_ATTR(link_state, S_IWUSR, NULL, store_link_state);
 static DEVICE_ATTR(mlid, S_IWUSR | S_IRUGO, show_mlid, store_mlid);
@@ -667,6 +704,7 @@ static DEVICE_ATTR(logged_errors, S_IRUGO, show_logged_errs, NULL);
 
 static struct attribute *dev_attributes[] = {
 	&dev_attr_guid.attr,
+	&dev_attr_lmc.attr,
 	&dev_attr_lid.attr,
 	&dev_attr_link_state.attr,
 	&dev_attr_mlid.attr,
[ofa-general] [PATCH 14/23] IB/ipath - remove duplicate copy of LMC
From: Ralph Campbell [EMAIL PROTECTED] The LMC value was being saved by the SMA in two places. This patch cleans it up so only one copy is kept. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_mad.c | 39 - drivers/infiniband/hw/ipath/ipath_ud.c| 10 --- drivers/infiniband/hw/ipath/ipath_verbs.c |4 +-- drivers/infiniband/hw/ipath/ipath_verbs.h |2 + 4 files changed, 29 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c index d61c030..8f15216 100644 --- a/drivers/infiniband/hw/ipath/ipath_mad.c +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -245,7 +245,7 @@ static int recv_subn_get_portinfo(struct ib_smp *smp, /* Only return the mkey if the protection field allows it. */ if (smp-method == IB_MGMT_METHOD_SET || dev-mkey == smp-mkey || - (dev-mkeyprot_resv_lmc 6) == 0) + dev-mkeyprot == 0) pip-mkey = dev-mkey; pip-gid_prefix = dev-gid_prefix; lid = dev-dd-ipath_lid; @@ -264,7 +264,7 @@ static int recv_subn_get_portinfo(struct ib_smp *smp, pip-portphysstate_linkdown = (ipath_cvt_physportstate[ibcstat 0xf] 4) | (get_linkdowndefaultstate(dev-dd) ? 
1 : 2); - pip-mkeyprot_resv_lmc = dev-mkeyprot_resv_lmc; + pip-mkeyprot_resv_lmc = (dev-mkeyprot 6) | dev-dd-ipath_lmc; pip-linkspeedactive_enabled = 0x11;/* 2.5Gbps, 2.5Gbps */ switch (dev-dd-ipath_ibmtu) { case 4096: @@ -401,6 +401,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, struct ib_port_info *pip = (struct ib_port_info *)smp-data; struct ib_event event; struct ipath_ibdev *dev; + struct ipath_devdata *dd; u32 flags; char clientrereg = 0; u16 lid, smlid; @@ -415,6 +416,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, goto err; dev = to_idev(ibdev); + dd = dev-dd; event.device = ibdev; event.element.port_num = port; @@ -423,11 +425,12 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, dev-mkey_lease_period = be16_to_cpu(pip-mkey_lease_period); lid = be16_to_cpu(pip-lid); - if (lid != dev-dd-ipath_lid) { + if (dd-ipath_lid != lid || + dd-ipath_lmc != (pip-mkeyprot_resv_lmc 7)) { /* Must be a valid unicast LID address. */ if (lid == 0 || lid = IPATH_MULTICAST_LID_BASE) goto err; - ipath_set_lid(dev-dd, lid, pip-mkeyprot_resv_lmc 7); + ipath_set_lid(dd, lid, pip-mkeyprot_resv_lmc 7); event.event = IB_EVENT_LID_CHANGE; ib_dispatch_event(event); } @@ -461,18 +464,18 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, case 0: /* NOP */ break; case 1: /* SLEEP */ - if (set_linkdowndefaultstate(dev-dd, 1)) + if (set_linkdowndefaultstate(dd, 1)) goto err; break; case 2: /* POLL */ - if (set_linkdowndefaultstate(dev-dd, 0)) + if (set_linkdowndefaultstate(dd, 0)) goto err; break; default: goto err; } - dev-mkeyprot_resv_lmc = pip-mkeyprot_resv_lmc; + dev-mkeyprot = pip-mkeyprot_resv_lmc 6; dev-vl_high_limit = pip-vl_high_limit; switch ((pip-neighbormtu_mastersmsl 4) 0xF) { @@ -495,7 +498,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, /* XXX We have already partially updated our state! 
*/ goto err; } - ipath_set_mtu(dev-dd, mtu); + ipath_set_mtu(dd, mtu); dev-sm_sl = pip-neighbormtu_mastersmsl 0xF; @@ -511,16 +514,16 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, * later. */ if (pip-pkey_violations == 0) - dev-z_pkey_violations = ipath_get_cr_errpkey(dev-dd); + dev-z_pkey_violations = ipath_get_cr_errpkey(dd); if (pip-qkey_violations == 0) dev-qkey_violations = 0; ore = pip-localphyerrors_overrunerrors; - if (set_phyerrthreshold(dev-dd, (ore 4) 0xF)) + if (set_phyerrthreshold(dd, (ore 4) 0xF)) goto err; - if (set_overrunthreshold(dev-dd, (ore 0xF))) + if (set_overrunthreshold(dd, (ore 0xF))) goto err; dev-subnet_timeout = pip-clientrereg_resv_subnetto 0x1F; @@ -538,7 +541,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, * is down or is being set to down. */ state = pip-linkspeed_portstate 0xF; - flags = dev-dd-ipath_flags; + flags = dd-ipath_flags; lstate = (pip-portphysstate_linkdown 4) 0xF; if (lstate
[ofa-general] [PATCH 20/23] IB/ipath - better handling of unexpected GPIO interrupts
From: Michael Albaugh [EMAIL PROTECTED]

The General Purpose I/O pins can be configured to cause interrupts. At the end of the interrupt code dealing with all known causes, a message is output if any bits remain un-handled. Since this is a "can't happen" scenario, it should only be triggered by bugs elsewhere. It is harmless, and potentially beneficial, to limit the damage by masking any such unexpected interrupts.

This patch adds disabling of interrupts from any pins that should not have been allowed to interrupt, in addition to emitting a message.

Signed-off-by: Michael Albaugh [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_intr.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 61eac8c..801a20d 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -1124,10 +1124,8 @@ irqreturn_t ipath_intr(int irq, void *data)
 		/*
 		 * Some unexpected bits remain. If they could have
 		 * caused the interrupt, complain and clear.
-		 * MEA: this is almost certainly non-ideal.
-		 * we should look into auto-disable of unexpected
-		 * GPIO interrupts, possibly on a "three strikes"
-		 * basis.
+		 * To avoid repetition of this condition, also clear
+		 * the mask. It is almost certainly due to error.
 		 */
 		const u32 mask = (u32) dd->ipath_gpio_mask;
 
@@ -1135,6 +1133,10 @@ irqreturn_t ipath_intr(int irq, void *data)
 			ipath_dbg("Unexpected GPIO IRQ bits %x\n",
 				  gpiostatus & mask);
 			to_clear |= (gpiostatus & mask);
+			dd->ipath_gpio_mask &= ~(gpiostatus & mask);
+			ipath_write_kreg(dd,
+					 dd->ipath_kregs->kr_gpio_mask,
+					 dd->ipath_gpio_mask);
 		}
 	}
 	if (to_clear) {
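The change above follows a common pattern for spurious interrupt sources: acknowledge the stray bits once, then clear their enable bits so they can never fire again. A small self-contained sketch of the mask arithmetic (the names are illustrative, not the ipath registers):

```c
#include <assert.h>
#include <stdint.h>

/* Given the current interrupt-enable mask and a status word that still
 * has unhandled bits set, compute the bits to acknowledge once and
 * return the new enable mask with those sources disabled for good. */
static uint32_t quiesce_unexpected(uint32_t enable_mask, uint32_t status,
				   uint32_t *to_clear)
{
	uint32_t unexpected = status & enable_mask;

	*to_clear = unexpected;           /* ack them this one time */
	return enable_mask & ~unexpected; /* never take them again */
}
```

This trades a small loss of debuggability (the pin goes silent) for protection against an interrupt storm from a "can't happen" source.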
[ofa-general] [PATCH 21/23] IB/ipath - fix IB_EVENT_PORT_ERR event
From: Ralph Campbell [EMAIL PROTECTED] The link state event calls were being generated when the SM told the SMA to change link states. This works for IB_EVENT_PORT_ACTIVE but not if the link goes down and stays down. The fix is to generate event calls from the interrupt handler when the HW link state changes. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_driver.c |2 ++ drivers/infiniband/hw/ipath/ipath_intr.c | 17 + drivers/infiniband/hw/ipath/ipath_kernel.h |2 ++ drivers/infiniband/hw/ipath/ipath_mad.c| 10 -- drivers/infiniband/hw/ipath/ipath_verbs.c | 12 ++-- 5 files changed, 31 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index e5d058a..799fac2 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -2085,6 +2085,8 @@ void ipath_shutdown_device(struct ipath_devdata *dd) INFINIPATH_IBCC_LINKINITCMD_SHIFT); ipath_cancel_sends(dd, 0); + signal_ib_event(dd, IB_EVENT_PORT_ERR); + /* disable IBC */ dd-ipath_control = ~INFINIPATH_C_LINKENABLE; ipath_write_kreg(dd, dd-ipath_kregs-kr_control, diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 801a20d..6a5dd5c 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -275,6 +275,16 @@ static char *ib_linkstate(u32 linkstate) return ret; } +void signal_ib_event(struct ipath_devdata *dd, enum ib_event_type ev) +{ + struct ib_event event; + + event.device = dd-verbs_dev-ibdev; + event.element.port_num = 1; + event.event = ev; + ib_dispatch_event(event); +} + static void handle_e_ibstatuschanged(struct ipath_devdata *dd, ipath_err_t errs, int noprint) { @@ -373,6 +383,8 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd, dd-ipath_ibpollcnt = 0;/* some state other than 2 or 3 */ ipath_stats.sps_iblink++; if (ltstate != 
INFINIPATH_IBCS_LT_STATE_LINKUP) { + if (dd-ipath_flags IPATH_LINKACTIVE) + signal_ib_event(dd, IB_EVENT_PORT_ERR); dd-ipath_flags |= IPATH_LINKDOWN; dd-ipath_flags = ~(IPATH_LINKUNK | IPATH_LINKINIT | IPATH_LINKACTIVE | @@ -405,7 +417,10 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd, *dd-ipath_statusp |= IPATH_STATUS_IB_READY | IPATH_STATUS_IB_CONF; dd-ipath_f_setextled(dd, lstate, ltstate); + signal_ib_event(dd, IB_EVENT_PORT_ACTIVE); } else if ((val IPATH_IBSTATE_MASK) == IPATH_IBSTATE_INIT) { + if (dd-ipath_flags IPATH_LINKACTIVE) + signal_ib_event(dd, IB_EVENT_PORT_ERR); /* * set INIT and DOWN. Down is checked by most of the other * code, but INIT is useful to know in a few places. @@ -418,6 +433,8 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd, | IPATH_STATUS_IB_READY); dd-ipath_f_setextled(dd, lstate, ltstate); } else if ((val IPATH_IBSTATE_MASK) == IPATH_IBSTATE_ARM) { + if (dd-ipath_flags IPATH_LINKACTIVE) + signal_ib_event(dd, IB_EVENT_PORT_ERR); dd-ipath_flags |= IPATH_LINKARMED; dd-ipath_flags = ~(IPATH_LINKUNK | IPATH_LINKDOWN | IPATH_LINKINIT | diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 872fb36..8786dd7 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -42,6 +42,7 @@ #include linux/pci.h #include linux/dma-mapping.h #include asm/io.h +#include rdma/ib_verbs.h #include ipath_common.h #include ipath_debug.h @@ -775,6 +776,7 @@ void ipath_get_eeprom_info(struct ipath_devdata *); int ipath_update_eeprom_log(struct ipath_devdata *dd); void ipath_inc_eeprom_err(struct ipath_devdata *dd, u32 eidx, u32 incr); u64 ipath_snap_cntr(struct ipath_devdata *, ipath_creg); +void signal_ib_event(struct ipath_devdata *dd, enum ib_event_type ev); /* * Set LED override, only the two LSBs have public meaning, but diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c index 
8f15216..0ae3a7c 100644 --- a/drivers/infiniband/hw/ipath/ipath_mad.c +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -570,26 +570,16 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, else goto err; ipath_set_linkstate(dd, lstate); - if (flags
[ofa-general] [PATCH 22/23] IB/ipath - remove redundant link state checks
From: Ralph Campbell [EMAIL PROTECTED]

This patch removes some redundant checks when the SMA changes the link state since the same checks are made in the lower level function that sets the state.

Signed-off-by: Ralph Campbell [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_mad.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c
index 0ae3a7c..3d1432d 100644
--- a/drivers/infiniband/hw/ipath/ipath_mad.c
+++ b/drivers/infiniband/hw/ipath/ipath_mad.c
@@ -402,7 +402,6 @@ static int recv_subn_set_portinfo(struct ib_smp *smp,
 	struct ib_event event;
 	struct ipath_ibdev *dev;
 	struct ipath_devdata *dd;
-	u32 flags;
 	char clientrereg = 0;
 	u16 lid, smlid;
 	u8 lwe;
@@ -541,7 +540,6 @@ static int recv_subn_set_portinfo(struct ib_smp *smp,
 	 * is down or is being set to down.
 	 */
 	state = pip->linkspeed_portstate & 0xF;
-	flags = dd->ipath_flags;
 	lstate = (pip->portphysstate_linkdown >> 4) & 0xF;
 	if (lstate && !(state == IB_PORT_DOWN || state == IB_PORT_NOP))
 		goto err;
@@ -572,13 +570,9 @@ static int recv_subn_set_portinfo(struct ib_smp *smp,
 		ipath_set_linkstate(dd, lstate);
 		break;
 	case IB_PORT_ARMED:
-		if (!(flags & (IPATH_LINKINIT | IPATH_LINKACTIVE)))
-			break;
 		ipath_set_linkstate(dd, IPATH_IB_LINKARM);
 		break;
 	case IB_PORT_ACTIVE:
-		if (!(flags & IPATH_LINKARMED))
-			break;
 		ipath_set_linkstate(dd, IPATH_IB_LINKACTIVE);
 		break;
 	default:
[ofa-general] [PATCH 23/23] IB/ipath -- Minor fix to ordering of freeing and zeroing of tid pages.
From: Dave Olson [EMAIL PROTECTED]

Fixed to be the same as everywhere else: copy and then zero the page pointer in the array first, and then pass the copy to the VM routines.

Signed-off-by: Dave Olson [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_file_ops.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
index 016e7c4..5de3243 100644
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -538,6 +538,9 @@ static int ipath_tid_free(struct ipath_portdata *pd, unsigned subport,
 			continue;
 		cnt++;
 		if (dd->ipath_pageshadow[porttid + tid]) {
+			struct page *p;
+			p = dd->ipath_pageshadow[porttid + tid];
+			dd->ipath_pageshadow[porttid + tid] = NULL;
 			ipath_cdbg(VERBOSE, "PID %u freeing TID %u\n",
 				   pd->port_pid, tid);
 			dd->ipath_f_put_tid(dd, &tidbase[tid],
@@ -546,9 +549,7 @@ static int ipath_tid_free(struct ipath_portdata *pd, unsigned subport,
 			pci_unmap_page(dd->pcidev,
 				dd->ipath_physshadow[porttid + tid],
 				PAGE_SIZE, PCI_DMA_FROMDEVICE);
-			ipath_release_user_pages(
-				&dd->ipath_pageshadow[porttid + tid], 1);
-			dd->ipath_pageshadow[porttid + tid] = NULL;
+			ipath_release_user_pages(&p, 1);
 			ipath_stats.sps_pageunlocks++;
 		} else
 			ipath_dbg("Unused tid %u, ignoring\n", tid);
RE: [ofa-general] SDP ?
That should work fine. You might be able to build with -D_XPG4_2 as well.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Langston
Sent: Tuesday, October 09, 2007 10:13 AM
To: general@lists.openfabrics.org
Subject: [ofa-general] SDP ?

Hi all,

I'm working on porting SDP to OpenSolaris and am looking at a compile error that I get. Essentially, I have a conflict of types on the compile:

bash-3.00$ /opt/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I.. -g -D_POSIX_PTHREAD_SEMANTICS -DSYSCONFDIR=\"/usr/local/etc\" -g -D_POSIX_PTHREAD_SEMANTICS -c port.c -KPIC -DPIC -o .libs/port.o
"port.c", line 1896: identifier redeclared: getsockname
	current : function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to unsigned int) returning int
	previous: function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to void) returning int : "/usr/include/sys/socket.h", line 436

Line 436 in /usr/include/sys/socket.h:

extern int getsockname(int, struct sockaddr *_RESTRICT_KYWD, Psocklen_t);

and Psocklen_t:

#if defined(_XPG4_2) || defined(_BOOT)
typedef socklen_t *_RESTRICT_KYWD Psocklen_t;
#else
typedef void *_RESTRICT_KYWD Psocklen_t;
#endif	/* defined(_XPG4_2) || defined(_BOOT) */

Do I need to change the getsockname declaration in port.c to take a void * ?

Thanks,

Jim
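For context on the suggested -D_XPG4_2 fix: Solaris switches the third parameter of getsockname() between `void *` and `socklen_t *` depending on the feature-test macro, while POSIX specifies `socklen_t *`. Call sites that always pass a `socklen_t *` compile under either regime; a hedged sketch (hypothetical helper, not from the SDP port.c):

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read back the port a socket is bound to. Passing a socklen_t *
 * (never an int * or a bare void *) matches both the XPG4.2
 * prototype (socklen_t *) and the default Solaris one (void *). */
static int bound_port(int fd)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);	/* socklen_t, per POSIX */

	if (getsockname(fd, (struct sockaddr *)&sin, &len) != 0)
		return -1;
	return ntohs(sin.sin_port);
}
```

Redeclaring the function inside port.c with yet another signature is what triggers the clash, so dropping the local declaration and relying on `<sys/socket.h>` is usually the cleaner route.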
[ofa-general] [PATCH 15/23] IB/ipath - use counters in ipath_poll and cleanup interrupts in ipath_close
ipath_poll() suffered from a couple subtle bugs. Under the right conditions we could leave recv interrupts enabled on an ipath user context on close, thereby taking potentially unwanted interrupts on the next open -- this is fixed by unconditionally turning off recv interrupts on close. Also, we now use counters rather than set/clear bits which allows us to make sure we catch all interrupts at the cost of changing the semantics slightly (it's now give me all events since the last time I called poll() rather than give me all events since I called _this_ poll routine). We also added some memory barriers which may help ensure we get all notifications in a timely manner. Signed-off-by: Arthur Jones [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_file_ops.c | 67 -- drivers/infiniband/hw/ipath/ipath_intr.c | 33 - drivers/infiniband/hw/ipath/ipath_kernel.h |8 ++- 3 files changed, 57 insertions(+), 51 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c index 33ab0d6..016e7c4 100644 --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c @@ -1341,6 +1341,19 @@ bail: return ret; } +static unsigned ipath_poll_hdrqfull(struct ipath_portdata *pd) +{ + unsigned pollflag = 0; + + if ((pd-poll_type IPATH_POLL_TYPE_OVERFLOW) + pd-port_hdrqfull != pd-port_hdrqfull_poll) { + pollflag |= POLLIN | POLLRDNORM; + pd-port_hdrqfull_poll = pd-port_hdrqfull; + } + + return pollflag; +} + static unsigned int ipath_poll_urgent(struct ipath_portdata *pd, struct file *fp, struct poll_table_struct *pt) @@ -1350,22 +1363,20 @@ static unsigned int ipath_poll_urgent(struct ipath_portdata *pd, dd = pd-port_dd; - if (test_bit(IPATH_PORT_WAITING_OVERFLOW, pd-int_flag)) { - pollflag |= POLLERR; - clear_bit(IPATH_PORT_WAITING_OVERFLOW, pd-int_flag); - } + /* variable access in ipath_poll_hdrqfull() needs this */ + rmb(); + pollflag = ipath_poll_hdrqfull(pd); - if 
(test_bit(IPATH_PORT_WAITING_URG, pd-int_flag)) { + if (pd-port_urgent != pd-port_urgent_poll) { pollflag |= POLLIN | POLLRDNORM; - clear_bit(IPATH_PORT_WAITING_URG, pd-int_flag); + pd-port_urgent_poll = pd-port_urgent; } if (!pollflag) { + /* this saves a spin_lock/unlock in interrupt handler... */ set_bit(IPATH_PORT_WAITING_URG, pd-port_flag); - if (pd-poll_type IPATH_POLL_TYPE_OVERFLOW) - set_bit(IPATH_PORT_WAITING_OVERFLOW, - pd-port_flag); - + /* flush waiting flag so don't miss an event... */ + wmb(); poll_wait(fp, pd-port_wait, pt); } @@ -1376,31 +1387,27 @@ static unsigned int ipath_poll_next(struct ipath_portdata *pd, struct file *fp, struct poll_table_struct *pt) { - u32 head, tail; + u32 head; + u32 tail; unsigned pollflag = 0; struct ipath_devdata *dd; dd = pd-port_dd; + /* variable access in ipath_poll_hdrqfull() needs this */ + rmb(); + pollflag = ipath_poll_hdrqfull(pd); + head = ipath_read_ureg32(dd, ur_rcvhdrhead, pd-port_port); tail = *(volatile u64 *)pd-port_rcvhdrtail_kvaddr; - if (test_bit(IPATH_PORT_WAITING_OVERFLOW, pd-int_flag)) { - pollflag |= POLLERR; - clear_bit(IPATH_PORT_WAITING_OVERFLOW, pd-int_flag); - } - - if (tail != head || - test_bit(IPATH_PORT_WAITING_RCV, pd-int_flag)) { + if (head != tail) pollflag |= POLLIN | POLLRDNORM; - clear_bit(IPATH_PORT_WAITING_RCV, pd-int_flag); - } - - if (!pollflag) { + else { + /* this saves a spin_lock/unlock in interrupt handler */ set_bit(IPATH_PORT_WAITING_RCV, pd-port_flag); - if (pd-poll_type IPATH_POLL_TYPE_OVERFLOW) - set_bit(IPATH_PORT_WAITING_OVERFLOW, - pd-port_flag); + /* flush waiting flag so we don't miss an event */ + wmb(); set_bit(pd-port_port + INFINIPATH_R_INTRAVAIL_SHIFT, dd-ipath_rcvctrl); @@ -1917,6 +1924,12 @@ static int ipath_do_user_init(struct file *fp, ipath_cdbg(VERBOSE, Wrote port%d egrhead %x from tail regs\n, pd-port_port, head32); pd-port_tidcursor = 0; /* start at beginning after open */ + + /* initialize poll variables... 
*/ + pd-port_urgent = 0; + pd-port_urgent_poll = 0; +
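The counter scheme described above ("all events since the last time I called poll()") compares a live event counter against a per-consumer snapshot, instead of a sticky bit that a racing producer could have no way to re-assert. A minimal single-threaded model, with hypothetical names:

```c
#include <assert.h>

/* Producer side (e.g. an interrupt handler) only increments. */
struct evt {
	unsigned count;	/* total events ever fired */
	unsigned snap;	/* value of count at the consumer's last poll */
};

static void evt_fire(struct evt *e)
{
	e->count++;
}

/* Consumer: events are pending iff the counter moved since the last
 * snapshot; taking a new snapshot consumes them. In this single-
 * threaded model nothing is lost, because any increment advances
 * count and will be seen by a later poll. */
static int evt_pending(struct evt *e)
{
	if (e->count != e->snap) {
		e->snap = e->count;
		return 1;
	}
	return 0;
}
```

The kernel version pairs this with memory barriers (the wmb()/rmb() calls in the patch) so that counter updates and the waiting flag are seen in order across CPUs.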
Re: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses
Did we ever get any confirmation that this fixed the problem that Olaf saw?
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Jeff Garzik [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 08:44:25 -0400 David Miller wrote: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Tue, 9 Oct 2007 16:51:14 +0530 David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM: Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it. Great, this will make life easy. Any idea how long that would take? It seems simple enough to do. I'd say we can probably try to get rid of it in 2.6.25, this is assuming we get driver authors to cooperate and do the conversions or alternatively some other motivated person. I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :) Thanks, but it's probably going to need some corrections and/or an audit. If you unconditionally take those locks in the transmit function, there is probably an ABBA deadlock elsewhere in the driver now, most likely in the TX reclaim processing, and you therefore need to handle that too. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
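The ABBA deadlock David warns about arises when the converted xmit path takes the queue lock and then a private driver lock, while the TX-reclaim path (written for LLTX) takes the same two locks in the opposite order. The conventional cure is a single global acquisition order; a deliberately simplified toy model (flag "locks", names illustrative, nothing from the bonding or batching patches):

```c
#include <assert.h>
#include <stdint.h>

/* Toy lock: 0 = free, 1 = held. acquire() asserts it is free, which
 * is how a real ABBA would manifest as a hang rather than a trip. */
static void acquire(int *l) { assert(*l == 0); *l = 1; }
static void release(int *l) { assert(*l == 1); *l = 0; }

/* Always take the lower-addressed lock first, so no two code paths
 * can ever hold the pair in opposite orders. */
static void lock_pair(int *a, int *b)
{
	if ((uintptr_t)a > (uintptr_t)b) { int *t = a; a = b; b = t; }
	acquire(a);
	acquire(b);
}

static void unlock_pair(int *a, int *b)
{
	if ((uintptr_t)a > (uintptr_t)b) { int *t = a; a = b; b = t; }
	release(b);	/* reverse order on the way out */
	release(a);
}
```

In real drivers the fix is usually not address ordering but restructuring reclaim so it never needs the TX lock while holding its private lock; the invariant (one agreed order per lock pair) is the same.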
Re: [ofa-general] [PATCH] fix some ehca limits
I didn't see a response to my earlier email about the other uses of min_t(int, x, INT_MAX), so I fixed it up myself and added this to my tree. I don't have a working setup to test yet, so please let me know if you see anything wrong with this:

commit 919225e60a1a73e3518f257f040f74e9379a61c3
Author: Roland Dreier [EMAIL PROTECTED]
Date:   Tue Oct 9 13:17:42 2007 -0700

    IB/ehca: Fix clipping of device limits to INT_MAX

    Doing min_t(int, foo, INT_MAX) doesn't work correctly, because if foo
    is bigger than INT_MAX, then when treated as a signed integer, it
    will become negative and hence such an expression is just an
    elaborate NOP.

    Fix such cases in ehca to do min_t(unsigned, foo, INT_MAX) instead.
    This fixes negative reported values for max_cqe, max_pd and max_ah:

    Before:
        max_cqe: -64
        max_pd:  -1
        max_ah:  -1

    After:
        max_cqe: 2147483647
        max_pd:  2147483647
        max_ah:  2147483647

    Based on a bug report and fix from Anton Blanchard [EMAIL PROTECTED].

    Signed-off-by: Roland Dreier [EMAIL PROTECTED]

diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index 3436c49..4aa3ffa 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -82,17 +82,17 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 	props->vendor_id = rblock->vendor_id >> 8;
 	props->vendor_part_id = rblock->vendor_part_id >> 16;
 	props->hw_ver = rblock->hw_ver;
-	props->max_qp = min_t(int, rblock->max_qp, INT_MAX);
-	props->max_qp_wr = min_t(int, rblock->max_wqes_wq, INT_MAX);
-	props->max_sge = min_t(int, rblock->max_sge, INT_MAX);
-	props->max_sge_rd = min_t(int, rblock->max_sge_rd, INT_MAX);
-	props->max_cq = min_t(int, rblock->max_cq, INT_MAX);
-	props->max_cqe = min_t(int, rblock->max_cqe, INT_MAX);
-	props->max_mr = min_t(int, rblock->max_mr, INT_MAX);
-	props->max_mw = min_t(int, rblock->max_mw, INT_MAX);
-	props->max_pd = min_t(int, rblock->max_pd, INT_MAX);
-	props->max_ah = min_t(int, rblock->max_ah, INT_MAX);
-	props->max_fmr = min_t(int, rblock->max_mr, INT_MAX);
+	props->max_qp = min_t(unsigned, rblock->max_qp, INT_MAX);
+	props->max_qp_wr = min_t(unsigned, rblock->max_wqes_wq, INT_MAX);
+	props->max_sge = min_t(unsigned, rblock->max_sge, INT_MAX);
+	props->max_sge_rd = min_t(unsigned, rblock->max_sge_rd, INT_MAX);
+	props->max_cq = min_t(unsigned, rblock->max_cq, INT_MAX);
+	props->max_cqe = min_t(unsigned, rblock->max_cqe, INT_MAX);
+	props->max_mr = min_t(unsigned, rblock->max_mr, INT_MAX);
+	props->max_mw = min_t(unsigned, rblock->max_mw, INT_MAX);
+	props->max_pd = min_t(unsigned, rblock->max_pd, INT_MAX);
+	props->max_ah = min_t(unsigned, rblock->max_ah, INT_MAX);
+	props->max_fmr = min_t(unsigned, rblock->max_mr, INT_MAX);
 
 	if (EHCA_BMASK_GET(HCA_CAP_SRQ, shca->hca_cap)) {
 		props->max_srq = props->max_qp;
@@ -104,15 +104,15 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 	props->local_ca_ack_delay = rblock->local_ca_ack_delay;
 	props->max_raw_ipv6_qp
-		= min_t(int, rblock->max_raw_ipv6_qp, INT_MAX);
+		= min_t(unsigned, rblock->max_raw_ipv6_qp, INT_MAX);
 	props->max_raw_ethy_qp
-		= min_t(int, rblock->max_raw_ethy_qp, INT_MAX);
+		= min_t(unsigned, rblock->max_raw_ethy_qp, INT_MAX);
 	props->max_mcast_grp
-		= min_t(int, rblock->max_mcast_grp, INT_MAX);
+		= min_t(unsigned, rblock->max_mcast_grp, INT_MAX);
 	props->max_mcast_qp_attach
-		= min_t(int, rblock->max_mcast_qp_attach, INT_MAX);
+		= min_t(unsigned, rblock->max_mcast_qp_attach, INT_MAX);
 	props->max_total_mcast_qp_attach
-		= min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX);
+		= min_t(unsigned, rblock->max_total_mcast_qp_attach, INT_MAX);
 
 	/* translate device capabilities */
 	props->device_cap_flags = IB_DEVICE_SYS_IMAGE_GUID |
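The bug is easy to reproduce in userspace: with a signed type, min_t() first truncates the 64-bit device limit to int, so a large value wraps negative and then trivially "wins" the comparison, exactly the elaborate NOP the commit message describes. A reduction using a userspace rendition of the kernel's min_t() macro (GCC statement-expression form):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Userspace rendition of the kernel's min_t() macro. */
#define min_t(type, x, y) ({		\
	type __x = (x);			\
	type __y = (y);			\
	__x < __y ? __x : __y; })

/* 0xFFFFFFFF truncated to a (32-bit, two's-complement) int is -1,
 * which compares as smaller than INT_MAX, so no clamping happens. */
static int clamp_signed(uint64_t v)
{
	return min_t(int, v, INT_MAX);
}

/* Comparing as unsigned actually clamps oversized values. */
static int clamp_unsigned(uint64_t v)
{
	return min_t(unsigned, v, INT_MAX);
}
```

This is why only the first type argument of min_t() changed in the patch: the comparison type, not the destination field, determines whether the clamp works.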
[ofa-general] [PATCH 16/23] IB/ipath - iba6110 rev4 no longer needs recv header overrun workaround
iba6110 rev3 and earlier had a chip bug where the chip could overrun the recv header queue. rev4 fixed this chip bug so userspace no longer needs to work around it. Now we only set the workaround flag for older chip versions.

Signed-off-by: Arthur Jones [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_iba6110.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c
index e1c5998..d4940be 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6110.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c
@@ -1599,8 +1599,10 @@ static int ipath_ht_get_base_info(struct ipath_portdata *pd, void *kbase)
 {
 	struct ipath_base_info *kinfo = kbase;
 
-	kinfo->spi_runtime_flags |= IPATH_RUNTIME_HT |
-		IPATH_RUNTIME_RCVHDR_COPY;
+	kinfo->spi_runtime_flags |= IPATH_RUNTIME_HT;
+
+	if (pd->port_dd->ipath_minrev < 4)
+		kinfo->spi_runtime_flags |= IPATH_RUNTIME_RCVHDR_COPY;
 
 	return 0;
 }
[ofa-general] [PATCH 11/23] IB/ipath - implement IB_EVENT_QP_LAST_WQE_REACHED
From: Ralph Campbell [EMAIL PROTECTED] This patch implements the IB_EVENT_QP_LAST_WQE_REACHED event which is needed by ib_ipoib to destroy the QP when used in connected mode. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_qp.c| 20 +--- drivers/infiniband/hw/ipath/ipath_rc.c| 12 +++- drivers/infiniband/hw/ipath/ipath_verbs.h |2 +- 3 files changed, 29 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index a8c4a6b..6a41fdb 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -377,13 +377,15 @@ static void ipath_reset_qp(struct ipath_qp *qp) * @err: the receive completion error to signal if a RWQE is active * * Flushes both send and receive work queues. + * Returns true if last WQE event should be generated. * The QP s_lock should be held and interrupts disabled. */ -void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) +int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) { struct ipath_ibdev *dev = to_idev(qp-ibqp.device); struct ib_wc wc; + int ret = 0; ipath_dbg(QP%d/%d in error state\n, qp-ibqp.qp_num, qp-remote_qpn); @@ -454,7 +456,10 @@ void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) wq-tail = tail; spin_unlock(qp-r_rq.lock); - } + } else if (qp-ibqp.event_handler) + ret = 1; + + return ret; } /** @@ -473,6 +478,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, struct ipath_qp *qp = to_iqp(ibqp); enum ib_qp_state cur_state, new_state; unsigned long flags; + int lastwqe = 0; int ret; spin_lock_irqsave(qp-s_lock, flags); @@ -532,7 +538,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, break; case IB_QPS_ERR: - ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); + lastwqe = ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); break; default: @@ -591,6 +597,14 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, qp-state = new_state; 
spin_unlock_irqrestore(qp-s_lock, flags); + if (lastwqe) { + struct ib_event ev; + + ev.device = qp-ibqp.device; + ev.element.qp = qp-ibqp; + ev.event = IB_EVENT_QP_LAST_WQE_REACHED; + qp-ibqp.event_handler(ev, qp-ibqp.qp_context); + } ret = 0; goto bail; diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 53259da..5c29b2b 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -1497,11 +1497,21 @@ send_ack: static void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err) { unsigned long flags; + int lastwqe; spin_lock_irqsave(qp-s_lock, flags); qp-state = IB_QPS_ERR; - ipath_error_qp(qp, err); + lastwqe = ipath_error_qp(qp, err); spin_unlock_irqrestore(qp-s_lock, flags); + + if (lastwqe) { + struct ib_event ev; + + ev.device = qp-ibqp.device; + ev.element.qp = qp-ibqp; + ev.event = IB_EVENT_QP_LAST_WQE_REACHED; + qp-ibqp.event_handler(ev, qp-ibqp.qp_context); + } } static inline void ipath_update_ack_queue(struct ipath_qp *qp, unsigned n) diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 619ad72..a197229 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -672,7 +672,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, int ipath_destroy_qp(struct ib_qp *ibqp); -void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err); +int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_udata *udata); ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
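The event only matters to a consumer such as IPoIB connected mode, which waits for it before destroying the QP. As a hedged illustration of the decision logic in this patch — a standalone userspace model, not the real ipath code; all struct and function names below are invented stand-ins — the rule appears to be: flag the last-WQE event when the QP's receive queue is not flushed locally (e.g. it uses an SRQ) and the consumer registered an event handler, and dispatch the event only after the s_lock is dropped.

```c
#include <assert.h>

#define MODEL_EV_LAST_WQE 1   /* stand-in value, not the kernel's */

/* Simplified stand-ins for the kernel structures (hypothetical). */
struct model_qp {
	int has_recv_queue;   /* non-SRQ QP: has its own receive queue to flush */
	int has_handler;      /* consumer registered an async event handler     */
	int last_event;       /* last async event delivered (0 = none)          */
};

/* Mirrors the ipath_error_qp() change: flush the work queues, and return 1
 * when a last-WQE event should be generated (no local receive queue to
 * flush, and a handler is registered). */
static int model_error_qp(struct model_qp *qp)
{
	/* ...send/receive queue flushing happens here in the real code... */
	if (qp->has_recv_queue)
		return 0;
	return qp->has_handler;
}

/* Mirrors the caller: compute lastwqe under the lock, dispatch after. */
static void model_modify_to_error(struct model_qp *qp)
{
	int lastwqe = model_error_qp(qp);
	/* ...spin_unlock_irqrestore(&qp->s_lock, flags) in the real code... */
	if (lastwqe)
		qp->last_event = MODEL_EV_LAST_WQE;
}
```

The point of returning a flag instead of firing the event inside ipath_error_qp() is that the event handler must not be invoked with the QP's s_lock held.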
[ofa-general] [PATCH 10/23] IB/ipath - generate flush CQE when QP is in error state.
From: Ralph Campbell [EMAIL PROTECTED]

Follow the IB spec (C10-96) for post send, which states that a flushed completion event should be generated when the QP is in the error state.

Signed-off-by: Ralph Campbell [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_verbs.c | 22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 3cc82b6..495194b 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -230,6 +230,18 @@ void ipath_skip_sge(struct ipath_sge_state *ss, u32 length)
 	}
 }
 
+static void ipath_flush_wqe(struct ipath_qp *qp, struct ib_send_wr *wr)
+{
+	struct ib_wc wc;
+
+	memset(&wc, 0, sizeof(wc));
+	wc.wr_id = wr->wr_id;
+	wc.status = IB_WC_WR_FLUSH_ERR;
+	wc.opcode = ib_ipath_wc_opcode[wr->opcode];
+	wc.qp = &qp->ibqp;
+	ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1);
+}
+
 /**
  * ipath_post_one_send - post one RC, UC, or UD send work request
  * @qp: the QP to post on
@@ -248,8 +260,14 @@ static int ipath_post_one_send(struct ipath_qp *qp, struct ib_send_wr *wr)
 	spin_lock_irqsave(&qp->s_lock, flags);
 
 	/* Check that state is OK to post send. */
-	if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK))
-		goto bail_inval;
+	if (unlikely(!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK))) {
+		if (qp->state != IB_QPS_SQE && qp->state != IB_QPS_ERR)
+			goto bail_inval;
+		/* C10-96 says generate a flushed completion entry. */
+		ipath_flush_wqe(qp, wr);
+		ret = 0;
+		goto bail;
+	}
 
 	/* IB spec says that num_sge == 0 is OK. */
 	if (wr->num_sge > qp->s_max_sge)
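The C10-96 behavior is easy to get backwards: posting to a QP in the SQE or ERR state must *succeed* but yield a flushed completion, while posting in other non-sendable states is an error. A hedged, self-contained userspace model of that rule (stand-in names and state values, not the real ipath types):

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's states and completion codes. */
enum qp_state { QPS_RTS, QPS_SQE, QPS_ERR, QPS_RESET };
enum wc_status { WC_SUCCESS, WC_WR_FLUSH_ERR };

struct model_cq { enum wc_status last_status; int entries; };
struct model_qp { enum qp_state state; struct model_cq *send_cq; };

/* Models ipath_flush_wqe(): enter a flushed completion on the send CQ. */
static void model_flush_wqe(struct model_qp *qp)
{
	qp->send_cq->last_status = WC_WR_FLUSH_ERR;
	qp->send_cq->entries++;
}

/* Models the patched ipath_post_one_send() state check. */
static int model_post_send(struct model_qp *qp)
{
	if (qp->state == QPS_RTS) {
		qp->send_cq->last_status = WC_SUCCESS;
		qp->send_cq->entries++;
		return 0;
	}
	if (qp->state != QPS_SQE && qp->state != QPS_ERR)
		return -1;                /* -EINVAL in the real code */
	/* C10-96: generate a flushed completion entry instead of failing. */
	model_flush_wqe(qp);
	return 0;
}
```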
Re: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses
Did we ever get any confirmation that this fixed the problem that Olaf saw? No. I haven't seen a response.
[ofa-general] [PATCHES] TX batching rev2.5
Please provide feedback on the code and/or architecture. The patches are now updated to work with the latest rebased net-2.6.24 from a few hours ago. I am in travel mode so I won't have time to do more testing for the next few days; I do consider this to be stable at this point based on what I have been testing (famous last words).

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: Get rid of dev->gso_skb

What has changed since I posted last:

1) There was some cruft left over from the prep_frame feature that I forgot to remove last time; it is now gone.
2) In the shower this AM, I realized it is plausible that a batch of packets sent to the driver may all be dropped because they are badly formatted. Current drivers return NETDEV_TX_OK in all such cases, which would cause dev->hard_end_xmit() to be invoked unnecessarily. I already had a NETDEV_TX_DROPPED within the batching drivers, so I made it global and have the batching drivers return it when they drop packets. The core calls dev->hard_end_xmit() only when at least one packet makes it through.

Things I am gonna say that nobody will see (wink): Dave, please let me know if this meets your desire to let devices which are SG-capable and able to compute checksums benefit, just in case I misunderstood. Herbert, if you can look at at least patch 3 I will appreciate it (since it kills the dev->gso_skb that you introduced).

UPCOMING PATCHES
----------------
As before, more patches will follow if I get some feedback; I didn't want to overload people by dumping too many patches. Most of the patches mentioned below are ready to go; some need re-testing and others need a little porting from an earlier kernel:
- tg3 driver
- tun driver
- pktgen
- netiron driver
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver howto that I will post later today.
PERFORMANCE TESTING

The system under test is still a 2x dual-core Opteron with a couple of tg3s. A test tool generates UDP traffic of different sizes for up to 60 seconds per run, or a total of 30M packets. I have 4 threads, each running on a specific CPU, which keep all the CPUs as busy as they can sending packets targeted at a directly connected box's UDP discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming UDP packets to the discard port; this allows me to make sure that the receiver is not the bottleneck in the testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}. Each packet-size run is repeated 10 times to ensure that there are no transients, and the average of the 10 runs is then computed and collected. I also plan to run forwarding and TCP tests in the future when the dust settles. cheers, jamal
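The NETDEV_TX_DROPPED rule described earlier in this message — the core calls dev->hard_end_xmit() only when at least one packet in the batch actually made it onto the ring — can be sketched as a standalone userspace model (names and return values here are illustrative stand-ins, not the net core):

```c
#include <assert.h>

enum { TX_OK = 0, TX_BUSY = 1, TX_DROPPED = 2 };  /* NETDEV_TX_* stand-ins */

struct model_dev {
	int end_xmit_calls;   /* how often hard_end_xmit() (IO part "d") ran */
};

/* results[] holds the per-packet return of the driver's hard_start_xmit().
 * Returns how many packets were actually queued on the ring. */
static int model_batch_xmit(struct model_dev *dev, const int *results, int n)
{
	int i, sent = 0;

	for (i = 0; i < n; i++) {
		if (results[i] == TX_BUSY)
			break;           /* ring full: stop the batch here */
		if (results[i] == TX_OK)
			sent++;          /* TX_DROPPED: bad packet, skip it */
	}

	/* Only kick the DMA engine / flush IO if something was queued. */
	if (sent > 0)
		dev->end_xmit_calls++;
	return sent;
}
```

A batch that is dropped in its entirety thus never triggers the (relatively expensive) IO completion step.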
[ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface Rev2.5
This patch introduces the netdevice interface for batching. cheers, jamal [NET_BATCH] Introduce batching interface This patch introduces the netdevice interface for batching. BACKGROUND - A driver dev-hard_start_xmit() has 4 typical parts: a) packet formating (example vlan, mss, descriptor counting etc) b) chip specific formatting c) enqueueing the packet on a DMA ring d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interupts, set last tx time, etc [For code cleanliness/readability sake, regardless of this work, one should break the dev-hard_start_xmit() into those 4 functions anyways]. INTRODUCING API --- With the api introduced in this patch, a driver which has all 4 parts and needing to support batching is advised to split its dev-hard_start_xmit() in the following manner: 1)Remove #d from dev-hard_start_xmit() and put it in dev-hard_end_xmit() method. 2)#b and #c can stay in -hard_start_xmit() (or whichever way you want to do this) 3) #a is deffered to future work to reduce confusion (since it holds on its own). Note: There are drivers which may need not support any of the two approaches (example the tun driver i patched) so the methods are optional. xmit_win variable is set by the driver to tell the core how much space it has to take on new skbs. It is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we dont requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us when it invokes netif_wake_queue how much space it has for descriptors by setting this variable. Refer to the driver howto for more details. THEORY OF OPERATION --- 1. Core dequeues from qdiscs upto dev-xmit_win packets. Fragmented and GSO packets are accounted for as well. 2. Core grabs TX_LOCK 3. Core loop for all skbs: invokes driver dev-hard_start_xmit() 4. 
Core invokes driver dev-hard_end_xmit() ACKNOWLEDGEMENT AND SOME HISTORY There's a lot of history and reasoning of why batching in a document i am writting which i may submit as a patch. Thomas Graf (who doesnt know this probably) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented in SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me. Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller, and Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and neterion folks have been outstanding in their help. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit 98d39ea7922fa2719a80eecd02cae359f3d7 tree 63822bf3040ea41846399c589c912c2be654f008 parent 7b4cd20628fe5c4e145c383fcd8d954d38f7be61 author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400 include/linux/netdevice.h |9 +- net/core/dev.c| 67 ++--- net/sched/sch_generic.c |4 +-- 3 files changed, 73 insertions(+), 7 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 91cd3f3..b0e71c9 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -86,6 +86,7 @@ struct wireless_dev; /* Driver transmit return codes */ #define NETDEV_TX_OK 0 /* driver took care of packet */ #define NETDEV_TX_BUSY 1 /* driver tx path was busy*/ +#define NETDEV_TX_DROPPED 2 /* driver tx path dropped packet*/ #define NETDEV_TX_LOCKED -1 /* driver tx lock was already taken */ /* @@ -467,6 +468,7 @@ struct net_device #define NETIF_F_NETNS_LOCAL 8192 /* Does not 
change network namespaces */ #define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ #define NETIF_F_LRO 32768 /* large receive offload */ +#define NETIF_F_BTX 65536 /* Capable of batch tx */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 @@ -595,6 +597,9 @@ struct net_device void *priv; /* pointer to private data */ int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); + void (*hard_end_xmit) (struct net_device *dev); + int xmit_win; + /* These may be needed for future network-power-down code. */ unsigned long trans_start; /* Time (in jiffies) of last Tx */ @@ -609,6 +614,7 @@ struct net_device /* delayed register/unregister */ struct list_head todo_list; + struct sk_buff_head blist; /* device index hash chain */ struct hlist_node index_hlist; @@ -1043,7 +1049,8 @@
[ofa-general] [PATCH 2/3][NET_BATCH] Rev2.5 net core use batching
This patch adds the usage of batching within the core. cheers, jamal [NET_BATCH] net core use batching This patch adds the usage of batching within the core. Performance results demonstrating improvement are provided separately. I have #if-0ed some of the old functions so the patch is more readable. A future patch will remove all if-0ed content. Patrick McHardy eyeballed a bug that will cause re-ordering in case of a requeue. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit c73d8ee8cce61a98f55fbfb2cafe813a7eca472c tree 8b9155fe15baa4c2e7adb69585c7aa275a6bc896 parent 98d39ea7922fa2719a80eecd02cae359f3d7 author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400 net/sched/sch_generic.c | 104 ++- 1 files changed, 94 insertions(+), 10 deletions(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 424c08b..d98c680 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q) return q-q.qlen; } +#if 0 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { @@ -110,6 +111,85 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, return ret; } +#endif + +static inline int handle_dev_cpu_collision(struct net_device *dev) +{ + if (unlikely(dev-xmit_lock_owner == smp_processor_id())) { + if (net_ratelimit()) + printk(KERN_WARNING +Dead loop on netdevice %s, fix it urgently!\n, +dev-name); + return 1; + } + __get_cpu_var(netdev_rx_stat).cpu_collision++; + return 0; +} + +static inline int +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + + struct sk_buff *skb; + + while ((skb = __skb_dequeue_tail(skbs)) != NULL) + q-ops-requeue(skb, q); + + netif_schedule(dev); + return 0; +} + +static inline int +xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + int ret 
= handle_dev_cpu_collision(dev); + + if (ret) { + if (!skb_queue_empty(skbs)) + skb_queue_purge(skbs); + return qdisc_qlen(q); + } + + return dev_requeue_skbs(skbs, dev, q); +} + +static int xmit_count_skbs(struct sk_buff *skb) +{ + int count = 0; + for (; skb; skb = skb-next) { + count += skb_shinfo(skb)-nr_frags; + count += 1; + } + return count; +} + +static int xmit_get_pkts(struct net_device *dev, + struct Qdisc *q, + struct sk_buff_head *pktlist) +{ + struct sk_buff *skb; + int count = dev-xmit_win; + + if (count dev-gso_skb) { + skb = dev-gso_skb; + dev-gso_skb = NULL; + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + while (count 0) { + skb = q-dequeue(q); + if (!skb) + break; + + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + return skb_queue_len(pktlist); +} /* * NOTE: Called under dev-queue_lock with locally disabled BH. @@ -133,19 +213,20 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, static inline int qdisc_restart(struct net_device *dev) { struct Qdisc *q = dev-qdisc; - struct sk_buff *skb; - int ret, xcnt = 0; + int ret = 0; - /* Dequeue packet */ - if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) - return 0; + /* Dequeue packets */ + ret = xmit_get_pkts(dev, q, dev-blist); + if (!ret) + return 0; - /* And release queue */ + /* We got em packets */ spin_unlock(dev-queue_lock); + /* bye packets */ HARD_TX_LOCK(dev, smp_processor_id()); - ret = dev_hard_start_xmit(skb, dev, xcnt); + ret = dev_batch_xmit(dev); HARD_TX_UNLOCK(dev); spin_lock(dev-queue_lock); @@ -158,8 +239,8 @@ static inline int qdisc_restart(struct net_device *dev) break; case NETDEV_TX_LOCKED: - /* Driver try lock failed */ - ret = handle_dev_cpu_collision(skb, dev, q); + /* Driver lock failed */ + ret = xmit_islocked(dev-blist, dev, q); break; default: @@ -168,7 +249,7 @@ static inline int qdisc_restart(struct net_device *dev) printk(KERN_WARNING BUG %s code %d qlen %d\n, dev-name, ret, q-q.qlen); - ret = 
dev_requeue_skb(skb, dev, q); + ret = dev_requeue_skbs(dev-blist, dev, q); break; } @@ -564,6 +645,9 @@ void dev_deactivate(struct net_device *dev) skb = dev-gso_skb; dev-gso_skb = NULL; + if (!skb_queue_empty(dev-blist)) + skb_queue_purge(dev-blist); + dev-xmit_win = 1; spin_unlock_bh(dev-queue_lock); kfree_skb(skb); ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
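The core of this patch is the window accounting in xmit_count_skbs()/xmit_get_pkts(): each skb is charged one descriptor plus one per fragment against dev->xmit_win. As a hedged, self-contained userspace model of that accounting (stand-in types, not real skbs — the real code also folds dev->gso_skb into the first iteration):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal skb stand-in: singly linked queue with a fragment count. */
struct model_skb {
	struct model_skb *next;
	int nr_frags;
};

/* Mirrors xmit_count_skbs(): 1 descriptor per skb plus one per fragment. */
static int model_count_skbs(struct model_skb *skb)
{
	int count = 0;

	for (; skb; skb = skb->next)
		count += 1 + skb->nr_frags;
	return count;
}

/* Mirrors xmit_get_pkts(): pull packets from the queue until the device's
 * advertised window worth of descriptors is consumed (the last packet may
 * overshoot, just as `count` can go negative in the patch). */
static int model_get_pkts(struct model_skb **queue, int xmit_win,
			  struct model_skb **batch, int batch_max)
{
	int n = 0, count = xmit_win;

	while (count > 0 && *queue && n < batch_max) {
		struct model_skb *skb = *queue;

		*queue = skb->next;
		skb->next = NULL;
		count -= model_count_skbs(skb);
		batch[n++] = skb;
	}
	return n;
}
```

Because the driver promised at least xmit_win descriptors, everything dequeued here can be handed to hard_start_xmit() without needing a requeue on the common path.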
[ofa-general] [PATCH 3/3][NET_BATCH] Rev2.5 kill dev-gso_skb
This patch removes dev-gso_skb as it is no longer necessary with batching code. cheers, jamal [NET_BATCH] kill dev-gso_skb The batching code does what gso used to batch at the drivers. There is no more need for gso_skb. If for whatever reason the requeueing is a bad idea we are going to leave packets in dev-blist (and still not need dev-gso_skb) Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit fac8a4147548f314d4edb74634e78e5b06e0e135 tree 72114acb327bc7e3eb219275df6b3aab7459795c parent c73d8ee8cce61a98f55fbfb2cafe813a7eca472c author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:22:43 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:22:43 -0400 include/linux/netdevice.h |3 --- net/sched/sch_generic.c | 12 2 files changed, 0 insertions(+), 15 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b0e71c9..7592a56 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -578,9 +578,6 @@ struct net_device struct list_head qdisc_list; unsigned long tx_queue_len; /* Max frames per queue allowed */ - /* Partially transmitted GSO packet. 
*/ - struct sk_buff *gso_skb; - /* ingress path synchronizer */ spinlock_t ingress_lock; struct Qdisc *qdisc_ingress; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index d98c680..36b6972 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev, struct sk_buff *skb; int count = dev-xmit_win; - if (count dev-gso_skb) { - skb = dev-gso_skb; - dev-gso_skb = NULL; - count -= xmit_count_skbs(skb); - __skb_queue_tail(pktlist, skb); - } - while (count 0) { skb = q-dequeue(q); if (!skb) @@ -635,7 +628,6 @@ void dev_activate(struct net_device *dev) void dev_deactivate(struct net_device *dev) { struct Qdisc *qdisc; - struct sk_buff *skb; spin_lock_bh(dev-queue_lock); qdisc = dev-qdisc; @@ -643,15 +635,11 @@ void dev_deactivate(struct net_device *dev) qdisc_reset(qdisc); - skb = dev-gso_skb; - dev-gso_skb = NULL; if (!skb_queue_empty(dev-blist)) skb_queue_purge(dev-blist); dev-xmit_win = 1; spin_unlock_bh(dev-queue_lock); - kfree_skb(skb); - dev_watchdog_down(dev); /* Wait for outstanding dev_queue_xmit calls. */ ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCHES] TX batching rev2.5
On Tue, 2007-09-10 at 18:07 -0400, jamal wrote:
> Please provide feedback on the code and/or architecture. They are now
> updated to work with the latest rebased net-2.6.24 from a few hours ago.

I should have added that I have tested this with just the batching changes, and it is within the performance realm of the changes from yesterday. If anyone wants exact numbers, I can send them.

cheers, jamal
[ofa-general] [DOC][NET_BATCH]Rev2.5 Driver Howto
I updated this doc to match the latest patch. cheers, jamal

Here's the beginning of a howto for driver authors. The intended audience for this howto is people already familiar with netdevices.

1.0 Netdevice Prerequisites
---------------------------

For hardware-based netdevices, you must have at least hardware that is capable of doing DMA with many descriptors; i.e., having hardware with a queue length of 3 (as in some fscked ethernet hardware) is not very useful in this case.

2.0 What is new in the driver API
---------------------------------

There is 1 new method and one new variable introduced that the driver author needs to be aware of. These are:
1) dev->hard_end_xmit()
2) dev->xmit_win

2.1 Using core driver changes
-----------------------------

To provide context, let's look at a typical driver abstraction for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interrupts, etc.

[For code cleanliness/readability sake, regardless of this work, one should break dev->hard_start_xmit() into those 4 functional blocks anyways.]

A driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_end_xmit() method to achieve #d
2) use dev->xmit_win to tell the core how much space you have.

#b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this). Section 3 shows more details on the suggested usage.

2.1.1 Theory of operation
-------------------------

1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented and GSO packets are accounted for as well.
2. Core grabs the device's TX_LOCK.
3. Core loops over all skbs, invoking the driver's dev->hard_start_xmit().
4. Core invokes the driver's dev->hard_end_xmit() if packets were xmitted.

2.1.1.1 The slippery LLTX
-------------------------

Since these types of drivers are being phased out and they require extra code, they will not be supported anymore. So as of Oct 2007 the code that supports them has been removed.

2.1.1.2 xmit_win
----------------

The dev->xmit_win variable is set by the driver to tell us how much space it has in its rings/queues. This detail is then used to figure out how many packets are retrieved from the qdisc queues (in order to send to the driver). dev->xmit_win is introduced to ensure that when we pass the driver a list of packets it will swallow all of them -- which is useful because we don't requeue to the qdisc (and avoids burning unnecessary CPU cycles or introducing any strange re-ordering). Essentially the driver signals us how much space it has for descriptors by setting this variable.

2.1.1.2.1 Setting xmit_win
--------------------------

This variable should be set during xmit path shutdown (netif_stop), wakeup (netif_wake) and ->hard_end_xmit(). In the first case the value is set to 1, and in the other two it is set to whatever the driver deems to be available space on the ring.

3.0 Driver Essentials
---------------------

The typical driver tx state machine is:

-1- +Core sends packets
    +-- Driver puts packet onto hardware queue
    +   if hardware queue is full, netif_stop_queue(dev)
-2- +core stops sending because of netif_stop_queue(dev)
    ..
    .. time passes ...
    ..
-3- +--- driver has transmitted packets, opens up tx path by
    invoking netif_wake_queue(dev)
-1- +Cycle repeats and core sends more packets (step 1).

3.1 Driver prerequisite
-----------------------

This is _a very important_ requirement in making batching useful. The prerequisite for the batching changes is that the driver should provide a low threshold to open up the tx path. Drivers such as tg3 and e1000 already do this. Before you invoke netif_wake_queue(dev) you check if a threshold of space has been reached to insert new packets.

Here's an example of how I added it to the tun driver. Observe the setting of dev->xmit_win.

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Here's how the batching e1000 driver does it:

---
	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
		if (netif_queue_stopped(netdev)) {
			int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
			netdev->xmit_win = rspace;
			netif_wake_queue(netdev);
		}
	}
---

The equivalent in the tg3 code (with no batching changes) looks like:

---
	if (netif_queue_stopped(tp->dev) &&
	    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
		netif_wake_queue(tp->dev);
---
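The low-threshold wake logic from section 3.1 and the xmit_win setting from section 2.1.1.2.1 can be combined into one self-contained userspace sketch. The constants and names here are illustrative (not from any real driver); the shape follows the e1000-style example above:

```c
#include <assert.h>

/* Illustrative sizing, not from a real driver. */
#define RING_SIZE   256
#define WAKE_THRESH  32  /* low threshold that reopens the tx path early */

struct model_dev {
	int ring_used;   /* descriptors currently in flight                */
	int stopped;     /* netif_queue_stopped() stand-in                 */
	int xmit_win;    /* space advertised to the batching core          */
};

static int ring_unused(const struct model_dev *d)
{
	return RING_SIZE - d->ring_used;
}

/* Models the tx-completion path: once enough descriptors have drained,
 * advertise the free ring space in xmit_win and wake the queue. */
static void model_tx_complete(struct model_dev *d, int cleaned)
{
	d->ring_used -= cleaned;
	if (d->stopped && ring_unused(d) >= WAKE_THRESH) {
		d->xmit_win = ring_unused(d);
		d->stopped = 0;          /* netif_wake_queue() stand-in */
	}
}
```

Waking early with an accurate xmit_win is what lets the core hand over a whole batch that the ring is guaranteed to accept.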
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Before you add new entries to your list, how is that ibm driver NAPI conversion coming along? :-) OK, thanks for the kick in the pants, I have a couple of patches for net-2.6.24 coming (including an unrelated trivial warning fix for IPoIB). - R.
[ofa-general] [PATCH 1/1] IB/SDP - Zero copy bcopy support
This patch adds zero copy send support to SDP. Below 2K transfer size, it is better to bcopy. With larger transfers, this is a net win on bandwidth. Latency testing is yet to be done. BCOPYBZCOPY 1K TCP_STREAM 3555 Mb/sec 2250 Mb/sec 2K TCP_STREAM 3650 Mb/sec 3785 Mb/sec 4K TCP_STREAM 3560 Mb/sec 6220 Mb/sec 8K TCP_STREAM 3555 Mb/sec 6190 Mb/sec 16K TCP_STREAM 5100 Mb/sec 6155 Mb/sec 1M TCP_STREAM 4630 Mb/sec 6210 Mb/sec Performance work still remains. Open issues include correct setsockopt defines (use previous SDP values?), code cleanup, performance tuning, rigorous regression testing, and multi-OS build+test. Simple testing to date includes netperf and iperf, ^C recovery, unload/load, and checking for gross memory leaks on Rhat4u4. Signed-off-by: Jim Mott [EMAIL PROTECTED] --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp.h === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp.h2007-10-08 08:20:57.0 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp.h 2007-10-08 08:31:41.0 -0500 @@ -50,6 +50,9 @@ extern int sdp_data_debug_level; #define SDP_HEAD_SIZE (PAGE_SIZE / 2 + sizeof(struct sdp_bsdh)) #define SDP_NUM_WC 4 +#define SDP_MIN_ZCOPY_THRESH1024 +#define SDP_MAX_ZCOPY_THRESH 1048576 + #define SDP_OP_RECV 0x8LL enum sdp_mid { @@ -70,6 +73,13 @@ enum { SDP_MIN_BUFS = 2 }; +enum { + SDP_ERR_ERROR = -4, + SDP_ERR_FAULT = -3, + SDP_NEW_SEG = -2, + SDP_DO_WAIT_MEM = -1 +}; + struct rdma_cm_id; struct rdma_cm_event; @@ -148,6 +158,9 @@ struct sdp_sock { int recv_frags; int send_frags; + /* ZCOPY data */ + int zcopy_thresh; + struct ib_sge ibsge[SDP_MAX_SEND_SKB_FRAGS + 1]; struct ib_wc ibwc[SDP_NUM_WC]; }; Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_main.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_main.c 2007-10-08 08:21:05.0 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_main.c2007-10-09 16:52:34.0 -0500 @@ -65,6 +65,16 @@ unsigned int csum_partial_copy_from_user #include sdp.h #include 
linux/delay.h +struct bzcopy_state { + unsigned char __user *u_base; + intu_len; + int left; + intpage_cnt; + intcur_page; + intcur_offset; + struct page **pages; +}; + MODULE_AUTHOR(Michael S. Tsirkin); MODULE_DESCRIPTION(InfiniBand SDP module); MODULE_LICENSE(Dual BSD/GPL); @@ -117,6 +127,10 @@ static int send_poll_thresh = 8192; module_param_named(send_poll_thresh, send_poll_thresh, int, 0644); MODULE_PARM_DESC(send_poll_thresh, Send message size thresh hold over which to start polling.); +static int sdp_zcopy_thresh = 0; +module_param_named(sdp_zcopy_thresh, sdp_zcopy_thresh, int, 0644); +MODULE_PARM_DESC(sdp_zcopy_thresh, Zero copy send threshold; 0=0ff.); + struct workqueue_struct *sdp_workqueue; static struct list_head sock_list; @@ -867,6 +881,12 @@ static int sdp_setsockopt(struct sock *s sdp_push_pending_frames(sk); } break; + case SDP_ZCOPY_THRESH: + if (val SDP_MIN_ZCOPY_THRESH || val SDP_MAX_ZCOPY_THRESH) + err = -EINVAL; + else + ssk-zcopy_thresh = val; + break; default: err = -ENOPROTOOPT; break; @@ -904,6 +924,9 @@ static int sdp_getsockopt(struct sock *s case TCP_CORK: val = !!(ssk-nonagleTCP_NAGLE_CORK); break; + case SDP_ZCOPY_THRESH: + val = ssk-zcopy_thresh ? ssk-zcopy_thresh : sdp_zcopy_thresh; + break; default: return -ENOPROTOOPT; } @@ -1051,10 +1074,252 @@ void sdp_push_one(struct sock *sk, unsig { } -/* Like tcp_sendmsg */ -/* TODO: check locking */ +static struct bzcopy_state *sdp_bz_cleanup(struct bzcopy_state *bz) +{ + int i; + + if (bz-pages) { + for (i = bz-cur_page; i bz-page_cnt; i++) + put_page(bz-pages[i]); + + kfree(bz-pages); + } + + kfree(bz); + + return NULL; +} + + +static struct bzcopy_state *sdp_bz_setup(struct sdp_sock *ssk, +unsigned char __user *base, +int len, +int size_goal) +{ + struct bzcopy_state *bz; + unsigned long addr; + unsigned long locked, locked_limit; + int done_pages; + int thresh; + + thresh = ssk-zcopy_thresh ? : sdp_zcopy_thresh; + if
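The thresholding in this patch has two pieces: setsockopt() validates SDP_ZCOPY_THRESH against the 1K–1M bounds, and the send path falls back to the module-wide sdp_zcopy_thresh when the socket has none (`thresh = ssk->zcopy_thresh ?: sdp_zcopy_thresh`). A hedged userspace model of that selection — the `>=` comparison and names are my reading of the patch, not a verified copy of the SDP code:

```c
#include <assert.h>

#define SDP_MIN_ZCOPY_THRESH 1024
#define SDP_MAX_ZCOPY_THRESH 1048576

static int sdp_zcopy_thresh_mod = 0;  /* module parameter stand-in; 0 = off */

/* Models the setsockopt(SDP_ZCOPY_THRESH) bounds check. */
static int set_zcopy_thresh(int *sock_thresh, int val)
{
	if (val < SDP_MIN_ZCOPY_THRESH || val > SDP_MAX_ZCOPY_THRESH)
		return -1;               /* -EINVAL in the real code */
	*sock_thresh = val;
	return 0;
}

/* Per-socket value wins; otherwise fall back to the module parameter. */
static int effective_thresh(int sock_thresh)
{
	return sock_thresh ? sock_thresh : sdp_zcopy_thresh_mod;
}

/* Zero-copy only when a threshold is configured and the transfer is at
 * least that large; below it, bcopy wins (see the numbers above). */
static int use_zcopy(int sock_thresh, int len)
{
	int thresh = effective_thresh(sock_thresh);

	return thresh != 0 && len >= thresh;
}
```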
[ofa-general] [PATCH 1/4] IPoIB: Fix unused variable warning
The conversion to use netdevice internal stats left an unused variable in ipoib_neigh_free(), since there's no longer any reason to get netdev_priv() in order to increment dropped packets. Delete the unused priv variable.

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 6b1b4b2..855c9de 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -854,7 +854,6 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
 void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct sk_buff *skb;
 	*to_ipoib_neigh(neigh->neighbour) = NULL;
 	while ((skb = __skb_dequeue(&neigh->queue))) {
[ofa-general] [PATCH 2/4] ibm_emac: Convert to use napi_struct independent of struct net_device
Commit da3dedd9 ([NET]: Make NAPI polling independent of struct net_device objects.) changed the interface to NAPI polling. Fix up the ibm_emac driver so that it works with this new interface. This is actually a nice cleanup because ibm_emac is one of the drivers that wants to have multiple NAPI structures for a single net_device. Tested with the internal MAC of a PowerPC 440SPe SoC with an AMCC 'Yucca' evaluation board. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/net/ibm_emac/ibm_emac_mal.c | 48 -- drivers/net/ibm_emac/ibm_emac_mal.h |2 +- include/linux/netdevice.h | 10 +++ 3 files changed, 28 insertions(+), 32 deletions(-) diff --git a/drivers/net/ibm_emac/ibm_emac_mal.c b/drivers/net/ibm_emac/ibm_emac_mal.c index cabd984..cc3ddc9 100644 --- a/drivers/net/ibm_emac/ibm_emac_mal.c +++ b/drivers/net/ibm_emac/ibm_emac_mal.c @@ -207,10 +207,10 @@ static irqreturn_t mal_serr(int irq, void *dev_instance) static inline void mal_schedule_poll(struct ibm_ocp_mal *mal) { - if (likely(netif_rx_schedule_prep(mal-poll_dev))) { + if (likely(napi_schedule_prep(mal-napi))) { MAL_DBG2(%d: schedule_poll NL, mal-def-index); mal_disable_eob_irq(mal); - __netif_rx_schedule(mal-poll_dev); + __napi_schedule(mal-napi); } else MAL_DBG2(%d: already in poll NL, mal-def-index); } @@ -273,11 +273,11 @@ static irqreturn_t mal_rxde(int irq, void *dev_instance) return IRQ_HANDLED; } -static int mal_poll(struct net_device *ndev, int *budget) +static int mal_poll(struct napi_struct *napi, int budget) { - struct ibm_ocp_mal *mal = ndev-priv; + struct ibm_ocp_mal *mal = container_of(napi, struct ibm_ocp_mal, napi); struct list_head *l; - int rx_work_limit = min(ndev-quota, *budget), received = 0, done; + int received = 0; MAL_DBG2(%d: poll(%d) %d - NL, mal-def-index, *budget, rx_work_limit); @@ -295,38 +295,34 @@ static int mal_poll(struct net_device *ndev, int *budget) list_for_each(l, mal-poll_list) { struct mal_commac *mc = list_entry(l, struct mal_commac, poll_list); - int n = 
mc-ops-poll_rx(mc-dev, rx_work_limit); + int n = mc-ops-poll_rx(mc-dev, budget); if (n) { received += n; - rx_work_limit -= n; - if (rx_work_limit = 0) { - done = 0; + budget -= n; + if (budget = 0) goto more_work; // XXX What if this is the last one ? - } } } /* We need to disable IRQs to protect from RXDE IRQ here */ local_irq_disable(); - __netif_rx_complete(ndev); + __napi_complete(napi); mal_enable_eob_irq(mal); local_irq_enable(); - done = 1; - /* Check for rotting packet(s) */ list_for_each(l, mal-poll_list) { struct mal_commac *mc = list_entry(l, struct mal_commac, poll_list); if (unlikely(mc-ops-peek_rx(mc-dev) || mc-rx_stopped)) { MAL_DBG2(%d: rotting packet NL, mal-def-index); - if (netif_rx_reschedule(ndev, received)) + if (napi_reschedule(napi)) mal_disable_eob_irq(mal); else MAL_DBG2(%d: already in poll list NL, mal-def-index); - if (rx_work_limit 0) + if (budget 0) goto again; else goto more_work; @@ -335,12 +331,8 @@ static int mal_poll(struct net_device *ndev, int *budget) } more_work: - ndev-quota -= received; - *budget -= received; - - MAL_DBG2(%d: poll() %d - %d NL, mal-def-index, *budget, -done ? 0 : 1); - return done ? 0 : 1; + MAL_DBG2(%d: poll() %d - %d NL, mal-def-index, budget, received); + return received; } static void mal_reset(struct ibm_ocp_mal *mal) @@ -425,11 +417,8 @@ static int __init mal_probe(struct ocp_device *ocpdev) mal-def = ocpdev-def; INIT_LIST_HEAD(mal-poll_list); - set_bit(__LINK_STATE_START, mal-poll_dev.state); - mal-poll_dev.weight = CONFIG_IBM_EMAC_POLL_WEIGHT; - mal-poll_dev.poll = mal_poll; - mal-poll_dev.priv = mal; - atomic_set(mal-poll_dev.refcnt, 1); + mal-napi.weight = CONFIG_IBM_EMAC_POLL_WEIGHT; + mal-napi.poll = mal_poll; INIT_LIST_HEAD(mal-list); @@ -520,11 +509,8 @@ static void __exit mal_remove(struct ocp_device *ocpdev) MAL_DBG(%d: remove NL,
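The heart of this conversion is the new NAPI poll contract: mal_poll() now receives a plain `budget`, spreads it across its channels (mal_commac entries), and returns the number of packets received; the caller keeps polling while the budget gets exhausted. A simplified, self-contained userspace model of that budget accounting (stand-in types; the real code also handles rotting packets and IRQ re-enable, omitted here):

```c
#include <assert.h>

/* Stand-in for one mal_commac receive source. */
struct model_chan { int pending; };

/* Models the new-style poll: consume up to `budget` packets across all
 * channels and return how many were received. If the budget is not
 * exhausted, the real code would __napi_complete() and re-enable the
 * end-of-buffer interrupt before returning. */
static int model_poll(struct model_chan *chans, int nchans, int budget)
{
	int i, received = 0;

	for (i = 0; i < nchans && budget > 0; i++) {
		int n = chans[i].pending < budget ? chans[i].pending : budget;

		chans[i].pending -= n;
		received += n;
		budget -= n;
	}
	return received;
}
```

Returning `received` (rather than the old done/not-done flag plus quota bookkeeping) is exactly what lets one napi_struct serve several channels independently of any net_device.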
[ofa-general] [PATCH 3/4] ibm_new_emac: Nuke SET_MODULE_OWNER() use
Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/net/ibm_newemac/core.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/drivers/net/ibm_newemac/core.c b/drivers/net/ibm_newemac/core.c index ce127b9..8ea5009 100644 --- a/drivers/net/ibm_newemac/core.c +++ b/drivers/net/ibm_newemac/core.c @@ -2549,7 +2549,6 @@ static int __devinit emac_probe(struct of_device *ofdev, dev->ndev = ndev; dev->ofdev = ofdev; dev->blist = blist; - SET_MODULE_OWNER(ndev); SET_NETDEV_DEV(ndev, &ofdev->dev); /* Initialize some embedded data structures */ ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH 4/4] ibm_emac: Convert to use napi_struct independent of struct net_device
Commit da3dedd9 ([NET]: Make NAPI polling independent of struct net_device objects.) changed the interface to NAPI polling. Fix up the ibm_newemac driver so that it works with this new interface. This is actually a nice cleanup because ibm_newemac is one of the drivers that wants to have multiple NAPI structures for a single net_device. Compile-tested only as I don't have a system that uses the ibm_newemac driver. This conversion the conversion for the ibm_emac driver that was tested on real PowerPC 440SPe hardware. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/net/ibm_newemac/mal.c | 55 ++-- drivers/net/ibm_newemac/mal.h |2 +- 2 files changed, 20 insertions(+), 37 deletions(-) diff --git a/drivers/net/ibm_newemac/mal.c b/drivers/net/ibm_newemac/mal.c index c4335b7..5885411 100644 --- a/drivers/net/ibm_newemac/mal.c +++ b/drivers/net/ibm_newemac/mal.c @@ -235,10 +235,10 @@ static irqreturn_t mal_serr(int irq, void *dev_instance) static inline void mal_schedule_poll(struct mal_instance *mal) { - if (likely(netif_rx_schedule_prep(mal-poll_dev))) { + if (likely(napi_schedule_prep(mal-napi))) { MAL_DBG2(mal, schedule_poll NL); mal_disable_eob_irq(mal); - __netif_rx_schedule(mal-poll_dev); + __napi_schedule(mal-napi); } else MAL_DBG2(mal, already in poll NL); } @@ -318,8 +318,7 @@ void mal_poll_disable(struct mal_instance *mal, struct mal_commac *commac) msleep(1); /* Synchronize with the MAL NAPI poller. */ - while (test_bit(__LINK_STATE_RX_SCHED, mal-poll_dev.state)) - msleep(1); + napi_disable(mal-napi); } void mal_poll_enable(struct mal_instance *mal, struct mal_commac *commac) @@ -330,11 +329,11 @@ void mal_poll_enable(struct mal_instance *mal, struct mal_commac *commac) // XXX might want to kick a poll now... 
} -static int mal_poll(struct net_device *ndev, int *budget) +static int mal_poll(struct napi_struct *napi, int budget) { - struct mal_instance *mal = netdev_priv(ndev); + struct mal_instance *mal = container_of(napi, struct mal_instance, napi); struct list_head *l; - int rx_work_limit = min(ndev-quota, *budget), received = 0, done; + int received = 0; unsigned long flags; MAL_DBG2(mal, poll(%d) %d - NL, *budget, @@ -358,26 +357,21 @@ static int mal_poll(struct net_device *ndev, int *budget) int n; if (unlikely(test_bit(MAL_COMMAC_POLL_DISABLED, mc-flags))) continue; - n = mc-ops-poll_rx(mc-dev, rx_work_limit); + n = mc-ops-poll_rx(mc-dev, budget); if (n) { received += n; - rx_work_limit -= n; - if (rx_work_limit = 0) { - done = 0; - // XXX What if this is the last one ? - goto more_work; - } + budget -= n; + if (budget = 0) + goto more_work; // XXX What if this is the last one ? } } /* We need to disable IRQs to protect from RXDE IRQ here */ spin_lock_irqsave(mal-lock, flags); - __netif_rx_complete(ndev); + __napi_complete(napi); mal_enable_eob_irq(mal); spin_unlock_irqrestore(mal-lock, flags); - done = 1; - /* Check for rotting packet(s) */ list_for_each(l, mal-poll_list) { struct mal_commac *mc = @@ -387,12 +381,12 @@ static int mal_poll(struct net_device *ndev, int *budget) if (unlikely(mc-ops-peek_rx(mc-dev) || test_bit(MAL_COMMAC_RX_STOPPED, mc-flags))) { MAL_DBG2(mal, rotting packet NL); - if (netif_rx_reschedule(ndev, received)) + if (napi_reschedule(napi)) mal_disable_eob_irq(mal); else MAL_DBG2(mal, already in poll list NL); - if (rx_work_limit 0) + if (budget 0) goto again; else goto more_work; @@ -401,13 +395,8 @@ static int mal_poll(struct net_device *ndev, int *budget) } more_work: - ndev-quota -= received; - *budget -= received; - - MAL_DBG2(mal, poll() %d - %d NL, *budget, -done ? 0 : 1); - - return done ? 
0 : 1; + MAL_DBG2(mal, poll() %d - %d NL, budget, received); + return received; } static void mal_reset(struct mal_instance *mal) @@ -538,11 +527,8 @@ static int __devinit mal_probe(struct of_device *ofdev, }
Re: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch
I skipped over most of the code restructuring comments and focus mainly on design or issues. (Although code restructuring patches tend not to be written or easily accepted unless they fix a bug, and I would personally like to see at least some of the ones previously mentioned addressed before this code is merged. The ones listed below should be trivial to incorporate before merging.) This version incorporates some of Sean's comments, especially relating to locking. Sean's comments regarding module parameters, code restructure, ipoib_cm_rx state and the like will require more discussion and subsequent testing. They will be addressed with additional set of patches later on. This patch has been tested with linux-2.6.23-rc5 derived from Roland's for-2.6.24 git tree on ppc64 machines using IBM HCA. Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED] --- --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-31 12:14:30.0 -0500 +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 14:31:07.0 -0500 @@ -95,11 +95,13 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, }; +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) #defineIPOIB_OP_RECV (1ul 31) + #ifdef CONFIG_INFINIBAND_IPOIB_CM -#defineIPOIB_CM_OP_SRQ (1ul 30) +#defineIPOIB_CM_OP_RECV (1ul 30) #else -#defineIPOIB_CM_OP_SRQ (0) +#defineIPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -166,11 +168,14 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp*qp; - struct list_head list; - struct net_device *dev; - unsigned longjiffies; + struct ib_cm_id *id; + struct ib_qp*qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by no srq only */ + struct list_head list; + struct net_device *dev; + unsigned longjiffies; + u32 index; /* wr_ids are distinguished by index +* to identify the QP -no srq only */ enum ipoib_cm_state state; }; @@ -215,6 +220,8 @@ struct ipoib_cm_dev_priv { struct ib_wcibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + 
struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* @@ -438,6 +445,7 @@ void ipoib_drain_cq(struct net_device *d /* We don't support UC connections at the moment */ #define IPOIB_CM_SUPPORTED(ha) (ha[0] (IPOIB_FLAGS_RC)) +extern int max_rc_qp; static inline int ipoib_cm_admin_enabled(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-31 12:14:30.0 -0500 +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-18 17:04:06.0 -0500 @@ -49,6 +49,18 @@ MODULE_PARM_DESC(cm_data_debug_level, #include ipoib.h +int max_rc_qp = 128; +static int max_recv_buf = 1024; /* Default is 1024 MB */ + +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0444); +MODULE_PARM_DESC(nosrq_max_rc_qp, Max number of no srq RC QPs supported; must be a power of 2); I thought you were going to remove the power of 2 restriction. And to re-start this discussion, I think we should separate the maximum number of QPs from whether we use SRQ, and let the QP type (UD, UC, RC) be controllable. Smaller clusters may perform better without using SRQ, even if it is available. And supporting UC versus RC seems like it should only take a few additional lines of code. +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); +MODULE_PARM_DESC(max_receive_buffer, Max Receive Buffer Size in MB); Based on your response to my feedback, it sounds like the only reason we're keeping this parameter around is in case the admin sets some of the other values (max QPs, message size, RQ size) incorrectly. I agree with Roland that we need to come up with the correct user interface here, and I'm not convinced that what we have is the most adaptable for where the code could go. What about replacing the 2 proposed parameters with these 3? 
qp_type - ud, uc, rc use_srq - yes/no (default if available) max_conn_qp - uc or rc limit + +static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for no srq */ + +#define NOSRQ_INDEX_MASK (max_rc_qp -1) Just reserve lower bits of the wr_id for the rx_table to avoid the power of 2 restriction. #define IPOIB_CM_IETF_ID 0x1000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -81,20 +93,21 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv-ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int
Re: [ofa-general] [PATCH 4/4] ibm_emac: Convert to use napi_struct independent of struct net_device
Sorry... wrong subject here; it should have been "ibm_newemac: ..." - R.
[ofa-general] Re: [PATCH 1/4] IPoIB: Fix unused variable warning
From: Roland Dreier [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 15:46:13 -0700 The conversion to use netdevice internal stats left an unused variable in ipoib_neigh_free(), since there's no longer any reason to get netdev_priv() in order to increment dropped packets. Delete the unused priv variable. Signed-off-by: Roland Dreier [EMAIL PROTECTED] Jeff, do you want to merge in Roland's 4 patches to your tree then do a sync with me so I can pull it all in from you? Alternatively, I can merge in Roland's work directly if that's easier for you. Just let me know.
Re: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch
Sean, Roland, I looked through Sean's latest comments. Yes, they are fairly easy to fix and I will fix them. The only one that might need some debate is the one associated with module parameters. In previous communications with Roland I got the impression that he wants to keep them (module parameters) at a minimum. So, how do we address that now? Last time around (after Sean's comments) I just addressed the bugs and skipped the rest since I had no idea as to how much time I had for the merge. These days I do not have exclusive access to the machines with IB adapters, limiting the work I can do at a stretch. How much time do I have before this gets merged into the 2.6.24 tree? Other than the module parameters one I should be able to address the rest either by this evening (west coast US) or maybe in the morning/afternoon tomorrow. Will that be acceptable? Pradeep
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: jamal [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 17:56:46 -0400 if the h/ware queues are full because of link pressure etc, you drop. We drop today when the s/ware queues are full. The driver txmit lock takes place of the qdisc queue lock etc. I am assuming there is still need for that locking. The filter/classification scheme still works as is and select classes which map to rings. tc still works as is etc. I understand your suggestion. We have to keep in mind, however, that the sw queue right now is 1000 packets. I heavily discourage any driver author to try and use any single TX queue of that size. Which means that just dropping on back pressure might not work so well. Or it might be perfect and signal TCP to backoff, who knows! :-) While working out this issue in my mind, it occurred to me that we can put the sw queue into the driver as well. The idea is that the network stack, as in the pure hw queue scheme, unconditionally always submits new packets to the driver. Therefore even if the hw TX queue is full, the driver can still queue to an internal sw queue with some limit (say 1000 for ethernet, as is used now). When the hw TX queue gains space, the driver self-batches packets from the sw queue to the hw queue. It sort of obviates the need for mid-level queue batching in the generic networking. Compared to letting the driver self-batch, the mid-level batching approach is pure overhead. We seem to be sort of all mentioning similar ideas. For example, you can get the above kind of scheme today by using a mid-level queue length of zero, and I believe this idea was mentioned by Stephen Hemminger earlier. I may experiment with this in the NIU driver.
Re: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch
And to re-start this discussion, I think we should separate the maximum number of QPs from whether we use SRQ, and let the QP type (UD, UC, RC) be controllable. Smaller clusters may perform better without using SRQ, even if it is available. And supporting UC versus RC seems like it should only take a few additional lines of code. Actually supporting UC is trickier than it seems, at least for the SRQ case, since attaching UC QPs to an SRQ requires that the IB spec be extended to allow that (and also define some semantics for how to handle messages that encounter an error in the middle of being received, after a work request has been taken from the SRQ). I agree with Roland that we need to come up with the correct user interface here, and I'm not convinced that what we have is the most adaptable for where the code could go. What about replacing the 2 proposed parameters with these 3? qp_type - ud, uc, rc use_srq - yes/no (default if available) max_conn_qp - uc or rc limit I don't think we want the qp_type to be a module parameter -- it seems we already have ud vs. rc handled via the parameter that enables connected mode, and if we want to enable uc we should do that in a similar per-interface way. Similarly if there's any point to making use_srq something that can be controlled, ideally it should be per-interface. But this could be tricky because it may be hard to change at runtime. (Ideally max_conn_qp would be per-interface too but that seems too hard as well) I do agree that the memory limit just seems arbitrary and we can probably do away with that. - R.
[ofa-general] Re: [PATCH 1/4] IPoIB: Fix unused variable warning
David Miller wrote: From: Roland Dreier [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 15:46:13 -0700 The conversion to use netdevice internal stats left an unused variable in ipoib_neigh_free(), since there's no longer any reason to get netdev_priv() in order to increment dropped packets. Delete the unused priv variable. Signed-off-by: Roland Dreier [EMAIL PROTECTED] Jeff, do you want to merge in Roland's 4 patches to your tree then do a sync with me so I can pull it all in from you? Grabbing them now...
[ofa-general] Re: [PATCH 1/4] IPoIB: Fix unused variable warning
Roland Dreier wrote: The conversion to use netdevice internal stats left an unused variable in ipoib_neigh_free(), since there's no longer any reason to get netdev_priv() in order to increment dropped packets. Delete the unused priv variable. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) applied 1-4
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Andi Kleen [EMAIL PROTECTED] Date: Wed, 10 Oct 2007 02:37:16 +0200 On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote: We have to keep in mind, however, that the sw queue right now is 1000 packets. I heavily discourage any driver author to try and use any single TX queue of that size. Why would you discourage them? If 1000 is ok for a software queue why would it not be ok for a hardware queue? Because with the software queue, you aren't accessing 1000 slots shared with the hardware device which does shared-ownership transactions on those L2 cache lines with the cpu. Long ago I did a test on gigabit on a cpu with only 256K of L2 cache. Using a smaller TX queue make things go faster, and it's exactly because of these L2 cache effects. 1000 packets is a lot. I don't have hard data, but gut feeling is less would also do. I'll try to see how backlogged my 10Gb tests get when a strong sender is sending to a weak receiver. And if the hw queues are not enough a better scheme might be to just manage this in the sockets in sendmsg. e.g. provide a wait queue that drivers can wake up and let them block on more queue. TCP does this already, but it operates in a lossy manner. I don't really see the advantage over the qdisc in that scheme. It's certainly not simpler and probably more code and would likely also not require less locks (e.g. a currently lockless driver would need a new lock for its sw queue). Also it is unclear to me it would be really any faster. You still need a lock to guard hw TX enqueue from hw TX reclaim. A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you increase the size much more performance starts to go down due to L2 cache thrashing.
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote: We have to keep in mind, however, that the sw queue right now is 1000 packets. I heavily discourage any driver author to try and use any single TX queue of that size. Why would you discourage them? If 1000 is ok for a software queue why would it not be ok for a hardware queue? Which means that just dropping on back pressure might not work so well. Or it might be perfect and signal TCP to backoff, who knows! :-) 1000 packets is a lot. I don't have hard data, but gut feeling is less would also do. And if the hw queues are not enough a better scheme might be to just manage this in the sockets in sendmsg. e.g. provide a wait queue that drivers can wake up and let them block on more queue. The idea is that the network stack, as in the pure hw queue scheme, unconditionally always submits new packets to the driver. Therefore even if the hw TX queue is full, the driver can still queue to an internal sw queue with some limit (say 1000 for ethernet, as is used now). When the hw TX queue gains space, the driver self-batches packets from the sw queue to the hw queue. I don't really see the advantage over the qdisc in that scheme. It's certainly not simpler and probably more code and would likely also not require less locks (e.g. a currently lockless driver would need a new lock for its sw queue). Also it is unclear to me it would be really any faster. -Andi
Re: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch
I do agree that the memory limit just seems arbitrary and we can probably do away with that. We discussed this previously and had agreed upon limiting the memory footprint to 1GB by default. This module parameter was for larger systems that had plenty of memory and could afford to use more. This way the sys admin could increase the limit. Hence I am not really in favour of removing this. Pradeep
Re: [ofa-general] SDP ?
Hi Jim, Thanks, tried early on with -D_XPG4_2, things went from bad to worse, I'll look at switching from int to void. Jim // Jim Mott wrote: That should work fine. You might be able to build with -D_XPG4_2 as well. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Langston Sent: Tuesday, October 09, 2007 10:13 AM To: general@lists.openfabrics.org Subject: [ofa-general] SDP ? Hi all, I'm working on porting SDP to OpenSolaris and am looking at a compile error that I get. Essentially, I have a conflict of types on the compile: bash-3.00$ /opt/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I.. -g -D_POSIX_PTHREAD_SEMANTICS -DSYSCONFDIR=\/usr/local/etc\ -g -D_POSIX_PTHREAD_SEMANTICS -c port.c -KPIC -DPIC -o .libs/port.o port.c, line 1896: identifier redeclared: getsockname current : function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to unsigned int) returning int previous: function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to void) returning int : /usr/include/sys/socket.h, line 436 Line 436 in /usr/include/sys/socket.h extern int getsockname(int, struct sockaddr *_RESTRICT_KYWD, Psocklen_t); and Psocklen_t #if defined(_XPG4_2) || defined(_BOOT) typedef socklen_t *_RESTRICT_KYWD Psocklen_t; #else typedef void*_RESTRICT_KYWD Psocklen_t; #endif /* defined(_XPG4_2) || defined(_BOOT) */ Do I need to change port.c getsockname to type void * ? Thanks, Jim ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- / Jim Langston Sun Microsystems, Inc. 
(877) 854-5583 (AccessLine) AIM: jl9594 [EMAIL PROTECTED]
[ofa-general] Re: [PATCH 01/23] IB/ipath -- iba6110 rev4 GPIO counters support
hi roland, i didn't realize it was such a PITA for you to take so many at once. i'll make sure to do them in smaller chunks from now on. thanks for taking these... arthur On Tue, Oct 09, 2007 at 02:55:31PM -0700, Roland Dreier wrote: OK, I'll grudgingly merge these patch, even though they all arrived on the exact day that Linus released 2.6.23... but you guys really need to fix your development process so you don't accumulate a huge bolus of patches that you then vomit out. In the future I'm not going to accept giant merges like this -- please send your patches as soon as you've accumulated say 5 or 10. - R.
[ofa-general] Re: [PATCH 01/23] IB/ipath -- iba6110 rev4 GPIO counters support
hi roland, i didn't realize it was such a PITA for you to take so many at once. i'll make sure to do them in smaller chunks from now on. Thanks. The reason it's a pain is that it's a lot harder to review a ton of patches when they come late like this. Just send the patches as you write them and you have less of a queue to worry about and I can manage my queue a lot better. - R.
[ofa-general] IPoIB CM (NOSRQ) [Patch V9] revised
This revised version incorporates Sean's comments. The module parameters are unchanged except the restriction on max_rc_qp (that it should be power of 2) has been removed. This patch has been tested with linux-2.6.23-rc7 (derived from Roland's for-2.6.24 git tree) on ppc64 machines using IBM HCA. Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED] --- --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-03 12:01:58.0 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-09 19:42:51.0 -0500 @@ -69,6 +69,7 @@ enum { IPOIB_TX_RING_SIZE= 64, IPOIB_MAX_QUEUE_SIZE = 8192, IPOIB_MIN_QUEUE_SIZE = 2, + IPOIB_MAX_RC_QP = 4096, IPOIB_NUM_WC = 4, @@ -95,11 +96,13 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, }; +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) #defineIPOIB_OP_RECV (1ul 31) + #ifdef CONFIG_INFINIBAND_IPOIB_CM -#defineIPOIB_CM_OP_SRQ (1ul 30) +#defineIPOIB_CM_OP_RECV (1ul 30) #else -#defineIPOIB_CM_OP_SRQ (0) +#defineIPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -186,11 +189,14 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp*qp; - struct list_head list; - struct net_device *dev; - unsigned longjiffies; + struct ib_cm_id *id; + struct ib_qp*qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by no srq only */ + struct list_head list; + struct net_device *dev; + unsigned longjiffies; + u32 index; /* wr_ids are distinguished by index +* to identify the QP -no srq only */ enum ipoib_cm_state state; }; @@ -235,6 +241,8 @@ struct ipoib_cm_dev_priv { struct ib_wcibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* @@ -458,6 +466,7 @@ void ipoib_drain_cq(struct net_device *d /* We don't support UC connections at the moment */ #define IPOIB_CM_SUPPORTED(ha) (ha[0] (IPOIB_FLAGS_RC)) +extern int max_rc_qp; static inline int ipoib_cm_admin_enabled(struct net_device *dev) 
{ struct ipoib_dev_priv *priv = netdev_priv(dev); --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-31 12:14:30.0 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-09 21:15:25.0 -0500 @@ -49,6 +49,18 @@ MODULE_PARM_DESC(cm_data_debug_level, #include ipoib.h +int max_rc_qp = 128; +static int max_recv_buf = 1024; /* Default is 1024 MB */ + +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0444); +MODULE_PARM_DESC(nosrq_max_rc_qp, Max number of no srq RC QPs supported); + +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); +MODULE_PARM_DESC(max_receive_buffer, Max Receive Buffer Size in MB); + +static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for no srq */ + +#define NOSRQ_INDEX_MASK (0xfff) /* This corresponds to a max of 4096 QPs for no srq */ #define IPOIB_CM_IETF_ID 0x1000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -81,20 +93,21 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv-ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int post_receive_srq(struct net_device *dev, u64 id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; int i, ret; - priv-cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv-cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; for (i = 0; i IPOIB_CM_RX_SG; ++i) priv-cm.rx_sge[i].addr = priv-cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv-cm.srq, priv-cm.rx_wr, bad_wr); if (unlikely(ret)) { - ipoib_warn(priv, post srq failed for buf %d (%d)\n, id, ret); + ipoib_warn(priv, post srq failed for buf %lld (%d)\n, + (unsigned long long)id, ret); ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, priv-cm.srq_ring[id].mapping); dev_kfree_skb_any(priv-cm.srq_ring[id].skb); @@ -104,12 +117,47 @@ static int ipoib_cm_post_receive(struct return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static int 
post_receive_nosrq(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + u32 index;
[ofa-general] nightly osm_sim report 2007-10-10:normal completion
OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-09 OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: