Re: [ofa-general] Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
Jay Vosburgh wrote:
> Jeff Garzik <[EMAIL PROTECTED]> wrote:
>> Moni Shoua wrote:
>>> Jay Vosburgh wrote:
>>>> ACK patches 3 - 9. Roland, are you comfortable with the IB changes
>>>> in patches 1 and 2? Jeff, when Roland acks patches 1 and 2, please
>>>> apply all 9.
>>>> 	-J
>>> Hi Jeff,
>>> Roland acked the IPoIB patches. If you haven't done so already, can
>>> you please apply them? I'm not sure when 2.6.24 is going to open and
>>> I'm afraid of missing it.
>> hrm, I don't see them in my inbox for some reason. Can someone bounce
>> them to me? Or give me a git tree to pull from?
> Moni, can you repost the patch series to Jeff, and put the appropriate
> Acked-by lines in for myself (patches 3 - 8) and Roland (patches 1 and
> 2)? You can probably leave off the netdev and openfabrics lists, but
> cc me.
> 	-J
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]

Hi Jeff,
I don't see commits of the patches in
http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=summary
(I hope that I'm looking in the right place). Did you get them?

thanks
MoniS

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] question regarding umad_recv
Hi,

This is regarding the *umad_recv* function in the libibumad/src/umad.c
file. Is it not possible to receive only MADs of a specific type, GSI or
SMI? My impression is that if I have two separate threads to send and
receive, then I could send MADs to either QP 0 or QP 1 depending on
whether the MAD is SMI or GSI, but the receive side has no control over
this. Please suggest a workaround if there is one.

Thanks and Regards
sumit
[ofa-general] Re: [PATCHES] TX batching
J Hadi Salim <[EMAIL PROTECTED]> wrote on 10/08/2007 07:35:20 PM:

> I dont see something from Krishna's approach that i can take and reuse.
> This maybe because my old approaches have evolved from the same path.
> There is a long list, but as a sample: i used to do a lot more work
> while holding the queue lock, which i have now moved post queue lock;
> i dont have any special interfaces/tricks just for batching, i provide
> hints to the core of how much the driver can take, etc etc.
> I have offered Krishna co-authorship if he makes the IPOIB driver work
> on my patches; that offer still stands if he chooses to take it.

My feeling is that since the approaches are very different, it would be a
good idea to test the two for performance. Do you mind me doing that? Of
course, others and/or you are more than welcome to do the same. I had
sent a note to you yesterday about this, please let me know either way.

*** Previous mail ***

Hi Jamal,

If you don't mind, I am trying to run your approach vs mine to get some
results for comparison.

For starters, I am having issues with iperf when using your
infrastructure code with my IPoIB driver - about 100MB is sent and then
everything stops for some reason. The changes I made in the IPoIB driver
to support batching are to set BTX, set xmit_win, and dynamically reduce
xmit_win on every xmit and increase xmit_win on every xmit completion.
Is there anything else that is required from the driver?

thanks,

- KK
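KK's driver-side changes above boil down to simple window accounting: shrink the advertised window on each posted skb, grow it on each TX completion. A minimal standalone sketch of that bookkeeping (the struct and function names here are illustrative, not the actual batching patch API):

```c
#include <assert.h>

/* Hypothetical names: a sketch of the xmit_win accounting described
 * above, not the real batching patch interface. */
struct batch_dev {
	int xmit_win;	/* how many skbs the core may hand us in one batch */
};

/* Called once per posted skb: shrink the advertised window. */
void xmit_win_consume(struct batch_dev *dev)
{
	if (dev->xmit_win > 0)
		dev->xmit_win--;
}

/* Called per TX completion: a descriptor was reclaimed, grow the
 * window back toward the ring size. */
void xmit_win_replenish(struct batch_dev *dev, int max_win)
{
	if (dev->xmit_win < max_win)
		dev->xmit_win++;
}
```

The core would then read xmit_win before dequeueing a batch, which is the "hint of how much the driver can take" Jamal mentions.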
Re: [ofa-general] [PATCH 13 of 17]: add LRO support
Eli Cohen wrote:
>> Since you have posted the patch, I am asking you if it has any
>> negative influence on packet forwarding. I am not asking you to test
>> it or whether you tested it with forwarding.
>
> The answer is yes: since I do not recalculate the TCP checksum as I
> aggregate the SKBs, the kernel might forward the TCP segment as
> multiple IP packets but with a wrong TCP checksum (which is that of
> the first aggregated packet, not of the overall aggregated segment).

OK, thanks for this clarification.

Can you clarify if/how this patch is related to the "lro: Generic Large
Receive Offload for TCP traffic" RFC sent to netdev in August this year
(e.g. see http://lwn.net/Articles/244206)?

Assuming LRO is a --pure software-- optimization, what's the rationale
for putting its whole implementation in the ipoib driver rather than
dividing it into a general part implemented in the net core and a
per-driver part implemented in each device driver that wants to support
LRO (if such a second part is needed at all)?

If I am wrong and there is some LRO assistance from the ConnectX HW,
what is it doing?

Or.
[ofa-general] [PATCH 1/3] osm: QoS- bug in opening policy file
Fixing bug in opening QoS policy file

Signed-off-by: Yevgeny Kliteynik <[EMAIL PROTECTED]>
---
 opensm/opensm/osm_qos_parser.y |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y
index e0faaaf..8e9f282 100644
--- a/opensm/opensm/osm_qos_parser.y
+++ b/opensm/opensm/osm_qos_parser.y
@@ -50,6 +50,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <ctype.h>
+#include <errno.h>
 #include <sys/stat.h>
 #include <opensm/osm_opensm.h>
 #include <opensm/osm_qos_policy.h>
@@ -129,6 +130,7 @@ extern char * __qos_parser_text;
 extern void __qos_parser_error (char *s);
 extern int __qos_parser_lex (void);
 extern FILE * __qos_parser_in;
+extern int errno;
 
 #define RESET_BUFFER  __parser_tmp_struct_reset()
 
@@ -1750,13 +1752,13 @@ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn)
     osm_qos_policy_destroy(p_subn->p_qos_policy);
     p_subn->p_qos_policy = NULL;
 
-    if (!stat(p_subn->opt.qos_policy_file, &statbuf)) {
+    if (stat(p_subn->opt.qos_policy_file, &statbuf)) {
 
         if (strcmp(p_subn->opt.qos_policy_file, OSM_DEFAULT_QOS_POLICY_FILE)) {
             osm_log(p_qos_parser_osm_log, OSM_LOG_ERROR,
                     "osm_qos_parse_policy_file: ERR AC01: "
-                    "QoS policy file not found (%s)\n",
-                    p_subn->opt.qos_policy_file);
+                    "Failed opening QoS policy file %s - %s\n",
+                    p_subn->opt.qos_policy_file, strerror(errno));
             res = 1;
         }
     else
-- 
1.5.1.4
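The one-line inversion above is the whole bug: stat() returns 0 on success and non-zero on failure (setting errno), so the original `!stat(...)` test ran the error path when the file *was* found. A small standalone illustration of the corrected check (the helper name is hypothetical, not OpenSM code):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* stat() returns 0 on success and -1 on failure (setting errno), so
 * the error path must run when stat() is non-zero -- the opposite of
 * the "!stat(...)" test the patch removes. */
int qos_file_exists(const char *path)
{
	struct stat statbuf;

	if (stat(path, &statbuf)) {
		/* same message shape as the patched osm_log() call */
		fprintf(stderr, "Failed opening QoS policy file %s - %s\n",
			path, strerror(errno));
		return 0;
	}
	return 1;
}
```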
[ofa-general] [PATCH 2/3] osm: QoS - fixing memory leaks
Fixing a bunch of memory leaks and pointer mismatches in QoS.

Signed-off-by: Yevgeny Kliteynik <[EMAIL PROTECTED]>
---
 opensm/opensm/osm_qos_parser.l |   16 ++++++++++------
 opensm/opensm/osm_qos_parser.y |   15 ++++++++-------
 opensm/opensm/osm_qos_policy.c |   21 +++++++++++---------
 3 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/opensm/opensm/osm_qos_parser.l b/opensm/opensm/osm_qos_parser.l
index 0b096f8..60b2d1c 100644
--- a/opensm/opensm/osm_qos_parser.l
+++ b/opensm/opensm/osm_qos_parser.l
@@ -260,33 +260,41 @@ WHITE_DOTDOT_WHITE [ \t]*:[ \t]*
 
 -       { SAVE_POS;
-          __qos_parser_lval = strdup(__qos_parser_text);
           if (in_description || in_list_of_strings || in_single_string)
+          {
+              __qos_parser_lval = strdup(__qos_parser_text);
               return TK_TEXT;
+          }
           return TK_DASH;
         }
 
 :       { SAVE_POS;
-          __qos_parser_lval = strdup(__qos_parser_text);
           if (in_description || in_list_of_strings || in_single_string)
+          {
+              __qos_parser_lval = strdup(__qos_parser_text);
               return TK_TEXT;
+          }
           return TK_DOTDOT;
         }
 
 ,       { SAVE_POS;
-          __qos_parser_lval = strdup(__qos_parser_text);
           if (in_description)
+          {
+              __qos_parser_lval = strdup(__qos_parser_text);
               return TK_TEXT;
+          }
           return TK_COMMA;
         }
 
 \*      { SAVE_POS;
-          __qos_parser_lval = strdup(__qos_parser_text);
           if (in_description || in_list_of_strings || in_single_string)
+          {
+              __qos_parser_lval = strdup(__qos_parser_text);
               return TK_TEXT;
+          }
           return TK_ASTERISK;
         }

diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y
index 8e9f282..2405519 100644
--- a/opensm/opensm/osm_qos_parser.y
+++ b/opensm/opensm/osm_qos_parser.y
@@ -2105,15 +2105,15 @@ static void __sort_reduce_rangearr(
     unsigned last_valid_ind = 0;
     unsigned valid_cnt = 0;
     uint64_t ** res_arr;
-    boolean_t * is_valir_arr;
+    boolean_t * is_valid_arr;
 
     *p_res_arr = NULL;
     *p_res_arr_len = 0;
 
     qsort(arr, arr_len, sizeof(uint64_t*), __cmp_num_range);
 
-    is_valir_arr = (boolean_t *)malloc(arr_len * sizeof(boolean_t));
-    is_valir_arr[last_valid_ind] = TRUE;
+    is_valid_arr = (boolean_t *)malloc(arr_len * sizeof(boolean_t));
+    is_valid_arr[last_valid_ind] = TRUE;
     valid_cnt++;
 
     for (i = 1; i < arr_len; i++)
     {
@@ -2123,18 +2123,18 @@ static void __sort_reduce_rangearr(
             arr[last_valid_ind][1] = arr[i][1];
             free(arr[i]);
             arr[i] = NULL;
-            is_valir_arr[i] = FALSE;
+            is_valid_arr[i] = FALSE;
         }
         else if ((arr[i][0] - 1) == arr[last_valid_ind][1])
         {
             arr[last_valid_ind][1] = arr[i][1];
             free(arr[i]);
             arr[i] = NULL;
-            is_valir_arr[i] = FALSE;
+            is_valid_arr[i] = FALSE;
         }
         else
         {
-            is_valir_arr[i] = TRUE;
+            is_valid_arr[i] = TRUE;
             last_valid_ind = i;
             valid_cnt++;
         }
@@ -2143,9 +2143,10 @@ static void __sort_reduce_rangearr(
     res_arr = (uint64_t **)malloc(valid_cnt * sizeof(uint64_t *));
     for (i = 0; i < arr_len; i++)
     {
-        if (is_valir_arr[i])
+        if (is_valid_arr[i])
             res_arr[j++] = arr[i];
     }
+    free(is_valid_arr);
     free(arr);
 
     *p_res_arr = res_arr;

diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
index c84fb8b..51dd7b9 100644
--- a/opensm/opensm/osm_qos_policy.c
+++ b/opensm/opensm/osm_qos_policy.c
@@ -101,12 +101,6 @@ static void __free_single_element(void *p_element, void *context)
 	free(p_element);
 }
 
-static void __free_port_map_element(cl_map_item_t *p_element, void *context)
-{
-	if (p_element)
-		free(p_element);
-}
-
 /*** ***/
 
@@ -145,6 +139,9 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create()
 void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p)
 {
+	osm_qos_port_t * p_port;
+	osm_qos_port_t * p_old_port;
+
 	if (!p)
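The reduce step that the renamed is_valid_arr bookkeeping supports can be illustrated standalone: given ranges already sorted by lower bound, __sort_reduce_rangearr() merges each range that overlaps (or directly abuts, the `arr[i][0] - 1 == arr[last_valid_ind][1]` case) the previous surviving range. This sketch does the same without the OpenSM types (function name and in-place layout are illustrative):

```c
#include <stdint.h>

/* Merge presorted [low, high] ranges in place; ranges that overlap or
 * abut the previous surviving range extend it instead of surviving on
 * their own. Returns the number of surviving ranges. */
unsigned reduce_sorted_ranges(uint64_t arr[][2], unsigned n)
{
	unsigned i, last = 0;

	if (n == 0)
		return 0;
	for (i = 1; i < n; i++) {
		if (arr[i][0] <= arr[last][1] + 1) {
			/* overlapping or adjacent: extend previous range */
			if (arr[i][1] > arr[last][1])
				arr[last][1] = arr[i][1];
		} else {
			/* disjoint: start a new surviving range */
			arr[++last][0] = arr[i][0];
			arr[last][1] = arr[i][1];
		}
	}
	return last + 1;
}
```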
[ofa-general] [PATCH 3/3] osm: QoS - parsing port names
Added a CA-by-name hash to the QoS policy object; as port names are
parsed, this hash is used to locate the actual port that each name
refers to.
For now I prefer to keep this hash local, so it's part of the QoS policy
object. When the same parser is used for partitions too, this hash will
be moved to be part of the subnet object.

Signed-off-by: Yevgeny Kliteynik <[EMAIL PROTECTED]>
---
 opensm/include/opensm/osm_qos_policy.h |    3 +-
 opensm/opensm/osm_qos_parser.y         |   73 +++++++++++++++++++++----
 opensm/opensm/osm_qos_policy.c         |   36 +++++++++++----
 3 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h
index 30c2e6d..5c32896 100644
--- a/opensm/include/opensm/osm_qos_policy.h
+++ b/opensm/include/opensm/osm_qos_policy.h
@@ -49,6 +49,7 @@
 #include <iba/ib_types.h>
 #include <complib/cl_list.h>
+#include <opensm/st.h>
 #include <opensm/osm_port.h>
 #include <opensm/osm_partition.h>
 #include <opensm/osm_sa_path_record.h>
@@ -72,7 +73,6 @@ typedef struct _osm_qos_port_t {
 typedef struct _osm_qos_port_group_t {
 	char *name;			/* single string (this port group name) */
 	char *use;			/* single string (description) */
-	cl_list_t port_name_list;	/* list of port names (.../.../...) */
 	uint8_t node_types;		/* node types bitmask */
 	cl_qmap_t port_map;
 } osm_qos_port_group_t;
@@ -148,6 +148,7 @@ typedef struct _osm_qos_policy_t {
 	cl_list_t qos_match_rules;		/* list of osm_qos_match_rule_t */
 	osm_qos_level_t *p_default_qos_level;	/* default QoS level */
 	osm_subn_t *p_subn;			/* osm subnet object */
+	st_table * p_ca_hash;			/* hash of CAs by node description */
 } osm_qos_policy_t;
 
 /***/

diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y
index 2405519..cf342d3 100644
--- a/opensm/opensm/osm_qos_parser.y
+++ b/opensm/opensm/osm_qos_parser.y
@@ -603,23 +603,74 @@ port_group_use_start: TK_USE {
 port_group_port_name: port_group_port_name_start string_list {
                         /* 'port-name' in 'port-group' - any num of instances */
-                        cl_list_iterator_t list_iterator;
-                        char * tmp_str;
-
-                        list_iterator = cl_list_head(tmp_parser_struct.str_list);
-                        while( list_iterator != cl_list_end(tmp_parser_struct.str_list) )
+                        cl_list_iterator_t list_iterator;
+                        osm_node_t * p_node;
+                        osm_physp_t * p_physp;
+                        unsigned port_num;
+                        char * name_str;
+                        char * tmp_str;
+                        char * host_str;
+                        char * ca_str;
+                        char * port_str;
+                        char * node_desc = (char*)malloc(IB_NODE_DESCRIPTION_SIZE + 1);
+
+                        /* parsing port name strings */
+                        for (list_iterator = cl_list_head(tmp_parser_struct.str_list);
+                             list_iterator != cl_list_end(tmp_parser_struct.str_list);
+                             list_iterator = cl_list_next(list_iterator))
                         {
                             tmp_str = (char*)cl_list_obj(list_iterator);
+                            if (tmp_str && *tmp_str)
+                            {
+                                name_str = tmp_str;
+                                host_str = strtok(name_str, "/");
+                                ca_str = strtok(NULL, "/");
+                                port_str = strtok(NULL, "/");
 
-                            /*
-                             * TODO: parse port name strings
-                             */
+                                if (!host_str || !(*host_str) ||
+                                    !ca_str || !(*ca_str) ||
+                                    !port_str || !(*port_str) ||
+                                    (port_str[0] != 'p' && port_str[0] != 'P')) {
+                                    yyerror("illegal port name");
+                                    free(tmp_str);
+                                    free(node_desc);
+                                    cl_list_remove_all(&tmp_parser_struct.str_list);
+                                    return 1;
+                                }
+
+                                if (!(port_num = strtoul(&port_str[1], NULL, 0))) {
+                                    yyerror("illegal port number in port name");
+                                    free(tmp_str);
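The strtok()-based splitting the patch introduces can be shown standalone; this sketch (a hypothetical helper, no OpenSM types) parses the same "host/ca/P<num>" form and applies the same validity checks as the grammar action above:

```c
#include <stdlib.h>
#include <string.h>

/* Split "host/ca/P<num>" on '/' and read the port number after the
 * leading 'p'/'P'. Returns 0 on success; strtok() modifies buf, and
 * the out-pointers reference pieces of it. Illustrative only. */
int parse_port_name(char *buf, char **host, char **ca, unsigned *port)
{
	char *host_str = strtok(buf, "/");
	char *ca_str   = strtok(NULL, "/");
	char *port_str = strtok(NULL, "/");

	if (!host_str || !ca_str || !port_str ||
	    (port_str[0] != 'p' && port_str[0] != 'P'))
		return -1;
	*port = strtoul(&port_str[1], NULL, 0);
	if (!*port)
		return -1;	/* ports are numbered from 1 */
	*host = host_str;
	*ca = ca_str;
	return 0;
}
```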
[ofa-general] ofa_1_3_kernel 20071009-0200 daily build status
This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.22
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5

Failed:
Build failed on x86_64 with linux-2.6.9-22.ELsmp
Log:
Applying patch libiscsi_no_flush_to_2_6_9.patch
patching file drivers/scsi/libiscsi.c
Hunk #1 FAILED at 1225.
Hunk #2 succeeded at 1640 (offset 32 lines).
Hunk #3 FAILED at 1784.
2 out of 3 hunks FAILED -- rejects in file drivers/scsi/libiscsi.c
Patch libiscsi_no_flush_to_2_6_9.patch does not apply (enforce with -f)
Failed executing /usr/bin/quilt
--
Build failed on powerpc with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19'
make: *** [kernel] Error 2
--
Build failed on powerpc with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux
Re: [ofa-general] [PATCH 13 of 17]: add LRO support
> Can you clarify if/how this patch is related to the "lro: Generic
> Large Receive Offload for TCP traffic" RFC sent to netdev in August
> this year (e.g. see http://lwn.net/Articles/244206)?

I referred to the mtnic driver when I made this patch, which itself
referred to other code examples, possibly this one too.

> Assuming LRO is a --pure software-- optimization, what's the rationale
> for putting its whole implementation in the ipoib driver rather than
> dividing it into a general part implemented in the net core and a
> per-driver part implemented in each device driver that wants to
> support LRO (if such a second part is needed at all)?

It is a pure software optimization, but it relies on the HW to report
whether the checksum of the packet is valid in order for the packet to
be eligible for aggregation. I think it would be good, however, if the
kernel supported this generically and took it out of the specific
drivers.

> If I am wrong and there is some LRO assistance from the ConnectX HW,
> what is it doing?
>
> Or.
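Eli's earlier point about the stale checksum of an aggregated segment is a property of one's-complement arithmetic: the Internet checksum of concatenated payloads can be recomputed cheaply from the partial sums of the pieces (when the first piece has even length) rather than by re-walking every byte. A standalone sketch, not from the patch (helper names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit accumulator of 16-bit big-endian words (Internet checksum
 * style); an odd trailing byte is padded with zero. */
uint32_t csum_partial16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	return sum;
}

/* Fold carries back into 16 bits and complement. */
uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Folding the sum of the two partial sums gives the same checksum as summing the concatenation, which is what lets software LRO patch up the merged segment's TCP checksum instead of leaving the first packet's value in place.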
[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching
Hi Peter,

"Waskiewicz Jr, Peter P" <[EMAIL PROTECTED]> wrote on 10/09/2007 04:03:42 AM:

> > true, that needs some resolution. Heres a hand-waving thought:
> > Assuming all packets of a specific map end up in the same qdisc
> > queue, it seems feasible to ask the qdisc scheduler to give us enough
> > "packages" (ive seen people use that term to refer to packets) for
> > each hardware ring's available space. With the patches i posted, i do
> > that via dev->xmit_win that assumes only one view of the driver;
> > essentially a single ring. If that is doable, then it is up to the
> > driver to say "i have space for 5 in ring[0], 10 in ring[1], 0 in
> > ring[2]" based on what scheduling scheme the driver implements - the
> > dev->blist can stay the same. Its a handwave, so there may be issues
> > there and there could be better ways to handle this.
> > Note: The other issue that needs resolving that i raised earlier was
> > in regards to multiqueue running on multiple cpus servicing different
> > rings concurrently.
>
> I can see the qdisc being modified to send batches per queue_mapping.
> This shouldn't be too difficult, and if we had the xmit_win per queue
> (in the subqueue struct like Dave pointed out).

I hope my understanding of multiqueue is correct for this mail to make
sense :-)

Isn't it enough that the multiqueue+batching drivers handle skbs
belonging to different queues themselves, instead of the qdisc having to
figure that out? This will reduce costs for most skbs that are neither
batched nor sent to multiqueue devices.

Eg, the driver can keep processing skbs and put them on the correct
tx_queue as long as the mapping remains the same. If the mapping changes,
it posts the earlier skbs (with the correct lock) and then iterates over
the other skbs that have the next different mapping, and so on. (This is
required only if the driver is supposed to transmit >1 skb in one call,
otherwise it is not an issue.)

Alternatively, supporting drivers could return a different code on a
mapping change, like NETDEV_TX_MAPPING_CHANGED (for batching only), so
that qdisc_run() could retry. Would that work?

Secondly, having xmit_win per queue: would it help in the multiple-skb
case? Currently there is no way to tell the qdisc to dequeue skbs from a
particular band - it returns the skb from the highest priority band.

thanks,

- KK
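The per-driver grouping described above - post a run of same-mapping skbs, flush when queue_mapping changes - can be sketched outside the kernel like this (all names here are illustrative, not kernel API; flush is represented by a counter so the run boundaries are visible):

```c
#include <stddef.h>

/* Stand-in for sk_buff: only the field this sketch needs. */
struct fake_skb {
	int queue_mapping;
};

/* Walk the batch; each time queue_mapping changes (or the batch ends),
 * "flush" the run skbs[start..i-1] to ring skbs[start].queue_mapping
 * under that ring's lock. Returns the number of flushes, i.e. how many
 * lock-acquire/post cycles the driver would do for this batch. */
int batch_by_mapping(const struct fake_skb *skbs, size_t n)
{
	int flushes = 0;
	size_t start = 0, i;

	for (i = 1; i <= n; i++) {
		if (i == n ||
		    skbs[i].queue_mapping != skbs[start].queue_mapping) {
			/* post skbs[start..i-1] to its ring here */
			flushes++;
			start = i;
		}
	}
	return flushes;
}
```

A batch whose skbs all share one mapping costs a single flush, which is the "reduce costs for most skbs" argument in the mail.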
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Krishna Kumar2 <[EMAIL PROTECTED]>
Date: Tue, 9 Oct 2007 16:28:27 +0530

> Isn't it enough that the multiqueue+batching drivers handle skbs
> belonging to different queues themselves, instead of the qdisc having
> to figure that out? This will reduce costs for most skbs that are
> neither batched nor sent to multiqueue devices.
>
> Eg, the driver can keep processing skbs and put them on the correct
> tx_queue as long as the mapping remains the same. If the mapping
> changes, it posts the earlier skbs (with the correct lock) and then
> iterates over the other skbs that have the next different mapping, and
> so on.

The complexity in most of these suggestions is beginning to drive me a
bit crazy :-)

This should be the simplest thing in the world: when a TX queue has
space, give it packets. Period. When I hear suggestions like "have the
driver pick the queue in ->hard_start_xmit() and return some special
status if the queue becomes different", you know, I really begin to
wonder :-)

If we have to go back, get into the queueing layer locks, have these
special cases, and whatnot, what's the point?

This code should eventually be able to run lockless all the way to the
TX queue handling code of the driver. The queueing code should know what
TX queue the packet will be bound for, and always precisely invoke the
driver in a state where the driver can accept the packet.

Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it.
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Hi Dave,

David Miller <[EMAIL PROTECTED]> wrote on 10/09/2007 04:32:55 PM:

> > Isn't it enough that the multiqueue+batching drivers handle skbs
> > belonging to different queues themselves, instead of the qdisc having
> > to figure that out? This will reduce costs for most skbs that are
> > neither batched nor sent to multiqueue devices.
> >
> > Eg, the driver can keep processing skbs and put them on the correct
> > tx_queue as long as the mapping remains the same. If the mapping
> > changes, it posts the earlier skbs (with the correct lock) and then
> > iterates over the other skbs that have the next different mapping,
> > and so on.
>
> The complexity in most of these suggestions is beginning to drive me a
> bit crazy :-)
>
> This should be the simplest thing in the world, when TX queue has
> space, give it packets. Period. When I hear suggestions like "have the
> driver pick the queue in ->hard_start_xmit() and return some special
> status if the queue becomes different", you know, I really begin to
> wonder :-)
>
> If we have to go back, get into the queueing layer locks, have these
> special cases, and whatnot, what's the point?

I understand your point, but the qdisc code itself needs almost no
change, as small as:

qdisc_restart()
{
	...
	case NETDEV_TX_MAPPING_CHANGED:
		/*
		 * Driver sent some skbs from one mapping, and found others
		 * are for different queue_mapping. Try again.
		 */
		ret = 1;	/* guaranteed to have at least 1 skb in batch list */
		break;
	...
}

Alternatively, if the driver does all the dirty work, the qdisc needs no
change at all. However, I am not sure if this addresses all the concerns
raised by you, Peter, Jamal and others.

> This code should eventually be able to run lockless all the way to the
> TX queue handling code of the driver. The queueing code should know
> what TX queue the packet will be bound for, and always precisely invoke
> the driver in a state where the driver can accept the packet.

This sounds like a good idea :) I need to think more on this, esp. as my
batching sends multiple skbs of possibly different mappings to the
device, and those skbs stay in the batch list if the driver couldn't
send them out.

thanks,

- KK
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller <[EMAIL PROTECTED]> wrote on 10/09/2007 04:32:55 PM:

> Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it.

Great, this will make life easy. Any idea how long that would take? It
seems simple enough to do.

thanks,

- KK
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Krishna Kumar2 <[EMAIL PROTECTED]>
Date: Tue, 9 Oct 2007 16:51:14 +0530

> David Miller <[EMAIL PROTECTED]> wrote on 10/09/2007 04:32:55 PM:
> > Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
> > it.
>
> Great, this will make life easy. Any idea how long that would take?
> It seems simple enough to do.

I'd say we can probably try to get rid of it in 2.6.25, this is assuming
we get driver authors to cooperate and do the conversions, or
alternatively some other motivated person.

I can just threaten to do them all and that should get the driver
maintainers going :-)
Re: [ofa-general] question regarding umad_recv
On Tue, 2007-10-09 at 13:01 +0530, Sumit Gaur - Sun Microsystem wrote:
> Hi,
> This is regarding the *umad_recv* function of the libibumad/src/umad.c
> file. Is it not possible to receive only MADs of a specific type, GSI
> or SMI? My impression is that if I have two separate threads to send
> and receive, then I could send MADs to QP 0 or 1 depending on whether
> the MAD is SMI or GSI, but the receive side has no control over it.
> Please suggest a workaround if there is one.

See umad_register().

-- Hal

> Thanks and Regards
> sumit
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller wrote:
> From: Krishna Kumar2 <[EMAIL PROTECTED]>
> Date: Tue, 9 Oct 2007 16:51:14 +0530
>
> > David Miller <[EMAIL PROTECTED]> wrote on 10/09/2007 04:32:55 PM:
> > > Ignore LLTX, it sucks, it was a big mistake, and we will get rid
> > > of it.
> >
> > Great, this will make life easy. Any idea how long that would take?
> > It seems simple enough to do.
>
> I'd say we can probably try to get rid of it in 2.6.25, this is
> assuming we get driver authors to cooperate and do the conversions or
> alternatively some other motivated person.
>
> I can just threaten to do them all and that should get the driver
> maintainers going :-)

What, like this?  :)

	Jeff

 drivers/net/atl1/atl1_main.c   |   16 +---
 drivers/net/chelsio/cxgb2.c    |    1 -
 drivers/net/chelsio/sge.c      |   20 +---
 drivers/net/e1000/e1000_main.c |    6 +-
 drivers/net/ixgb/ixgb_main.c   |   24 --------
 drivers/net/pasemi_mac.c       |    2 +-
 drivers/net/rionet.c           |   19 +++
 drivers/net/spider_net.c       |    2 +-
 drivers/net/sungem.c           |   17 ++---
 drivers/net/tehuti.c           |   12 +---
 drivers/net/tehuti.h           |    3 +--
 11 files changed, 32 insertions(+), 90 deletions(-)

diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
index 4c728f1..03e94fe 100644
--- a/drivers/net/atl1/atl1_main.c
+++ b/drivers/net/atl1/atl1_main.c
@@ -1665,10 +1665,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	len -= skb->data_len;
 
-	if (unlikely(skb->len == 0)) {
-		dev_kfree_skb_any(skb);
-		return NETDEV_TX_OK;
-	}
+	WARN_ON(skb->len == 0);
 
 	param.data = 0;
 	param.tso.tsopu = 0;
@@ -1703,11 +1700,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		}
 	}
 
-	if (!spin_trylock_irqsave(&adapter->lock, flags)) {
-		/* Can't get lock - tell upper layer to requeue */
-		dev_printk(KERN_DEBUG, &adapter->pdev->dev, "tx locked\n");
-		return NETDEV_TX_LOCKED;
-	}
+	spin_lock_irqsave(&adapter->lock, flags);
 
 	if (atl1_tpd_avail(&adapter->tpd_ring) < count) {
 		/* not enough descriptors */
@@ -1749,8 +1742,11 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	atl1_tx_map(adapter, skb, 1 == val);
 	atl1_tx_queue(adapter, count, &param);
 	netdev->trans_start = jiffies;
+
 	spin_unlock_irqrestore(&adapter->lock, flags);
+
 	atl1_update_mailbox(adapter);
+
 	return NETDEV_TX_OK;
 }
@@ -2301,8 +2297,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 	 */
 	/* netdev->features |= NETIF_F_TSO; */
 
-	netdev->features |= NETIF_F_LLTX;
-
 	/*
 	 * patch for some L1 of old version,
 	 * the final version of L1 may not need these

diff --git a/drivers/net/chelsio/cxgb2.c b/drivers/net/chelsio/cxgb2.c
index 2dbf8dc..0aba7e7 100644
--- a/drivers/net/chelsio/cxgb2.c
+++ b/drivers/net/chelsio/cxgb2.c
@@ -1084,7 +1084,6 @@ static int __devinit init_one(struct pci_dev *pdev,
 		netdev->mem_end = mmio_start + mmio_len - 1;
 		netdev->priv = adapter;
 		netdev->features |= NETIF_F_SG | NETIF_F_IP_CSUM;
-		netdev->features |= NETIF_F_LLTX;
 
 		adapter->flags |= RX_CSUM_ENABLED | TCP_CSUM_CAPABLE;
 		if (pci_using_dac)

diff --git a/drivers/net/chelsio/sge.c b/drivers/net/chelsio/sge.c
index ffa7e64..84f5869 100644
--- a/drivers/net/chelsio/sge.c
+++ b/drivers/net/chelsio/sge.c
@@ -1739,8 +1739,7 @@ static int t1_sge_tx(struct sk_buff *skb, struct adapter *adapter,
 	struct cmdQ *q = &sge->cmdQ[qid];
 	unsigned int credits, pidx, genbit, count, use_sched_skb = 0;
 
-	if (!spin_trylock(&q->lock))
-		return NETDEV_TX_LOCKED;
+	spin_lock(&q->lock);
 
 	reclaim_completed_tx(sge, q);
@@ -1817,12 +1816,12 @@ use_sched:
 	}
 
 	if (use_sched_skb) {
-		if (spin_trylock(&q->lock)) {
-			credits = q->size - q->in_use;
-			skb = NULL;
-			goto use_sched;
-		}
+		spin_lock(&q->lock);
+		credits = q->size - q->in_use;
+		skb = NULL;
+		goto use_sched;
 	}
+
 	return NETDEV_TX_OK;
 }
@@ -1977,13 +1976,12 @@ static void sge_tx_reclaim_cb(unsigned long data)
 	for (i = 0; i < SGE_CMDQ_N; ++i) {
 		struct cmdQ *q = &sge->cmdQ[i];
 
-		if (!spin_trylock(&q->lock))
-			continue;
+		spin_lock(&q->lock);
 
 		reclaim_completed_tx(sge, q);
-		if (i == 0 && q->in_use) {	/* flush pending credits */
+		if (i == 0 && q->in_use)	/* flush pending credits */
 			writel(F_CMDQ0_ENABLE,
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote: David Miller wrote: I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :)

Awesome :)

--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Herbert Xu wrote: On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote: David Miller wrote: I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :) Awesome :)

Note my patch is just to get the maintainers going. :) I'm not going to commit that, since I don't have any way to test any of the drivers I touched (but I wouldn't scream if it appeared in net-2.6.24 either).

	Jeff
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, 2007-09-10 at 08:39 +0530, Krishna Kumar2 wrote: Driver might ask for 10 and we send 10, but LLTX driver might fail to get lock and return TX_LOCKED. I haven't seen your code in greater detail, but don't you requeue in that case too?

For other drivers that are non-batching and LLTX, it is possible - at the moment in my patch i whine that the driver is buggy. I will fix this up so it checks for NETIF_F_BTX. Thanks for pointing out the above use case.

cheers, jamal
[ofa-general] Re: [PATCHES] TX batching
On Tue, 2007-09-10 at 13:44 +0530, Krishna Kumar2 wrote: My feeling is that since the approaches are very different,

My concern is the approaches are different only for short periods of time. For example, I do requeueing, have xmit_win, have ->end_xmit, do batching from core etc; if you see value in any of these concepts, they will appear in your patches and this goes on in a loop. Perhaps what we need is a referee and to use our energies in something more positive.

it would be a good idea to test the two for performance.

Which i dont mind as long as it has an analysis that goes with it. If all you post is "here's what netperf showed", it is not useful at all. There are also a lot of affecting variables. For example, is the receiver a bottleneck? To make it worse, I could demonstrate to you that if i slowed down the driver and allowed more packets to queue up on the qdisc, batching will do well. In the past my feeling is you glossed over such details and i am a sucker for things like that - hence the conflict.

Do you mind me doing that? Of course others and/or you are more than welcome to do the same. I had sent a note to you yesterday about this, please let me know either way.

I responded to you - but it may have been lost in the noise; here's a copy: http://marc.info/?l=linux-netdev&m=119185137124008&w=2

cheers, jamal
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
On Mon, 8 Oct 2007, Steve Wise wrote: The correct solution, IMO, is to enhance the core low level 4-tuple allocation services to be more generic (eg: not be tied to a struct sock). Then the host tcp stack and the host rdma stack can allocate TCP/iWARP ports/4-tuples from this common exported service and share the port space. This allocation service could also be used by other deep adapters like iscsi adapters if needed.

As a developer of an RDMA ULP, NFS-RDMA, I like this approach because it will simplify the configuration of an RDMA device and the services that use it.
Re: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24
Roland Dreier wrote: No mention about the iwarp port space issue? I don't think we're at a stage where I'm prepared to merge something -- we all agree the latest patch has serious drawbacks, and it commits us to a suboptimal interface that is userspace-visible.

Fair enough. I'm at a loss as to how to proceed. Could we try to do some cleanups to the net core to make the alias stuff less painful? eg is there any sane way to make it possible for a device that creates 'eth0' to also create an 'iw0' alias without assigning an address?

Well, alias interfaces really don't exist. ethX:iw is really just adding an address record (struct in_ifaddr) to ethX. So in the current core design, adding an alias without an address is really adding the alias with address 0.0.0.0. And I think the core net code assumes that if an in_ifaddr struct exists for a device, then its IP address is indeed valid. So I think the changes wouldn't be small to enhance the design to add a concept of an alias interface. I'll look into this more though.

Steve.
Re: [ofa-general] Re: parallel networking
At 06:53 PM 10/8/2007, Jeff Garzik wrote: David Miller wrote: From: Jeff Garzik [EMAIL PROTECTED] Date: Mon, 08 Oct 2007 10:22:28 -0400 In terms of overall parallelization, both for TX as well as RX, my gut feeling is that we want to move towards an MSI-X, multi-core friendly model where packets are LIKELY to be sent and received by the same set of [cpus | cores | packages | nodes] that the [userland] processes dealing with the data.

The problem is that the packet schedulers want global guarantees on packet ordering, not flow centric ones. That is the issue Jamal is concerned about.

Oh, absolutely. I think, fundamentally, any amount of cross-flow resource management done in software is an obstacle to concurrency. That's not a value judgement, just a statement of fact.

Correct. Traffic cops are intentional bottlenecks we add to the process, to enable features like priority flows, filtering, or even simple socket fairness guarantees. Each of those bottlenecks serves a valid purpose, but at the end of the day, it's still a bottleneck. So, improving concurrency may require turning off useful features that nonetheless hurt concurrency.

Software needs to get out of the main data path - another fact of life.

The more I think about it, the more inevitable it seems that we really might need multiple qdiscs, one for each TX queue, to pull this full parallelization off. But the semantics of that don't smell so nice either. If the user attaches a new qdisc to ethN, does it go to all the TX queues, or what? All of the traffic shaping technology deals with the device as a unary object. It doesn't fit multi-queue at all.

Well, the easy solutions to networking concurrency are
* use virtualization to carve up the machine into chunks
* use multiple net devices
Since new NIC hardware is actively trying to be friendly to multi-channel/virt scenarios, either of these is reasonably straightforward given the current state of the Linux net stack. Using multiple net devices is especially attractive because it works very well with the existing packet scheduling. Both unfortunately impose a burden on the developer and admin, to force their apps to distribute flows across multiple [VMs | net devs]. Not the most optimal approach.

The third alternative is to use a single net device, with SMP-friendly packet scheduling. Here you run into the problems you described ("device as a unary object", etc.) with the current infrastructure. With multiple TX rings, consider that we are pushing the packet scheduling from software to hardware... which implies
* hardware-specific packet scheduling
* some TC/shaping features not available, because hardware doesn't support it

For a number of years now, we have designed interconnects to support a reasonable range of arbitration capabilities among hardware resource sets. With reasonable classification by software to identify hardware resource sets (ideally interpretation of the application's view of its priority combined with policy management software that determines how that should map among competing application views), one can eliminate most of the CPU cycles spent in today's implementations. I and others presented a number of these concepts many years ago during the development which eventually led to IB and iWARP.
- Each resource set can be assigned to a unique PCIe function or a function group to enable function / group arbitration to the PCIe link.
- Each resource set can be assigned to a unique PCIe TC and with improved ordering hints (coming soon) can be used to eliminate false ordering dependencies.
- Each resource set can be assigned to a unique IB TC / SL or iWARP 802.1p to signal priority. These can then be used to program respective link arbitration as well as path selection to enable multi-path load balancing.
- Many IHVs have picked up on the arbitration capabilities and extended them as shown years ago by a number of us to enable resource set arbitration and a variety of QoS based policies.

If software defines a reasonable (i.e. small) number of management and control knobs, then these can be easily mapped to most h/w implementations. Some of us are working on how to do this for virtualized environments and I expect these to be applicable to all environments in the end.

One other key item to keep in mind is that unless there is contention in the system, the majority of the QoS mechanisms are meaningless and in a very large percentage of customer environments, they simply don't scale with device and interconnect performance. Many applications in fact remain processor / memory constrained and therefore do not stress the I/O subsystem or the external interconnects, making most of the software mechanisms rather moot in real customer environments. Simple truth is it is nearly always cheaper to over-provision the I/O / interconnects than to use the software approach which while quite
[ofa-general] SDP ?
Hi all, I'm working on porting SDP to OpenSolaris and am looking at a compile error that I get. Essentially, I have a conflict of types on the compile:

bash-3.00$ /opt/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I.. -g -D_POSIX_PTHREAD_SEMANTICS -DSYSCONFDIR=\"/usr/local/etc\" -g -D_POSIX_PTHREAD_SEMANTICS -c port.c -KPIC -DPIC -o .libs/port.o
"port.c", line 1896: identifier redeclared: getsockname
	current : function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to unsigned int) returning int
	previous: function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to void) returning int : /usr/include/sys/socket.h, line 436

Line 436 in /usr/include/sys/socket.h:

	extern int getsockname(int, struct sockaddr *_RESTRICT_KYWD, Psocklen_t);

and Psocklen_t:

	#if defined(_XPG4_2) || defined(_BOOT)
	typedef socklen_t	*_RESTRICT_KYWD Psocklen_t;
	#else
	typedef void		*_RESTRICT_KYWD Psocklen_t;
	#endif	/* defined(_XPG4_2) || defined(_BOOT) */

Do I need to change port.c getsockname to type void * ?

Thanks, Jim
Re: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24
Roland, I submitted an updated patch incorporating some of Sean's comments within a day or two. The rest of the comments pertained to restructuring the code and adding some additional module parameters. This would require more discussion, since some of these had already been discussed previously. We had decided upon this code structure after a lot of discussions, and incorporating these would be undoing some of that.

Can you give a link to your current final version of the patch?

Sean, what's your opinion of where we stand? Since module parameters create a userspace-visible interface that we are stuck with for a long time, we definitely have to get at least that much right before merging.

- R.
Re: [ofa-general] librdmacm feature request
It shouldn't be too hard. Assuming you handle the modify channel as a synchronous action, the thread calling modify channel can't also be in rdma_get_cm_event at the same time. So, if you get there and someone is blocking on that channel and just hasn't been scheduled to run yet, then leave the event where it is while you switch the channel and send new events to the new channel. If they aren't, then move any pending events to the new channel as you do the change.

Hmm, how do you move events? Keep in mind that there may be an arbitrary number of pending events that belong to other cm_ids that are queued before the events you want to move. And you can't really do anything too funky with the event channel fd, because you don't want to mess up some other thread that might be waiting for events in poll() or whatever.

- R.
[ofa-general] [PATCH] core: Check that the function reg_phys_mr is not NULL before executing it
Check that the function reg_phys_mr is not NULL before executing it. There are devices (for example, mlx4) whose low-level driver doesn't support this verb, so this patch will prevent a kernel oops on them.

Signed-off-by: Dotan Barak [EMAIL PROTECTED]
---

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 86ed8af..e2d54cb 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -672,6 +672,9 @@ struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
 {
 	struct ib_mr *mr;

+	if (!pd->device->reg_phys_mr)
+		return ERR_PTR(-ENOSYS);
+
 	mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf,
 				     mr_access_flags, iova_start);
[ofa-general] RLIMIT_MEMLOCK
We have run into this problem with using mpiexec. SLES 10 is on the cluster, and we have set the limits under /etc/security/limits.conf and they work there; even mpirun commands work fine, but when tying them all together using mpiexec it still comes back with the 32K memory limit. Any and all users can log in, and in bash `ulimit -a` and in tcsh `limit` both report the correct full memory limits, but when using mpiexec under both shells they get the 32K limit.

Any suggestions?

thanks
--
Adam Miller
The College of William and Mary
Virginia Institute of Marine Science
-Infrastructure Services Architect-
-Information Technology and Networking Services-
Watermens Hall
Mail: P.O. Box 1346
Deliveries: Route 1208, Greate Road
Gloucester Point, VA 23062-1346, USA
p(804)684-7077 f(804)684-7097
email: [EMAIL PROTECTED] email cell: [EMAIL PROTECTED]
[ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3
Steve Wise wrote: Vlad/Tziporet, Can you please pull version 1.0.3 of libcxgb3 for inclusion in ofed-1.2.5 and ofed-1.3? It contains a bug fix for older kernels like RHEL4U4. You can use the master branch for both releases: git://git.openfabrics.org/~swise/libcxgb3.git master Also, please update the spec file you're using to reflect the release (1.0.3). The spec file in the libcxgb3 git tree should be correct. Thanks, Steve.

Done,

Regards, Vladimir
[ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3
Thanks Vlad, Can you crank out an ofed-1.2.5 development build too?

Thanks, Steve.

Vladimir Sokolovsky wrote: Steve Wise wrote: Vlad/Tziporet, Can you please pull version 1.0.3 of libcxgb3 for inclusion in ofed-1.2.5 and ofed-1.3? It contains a bug fix for older kernels like RHEL4U4. You can use the master branch for both releases: git://git.openfabrics.org/~swise/libcxgb3.git master Also, please update the spec file you're using to reflect the release (1.0.3). The spec file in the libcxgb3 git tree should be correct. Thanks, Steve. Done, Regards, Vladimir
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller [EMAIL PROTECTED] writes: 2) Switch the default qdisc away from pfifo_fast to a new DRR fifo with load balancing using the code in #1. I think this is kind of in the territory of what Peter said he is working on.

Hopefully that new qdisc will just use the TX rings of the hardware directly. They are typically large enough these days. That might avoid some locking in this critical path.

I know this is controversial, but realistically I doubt users benefit at all from the prioritization that pfifo provides.

I agree. For most interfaces the priority is probably dubious. Even for DSL the prioritization will likely be done in a router these days. Also, for the fast interfaces where we do TSO, priority doesn't work very well anyways -- with large packets there is not too much to prioritize.

3) Work on discovering a way to make the locking on transmit as localized to the current thread of execution as possible. Things like RCU and statistic replication, techniques we use widely elsewhere in the stack, begin to come to mind.

If the data is just passed on to the hardware queue, why is any locking needed at all? (except for the driver locking of course)

-Andi
Re: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24
Can you give a link to your current final version of the patch? Sean, what's your opinion of where we stand? Let me look back over the last version that was sent and reply back later today or tomorrow. Several of my initial comments were on code structure. Since module parameters create a userspace-visible interface that we are stuck with for a long time, we definitely have to get at least that much right before merging. I was taking a slightly different view of the design. It would be nice to agree on whether SRQ should be separated from the QP type before merging upstream, even if the implementation doesn't immediately support all available options. - Sean
[ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3
Steve Wise wrote: Thanks Vlad, Can you crank out an ofed-1.2.5 development build too? Thanks, Steve.

Done: http://www.openfabrics.org/builds/connectx/OFED-1.2.5-20071009-0955.tgz

Regards, Vladimir
[ofa-general] [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests
Deadlock condition reported by Kanoj Sarcar [EMAIL PROTECTED].

The deadlock occurs when a connection request arrives at the same time that a wildcard listen is being destroyed.

A wildcard listen maintains per device listen requests for each RDMA device in the system. The per device listens are automatically added and removed when RDMA devices are inserted or removed from the system. When a wildcard listen is destroyed, rdma_destroy_id() acquires the rdma_cm's device mutex ('lock') to protect against hot-plug events adding or removing per device listens. It then tries to destroy the per device listens by calling ib_destroy_cm_id() or iw_destroy_cm_id(). It does this while holding the device mutex.

However, if the underlying iw/ib CM reports a connection request while this is occurring, the rdma_cm callback function will try to acquire the same device mutex. Since we're in a callback, the ib_destroy_cm_id() or iw_destroy_cm_id() calls will block until their callback thread returns, but the callback is blocked waiting for the device mutex.

Fix this by re-working how per device listens are destroyed. Use rdma_destroy_id(), which avoids the deadlock, in place of cma_destroy_listen(). Additional synchronization is added to handle device hot-plug events and ensure that the id is not destroyed twice.

Signed-off-by: Sean Hefty [EMAIL PROTECTED]
---
Fix from discussion started at:
http://lists.openfabrics.org/pipermail/general/2007-October/041456.html

Kanoj, please verify that this fix looks correct and works for you, and I will queue it for 2.6.24.

 drivers/infiniband/core/cma.c |   70 +
 1 files changed, 23 insertions(+), 47 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9ffb998..21ea92c 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -113,11 +113,12 @@ struct rdma_id_private {
 	struct rdma_bind_list	*bind_list;
 	struct hlist_node	node;
-	struct list_head	list;
-	struct list_head	listen_list;
+	struct list_head	list; /* listen_any_list or cma_device.list */
+	struct list_head	listen_list; /* per device listens */
 	struct cma_device	*cma_dev;
 	struct list_head	mc_list;

+	int			internal_id;
 	enum cma_state		state;
 	spinlock_t		lock;
 	struct completion	comp;
@@ -715,50 +716,27 @@ static void cma_cancel_route(struct rdma_id_private *id_priv)
 	}
 }

-static inline int cma_internal_listen(struct rdma_id_private *id_priv)
-{
-	return (id_priv->state == CMA_LISTEN) && id_priv->cma_dev &&
-	       cma_any_addr(&id_priv->id.route.addr.src_addr);
-}
-
-static void cma_destroy_listen(struct rdma_id_private *id_priv)
-{
-	cma_exch(id_priv, CMA_DESTROYING);
-
-	if (id_priv->cma_dev) {
-		switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
-		case RDMA_TRANSPORT_IB:
-			if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib))
-				ib_destroy_cm_id(id_priv->cm_id.ib);
-			break;
-		case RDMA_TRANSPORT_IWARP:
-			if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw))
-				iw_destroy_cm_id(id_priv->cm_id.iw);
-			break;
-		default:
-			break;
-		}
-		cma_detach_from_dev(id_priv);
-	}
-	list_del(&id_priv->listen_list);
-
-	cma_deref_id(id_priv);
-	wait_for_completion(&id_priv->comp);
-
-	kfree(id_priv);
-}
-
 static void cma_cancel_listens(struct rdma_id_private *id_priv)
 {
 	struct rdma_id_private *dev_id_priv;

+	/*
+	 * Remove from listen_any_list to prevent added devices from spawning
+	 * additional listen requests.
+	 */
 	mutex_lock(&lock);
 	list_del(&id_priv->list);

 	while (!list_empty(&id_priv->listen_list)) {
 		dev_id_priv = list_entry(id_priv->listen_list.next,
 					 struct rdma_id_private, listen_list);
-		cma_destroy_listen(dev_id_priv);
+		/* sync with device removal to avoid duplicate destruction */
+		list_del_init(&dev_id_priv->list);
+		list_del(&dev_id_priv->listen_list);
+		mutex_unlock(&lock);
+
+		rdma_destroy_id(&dev_id_priv->id);
+		mutex_lock(&lock);
 	}
 	mutex_unlock(&lock);
 }
@@ -846,6 +824,9 @@ void rdma_destroy_id(struct rdma_cm_id *id)
 	cma_deref_id(id_priv);
 	wait_for_completion(&id_priv->comp);

+	if (id_priv->internal_id)
+		cma_deref_id(id_priv->id.context);
+
 	kfree(id_priv->id.route.path_rec);
 	kfree(id_priv);
 }
@@ -1401,14 +1382,13 @@ static void cma_listen_on_dev(struct rdma_id_private
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On 09 Oct 2007 18:51:51 +0200 Andi Kleen [EMAIL PROTECTED] wrote: David Miller [EMAIL PROTECTED] writes: 2) Switch the default qdisc away from pfifo_fast to a new DRR fifo with load balancing using the code in #1. I think this is kind of in the territory of what Peter said he is working on. Hopefully that new qdisc will just use the TX rings of the hardware directly. They are typically large enough these days. That might avoid some locking in this critical path. I know this is controversial, but realistically I doubt users benefit at all from the prioritization that pfifo provides. I agree. For most interfaces the priority is probably dubious. Even for DSL the prioritization will likely be done in a router these days. Also, for the fast interfaces where we do TSO, priority doesn't work very well anyways -- with large packets there is not too much to prioritize. 3) Work on discovering a way to make the locking on transmit as localized to the current thread of execution as possible. Things like RCU and statistic replication, techniques we use widely elsewhere in the stack, begin to come to mind. If the data is just passed on to the hardware queue, why is any locking needed at all? (except for the driver locking of course) -Andi

I wonder about the whole idea of queueing in general at such high speeds. Given the normal bi-modal distribution of packets, and the predominance of 1500 byte MTU, does it make sense to even have any queueing in software at all?

--
Stephen Hemminger [EMAIL PROTECTED]
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
I wonder about the whole idea of queueing in general at such high speeds. Given the normal bi-modal distribution of packets, and the predominance of 1500 byte MTU, does it make sense to even have any queueing in software at all?

Yes, that is my point -- it should just pass it through directly, and the driver can then put it into the different per-CPU (or per whatever) queues managed by the hardware. The only thing the qdisc needs to do is to set some bit that says it is ok to put this into different queues; it doesn't need strict ordering. Otherwise, if the drivers did that unconditionally, they might cause problems with other qdiscs.

This would also require that the driver exports some hint to the upper layer on how large its internal queues are. A device with a short queue would still require pfifo_fast. Long queue devices could just pass through. That again could be a single flag.

-Andi
[ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests
Sean, I will take a look at your code changes and comment, and hopefully be able to run a quick test on your patch within this week.

Just so I understand, did you discover problems (maybe preexisting race conditions) with my previously posted patch? If yes, please point them out, so it's easier to review yours; if not, I will assume your patch implements a better locking scheme and review it as such.

Thanks.

Kanoj

Sean Hefty wrote: Deadlock condition reported by Kanoj Sarcar [EMAIL PROTECTED]. The deadlock occurs when a connection request arrives at the same time that a wildcard listen is being destroyed.
Re: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24
Roland Dreier wrote: Roland, I submitted an updated patch incorporating some of Sean's comments within a day or two. Rest of comments pertained to restructuring the code and adding some additional module parameters. This would require more discussions since some of these had been already discussed previously. We had decided upon this code structure after a lot of discussions and incorporating these would be undoing some of that. Can you give a link to your current final version of the patch?

Roland, This is the link to the last one that I submitted on 09/18.
http://lists.openfabrics.org/pipermail/general/2007-September/040917.html

Pradeep
[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching
> IMO the net driver really should provide a hint as to what it wants.
>
> 8139cp and tg3 would probably prefer multiple TX queue behavior to
> match silicon behavior -- strict prio.

If I understand what you just said, I disagree.  If your hardware is
running strict prio, you don't want to enforce strict prio in the qdisc
layer; performing two layers of QoS is excessive, and may lead to
results you don't want.

The reason I added the DRR qdisc is for the Si that has its own queueing
strategy that is not RR.  For Si that implements RR (like e1000), you
can either use the DRR qdisc, or, if you want to prioritize your flows,
use PRIO.

-PJ Waskiewicz [EMAIL PROTECTED]
Re: [ofa-general] Has libmlx4 been released?
> Looking at git://git.kernel.org/pub/scm/libs/infiniband/libmlx4.git
> I don't see any tags or branches.

That's right, I haven't made any real release yet.

> If not, when is the initial release planned?

Soon I guess.  I don't know of any outstanding issues so it's just a
matter of doing a release.

 - R.
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Waskiewicz Jr, Peter P wrote:
> > IMO the net driver really should provide a hint as to what it wants.
> >
> > 8139cp and tg3 would probably prefer multiple TX queue behavior to
> > match silicon behavior -- strict prio.
>
> If I understand what you just said, I disagree.  If your hardware is
> running strict prio, you don't want to enforce strict prio in the
> qdisc layer; performing two layers of QoS is excessive, and may lead
> to results you don't want.
>
> The reason I added the DRR qdisc is for the Si that has its own
> queueing strategy that is not RR.  For Si that implements RR (like
> e1000), you can either use the DRR qdisc, or if you want to
> prioritize your flows, use PRIO.

A misunderstanding, I think.  To my brain, DaveM's item #2 seemed to
assume/require the NIC hardware to balance fairly across hw TX rings,
which seemed to preclude the 8139cp/tg3 style of strict-prio hardware.
That's what I was responding to.

As long as there is some modular way to fit 8139cp/tg3 style multi-TX
into our universe, I'm happy :)

	Jeff
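For readers following the thread: the deficit round robin (DRR) idea mentioned above can be sketched in a few lines of C. This is a toy model with made-up types and an arbitrary quantum, not the kernel's actual DRR qdisc implementation; each "flow" is just an array of queued packet lengths.

```c
#include <stddef.h>

/*
 * Toy deficit round robin: each round adds a quantum of byte credit to
 * every backlogged flow, then dequeues packets while credit lasts.  A
 * flow that goes idle forfeits its accumulated credit.
 */
#define QUANTUM 500

struct flow {
	const int *pkt_len;	/* lengths of queued packets */
	int count;		/* total packets queued */
	int next;		/* index of the packet at the head */
	int deficit;		/* accumulated credit in bytes */
};

/* Serve one DRR round; record which flow sent each packet in order[]. */
static int drr_round(struct flow *flows, int nflows, int *order, int max)
{
	int sent = 0;

	for (int i = 0; i < nflows; i++) {
		struct flow *f = &flows[i];

		if (f->next >= f->count) {
			f->deficit = 0;	/* idle flows keep no credit */
			continue;
		}
		f->deficit += QUANTUM;
		while (f->next < f->count &&
		       f->pkt_len[f->next] <= f->deficit) {
			f->deficit -= f->pkt_len[f->next++];
			if (sent < max)
				order[sent] = i;
			sent++;
		}
	}
	return sent;
}
```

With 300-byte packets on one flow and 400-byte packets on another, each round interleaves the two flows in rough proportion to the quantum, which is the fairness property the qdisc discussion above is about.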
Re: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests
> Just so I understand, did you discover problems (maybe preexisting
> race conditions) with my previously posted patch?  If yes, please
> point it out, so its easier to review yours; if not, I will assume
> your patch implements a better locking scheme and review it as such.

I tried to explain the issue somewhat in my commit message and code
comments.  The issue is synchronizing cleanup of the listen_list with
device removal.

When an RDMA device is added to the system, a new listen request is
added for all wildcard listens.  Since the original locking held the
mutex throughout the cleanup of the listen list, it prevented adding
another listen request during that same time.  Similar protection was
there for handling device removal.  When a device is removed from the
system, all internal listen requests associated with that device are
destroyed.  If the associated wildcard listen is also being destroyed,
we need to ensure that we don't try to destroy the same listen twice.

My patch, like yours, ends up releasing the mutex while cleaning up the
listen_list.  I chose to eliminate the cma_destroy_listen() call and
use rdma_destroy_id() as the single destruction path instead.  This
keeps the locking contained to a single function.  (I don't like
acquiring a lock in one call and releasing it in another; it puts too
much assumption on the caller.)

What was missing was ensuring that a device removal didn't try to
destroy the same listen request.  This is handled by adding the
list_del*() calls to cma_cancel_listens().  Whichever thread removes
the listening id from the device list is responsible for its
destruction.  And because that thread could be the device removal
thread, I added a reference from the per device listen to the wildcard
listen.

Hopefully this makes sense.

- Sean
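The pattern Sean describes -- unlink an entry under the mutex, let the unlinking thread own its destruction, and drop the mutex around the destruction itself -- can be sketched with toy types. Nothing below is the actual cma.c code; the struct, list, and destroy function are stand-ins for illustration only.

```c
#include <pthread.h>

/*
 * Toy version of the "whoever unlinks it destroys it" cleanup loop.
 * Destruction may sleep (rdma_destroy_id() can), so the mutex is
 * released around it and re-taken before looking at the list again.
 */
struct listen_entry {
	struct listen_entry *next;
	int destroyed;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void destroy_entry(struct listen_entry *e)
{
	e->destroyed = 1;	/* stands in for rdma_destroy_id() */
}

static int cancel_listens(struct listen_entry **head)
{
	int n = 0;

	pthread_mutex_lock(&lock);
	while (*head) {
		struct listen_entry *e = *head;

		*head = e->next;	/* unlink: we now own destruction */
		pthread_mutex_unlock(&lock);

		destroy_entry(e);	/* may sleep; must not hold lock */
		n++;

		pthread_mutex_lock(&lock);
	}
	pthread_mutex_unlock(&lock);
	return n;
}
```

Because the entry is off the list before the mutex is dropped, a concurrent remover that re-takes the mutex can never find (and double-destroy) the same entry -- the property the patch is after.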
[ofa-general] [PATCH] IB/ipath -- patches for 2.6.24
hi roland, here is our current batch of patches. i realize that they are a bit later than you would probably like, i'm sorry about that -- i hope they are straightforward enough to make it into your for-2.6.24 branch. these patches can be git pulled from: git://git.qlogic.com/ipath-linux-2.6 for-roland arthur ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH 01/23] IB/ipath -- iba6110 rev4 GPIO counters support
On iba6110 rev4, support for three more IB counters was added: the
LocalLinkIntegrityError counter, the ExcessiveBufferOverrunErrors
counter, and error counting of flow control packets on an invalid VL.
These counters trigger GPIO interrupts, and the sw keeps track of the
counts.  Since we also use GPIO interrupts to signal packet reception,
we need to turn off the fast interrupts, or we risk losing a GPIO
interrupt.

Signed-off-by: Arthur Jones [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_iba6110.c |    8 ++++++++
 drivers/infiniband/hw/ipath/ipath_intr.c    |    4 ++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c
index 650745d..e1c5998 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6110.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c
@@ -1559,6 +1559,14 @@ static int ipath_ht_early_init(struct ipath_devdata *dd)
 		ipath_dev_err(dd, "Unsupported InfiniPath serial number %.16s!\n",
 			      dd->ipath_serial);

+	if (dd->ipath_minrev >= 4) {
+		/* Rev4+ reports extra errors via internal GPIO pins */
+		dd->ipath_flags |= IPATH_GPIO_ERRINTRS;
+		dd->ipath_gpio_mask |= IPATH_GPIO_ERRINTR_MASK;
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask,
+				 dd->ipath_gpio_mask);
+	}
+
 	return 0;
 }

diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index b29fe7e..11b3614 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -1085,8 +1085,8 @@ irqreturn_t ipath_intr(int irq, void *data)
 		 * GPIO_2 indicates (on some HT4xx boards) that a packet
 		 *        has arrived for Port 0.  Checking for this
 		 *        is controlled by flag IPATH_GPIO_INTR.
-		 * GPIO_3..5 on IBA6120 Rev2 chips indicate errors
-		 *        that we need to count.  Checking for this
+		 * GPIO_3..5 on IBA6120 Rev2 and IBA6110 Rev4 chips indicate
+		 *        errors that we need to count.  Checking for this
 		 *        is controlled by flag IPATH_GPIO_ERRINTRS.
 		 */
 		u32 gpiostatus;
[ofa-general] [PATCH 02/23] IB/ipath - performance optimization for CPU differences
From: Ralph Campbell [EMAIL PROTECTED] Different processors have different ordering restrictions for write combining. By taking advantage of this, we can eliminate some write barriers when writing to the send buffers. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_diag.c| 22 +- drivers/infiniband/hw/ipath/ipath_iba6120.c |2 + drivers/infiniband/hw/ipath/ipath_kernel.h |2 + drivers/infiniband/hw/ipath/ipath_verbs.c | 62 --- 4 files changed, 53 insertions(+), 35 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_diag.c b/drivers/infiniband/hw/ipath/ipath_diag.c index cf25cda..4137c77 100644 --- a/drivers/infiniband/hw/ipath/ipath_diag.c +++ b/drivers/infiniband/hw/ipath/ipath_diag.c @@ -446,19 +446,21 @@ static ssize_t ipath_diagpkt_write(struct file *fp, dd-ipath_unit, plen - 1, pbufn); if (dp.pbc_wd == 0) - /* Legacy operation, use computed pbc_wd */ dp.pbc_wd = plen; - - /* we have to flush after the PBC for correctness on some cpus -* or WC buffer can be written out of order */ writeq(dp.pbc_wd, piobuf); - ipath_flush_wc(); - /* copy all by the trigger word, then flush, so it's written + /* +* Copy all by the trigger word, then flush, so it's written * to chip before trigger word, then write trigger word, then -* flush again, so packet is sent. */ - __iowrite32_copy(piobuf + 2, tmpbuf, clen - 1); - ipath_flush_wc(); - __raw_writel(tmpbuf[clen - 1], piobuf + clen + 1); +* flush again, so packet is sent. 
+*/ + if (dd-ipath_flags IPATH_PIO_FLUSH_WC) { + ipath_flush_wc(); + __iowrite32_copy(piobuf + 2, tmpbuf, clen - 1); + ipath_flush_wc(); + __raw_writel(tmpbuf[clen - 1], piobuf + clen + 1); + } else + __iowrite32_copy(piobuf + 2, tmpbuf, clen); + ipath_flush_wc(); ret = sizeof(dp); diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index 5b6ac9a..a324c6f 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -1273,6 +1273,8 @@ static void ipath_pe_tidtemplate(struct ipath_devdata *dd) static int ipath_pe_early_init(struct ipath_devdata *dd) { dd-ipath_flags |= IPATH_4BYTE_TID; + if (ipath_unordered_wc()) + dd-ipath_flags |= IPATH_PIO_FLUSH_WC; /* * For openfabrics, we need to be able to handle an IB header of diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 7a7966f..d983f92 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -724,6 +724,8 @@ int ipath_set_rx_pol_inv(struct ipath_devdata *dd, u8 new_pol_inv); #define IPATH_LINKACTIVE0x200 /* link current state is unknown */ #define IPATH_LINKUNK 0x400 + /* Write combining flush needed for PIO */ +#define IPATH_PIO_FLUSH_WC 0x1000 /* no IB cable, or no device on IB cable */ #define IPATH_NOCABLE 0x4000 /* Supports port zero per packet receive interrupts via diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 16aa61f..559d4a6 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -631,7 +631,7 @@ static inline u32 clear_upper_bytes(u32 data, u32 n, u32 off) #endif static void copy_io(u32 __iomem *piobuf, struct ipath_sge_state *ss, - u32 length) + u32 length, unsigned flush_wc) { u32 extra = 0; u32 data = 0; @@ -757,11 +757,14 @@ static void copy_io(u32 __iomem *piobuf, struct ipath_sge_state *ss, } 
/* Update address before sending packet. */ update_sge(ss, length); - /* must flush early everything before trigger word */ - ipath_flush_wc(); - __raw_writel(last, piobuf); - /* be sure trigger word is written */ - ipath_flush_wc(); + if (flush_wc) { + /* must flush early everything before trigger word */ + ipath_flush_wc(); + __raw_writel(last, piobuf); + /* be sure trigger word is written */ + ipath_flush_wc(); + } else + __raw_writel(last, piobuf); } /** @@ -776,6 +779,7 @@ int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, u32 *hdr, u32 len, struct ipath_sge_state *ss) { u32 __iomem *piobuf; + unsigned flush_wc; u32 plen; int ret; @@ -799,47 +803,55 @@ int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, * or WC buffer can be written out of order.
[ofa-general] [PATCH 03/23] IB/ipath - change UD to queue work requests like RC UC
From: Ralph Campbell [EMAIL PROTECTED] The code to post UD sends tried to process work requests at the time ib_post_send() is called without using a WQE queue. This was fine as long as HW resources were available for sending a packet. This patch changes UD to be handled more like RC and UC and shares more code. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_qp.c| 11 - drivers/infiniband/hw/ipath/ipath_rc.c| 61 +++-- drivers/infiniband/hw/ipath/ipath_ruc.c | 308 drivers/infiniband/hw/ipath/ipath_uc.c| 77 ++ drivers/infiniband/hw/ipath/ipath_ud.c| 372 ++--- drivers/infiniband/hw/ipath/ipath_verbs.c | 241 +-- drivers/infiniband/hw/ipath/ipath_verbs.h | 35 ++- 7 files changed, 494 insertions(+), 611 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index 1324b35..a8c4a6b 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -338,6 +338,7 @@ static void ipath_reset_qp(struct ipath_qp *qp) qp-s_busy = 0; qp-s_flags = IPATH_S_SIGNAL_REQ_WR; qp-s_hdrwords = 0; + qp-s_wqe = NULL; qp-s_psn = 0; qp-r_psn = 0; qp-r_msn = 0; @@ -751,6 +752,9 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, switch (init_attr-qp_type) { case IB_QPT_UC: case IB_QPT_RC: + case IB_QPT_UD: + case IB_QPT_SMI: + case IB_QPT_GSI: sz = sizeof(struct ipath_sge) * init_attr-cap.max_send_sge + sizeof(struct ipath_swqe); @@ -759,10 +763,6 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, ret = ERR_PTR(-ENOMEM); goto bail; } - /* FALLTHROUGH */ - case IB_QPT_UD: - case IB_QPT_SMI: - case IB_QPT_GSI: sz = sizeof(*qp); if (init_attr-srq) { struct ipath_srq *srq = to_isrq(init_attr-srq); @@ -805,8 +805,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, spin_lock_init(qp-r_rq.lock); atomic_set(qp-refcount, 0); init_waitqueue_head(qp-wait); - tasklet_init(qp-s_task, ipath_do_ruc_send, -(unsigned long)qp); + tasklet_init(qp-s_task, ipath_do_send, (unsigned long)qp); 
INIT_LIST_HEAD(qp-piowait); INIT_LIST_HEAD(qp-timerwait); qp-state = IB_QPS_RESET; diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 46744ea..53259da 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -81,9 +81,8 @@ static void ipath_init_restart(struct ipath_qp *qp, struct ipath_swqe *wqe) * Note that we are in the responder's side of the QP context. * Note the QP s_lock must be held. */ -static int ipath_make_rc_ack(struct ipath_qp *qp, -struct ipath_other_headers *ohdr, -u32 pmtu, u32 *bth0p, u32 *bth2p) +static int ipath_make_rc_ack(struct ipath_ibdev *dev, struct ipath_qp *qp, +struct ipath_other_headers *ohdr, u32 pmtu) { struct ipath_ack_entry *e; u32 hwords; @@ -192,8 +191,7 @@ static int ipath_make_rc_ack(struct ipath_qp *qp, } qp-s_hdrwords = hwords; qp-s_cur_size = len; - *bth0p = bth0 | (1 22); /* Set M bit */ - *bth2p = bth2; + ipath_make_ruc_header(dev, qp, ohdr, bth0, bth2); return 1; bail: @@ -203,32 +201,39 @@ bail: /** * ipath_make_rc_req - construct a request packet (SEND, RDMA r/w, ATOMIC) * @qp: a pointer to the QP - * @ohdr: a pointer to the IB header being constructed - * @pmtu: the path MTU - * @bth0p: pointer to the BTH opcode word - * @bth2p: pointer to the BTH PSN word * * Return 1 if constructed; otherwise, return 0. - * Note the QP s_lock must be held and interrupts disabled. 
*/ -int ipath_make_rc_req(struct ipath_qp *qp, - struct ipath_other_headers *ohdr, - u32 pmtu, u32 *bth0p, u32 *bth2p) +int ipath_make_rc_req(struct ipath_qp *qp) { struct ipath_ibdev *dev = to_idev(qp-ibqp.device); + struct ipath_other_headers *ohdr; struct ipath_sge_state *ss; struct ipath_swqe *wqe; u32 hwords; u32 len; u32 bth0; u32 bth2; + u32 pmtu = ib_mtu_enum_to_int(qp-path_mtu); char newreq; + unsigned long flags; + int ret = 0; + + ohdr = qp-s_hdr.u.oth; + if (qp-remote_ah_attr.ah_flags IB_AH_GRH) + ohdr = qp-s_hdr.u.l.oth; + + /* +* The lock is needed to synchronize between the sending tasklet, +* the receive interrupt handler,
[ofa-general] [PATCH 04/23] IB/ipath - Verify host bus bandwidth to chip will not limit performance
From: Dave Olson [EMAIL PROTECTED] There have been a number of issues where host bandwidth via HyperTransport or PCIe to the InfiniPath chip has been limited in some fashion (BIOS, configuration, etc.), resulting in user confusion. This check gives a clear warning that something is wrong and needs to be resolved. Signed-off-by: Dave Olson [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_driver.c | 85 1 files changed, 85 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 6ccba36..8fa2bb5 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -34,6 +34,7 @@ #include linux/spinlock.h #include linux/idr.h #include linux/pci.h +#include linux/io.h #include linux/delay.h #include linux/netdevice.h #include linux/vmalloc.h @@ -280,6 +281,88 @@ void __attribute__((weak)) ipath_disable_wc(struct ipath_devdata *dd) { } +/* + * Perform a PIO buffer bandwidth write test, to verify proper system + * configuration. Even when all the setup calls work, occasionally + * BIOS or other issues can prevent write combining from working, or + * can cause other bandwidth problems to the chip. + * + * This test simply writes the same buffer over and over again, and + * measures close to the peak bandwidth to the chip (not testing + * data bandwidth to the wire). On chips that use an address-based + * trigger to send packets to the wire, this is easy. On chips that + * use a count to trigger, we want to make sure that the packet doesn't + * go out on the wire, or trigger flow control checks. 
+ */ +static void ipath_verify_pioperf(struct ipath_devdata *dd) +{ + u32 pbnum, cnt, lcnt; + u32 __iomem *piobuf; + u32 *addr; + u64 msecs, emsecs; + + piobuf = ipath_getpiobuf(dd, pbnum); + if (!piobuf) { + dev_info(dd-pcidev-dev, + No PIObufs for checking perf, skipping\n); + goto done; + + } + + /* +* Enough to give us a reasonable test, less than piobuf size, and +* likely multiple of store buffer length. +*/ + cnt = 1024; + + addr = vmalloc(cnt); + if (!addr) { + dev_info(dd-pcidev-dev, + Couldn't get memory for checking PIO perf, +skipping\n); + goto done; + } + + + preempt_disable(); /* we want reasonably accurate elapsed time */ + msecs = 1 + jiffies_to_msecs(jiffies); + for (lcnt = 0; lcnt 1U; lcnt++) { + /* wait until we cross msec boundary */ + if (jiffies_to_msecs(jiffies) = msecs) + break; + udelay(1); + } + + writeq(0, piobuf); /* length 0, no dwords actually sent */ + ipath_flush_wc(); + + /* +* this is only roughly accurate, since even with preempt we +* still take interrupts that could take a while. 
Running for +* = 5 msec seems to get us close enough to accurate values +*/ + msecs = jiffies_to_msecs(jiffies); + for (emsecs = lcnt = 0; emsecs = 5UL; lcnt++) { + __iowrite32_copy(piobuf + 64, addr, cnt 2); + emsecs = jiffies_to_msecs(jiffies) - msecs; + } + + /* 1 GiB/sec, slightly over IB SDR line rate */ + if (lcnt (emsecs * 1024U)) + ipath_dev_err(dd, + Performance problem: bandwidth to PIO buffers is + only %u MiB/sec\n, + lcnt / (u32) emsecs); + else + ipath_dbg(PIO buffer bandwidth %u MiB/sec is OK\n, + lcnt / (u32) emsecs); + + preempt_enable(); +done: + if (piobuf) /* disarm it, so it's available again */ + ipath_disarm_piobufs(dd, pbnum, 1); +} + static int __devinit ipath_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) { @@ -515,6 +598,8 @@ static int __devinit ipath_init_one(struct pci_dev *pdev, ret = 0; } + ipath_verify_pioperf(dd); + ipath_device_create_group(pdev-dev, dd); ipathfs_add_device(dd); ipath_user_add(dd); ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH 05/23] IB/ipath - Remove unneeded code for ipathfs
From: Ralph Campbell [EMAIL PROTECTED] The ipathfs file system is used to export binary data verses ASCII data such as through /sys. This patch removes some unneeded files since the data is available through other /sys files. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_fs.c | 187 1 files changed, 0 insertions(+), 187 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c index 2e689b9..262c25d 100644 --- a/drivers/infiniband/hw/ipath/ipath_fs.c +++ b/drivers/infiniband/hw/ipath/ipath_fs.c @@ -130,175 +130,6 @@ static const struct file_operations atomic_counters_ops = { .read = atomic_counters_read, }; -static ssize_t atomic_node_info_read(struct file *file, char __user *buf, -size_t count, loff_t *ppos) -{ - u32 nodeinfo[10]; - struct ipath_devdata *dd; - u64 guid; - - dd = file-f_path.dentry-d_inode-i_private; - - guid = be64_to_cpu(dd-ipath_guid); - - nodeinfo[0] = /* BaseVersion is SMA */ - /* ClassVersion is SMA */ - (1 8)/* NodeType */ - | (1 0); /* NumPorts */ - nodeinfo[1] = (u32) (guid 32); - nodeinfo[2] = (u32) (guid 0x); - /* PortGUID == SystemImageGUID for us */ - nodeinfo[3] = nodeinfo[1]; - /* PortGUID == SystemImageGUID for us */ - nodeinfo[4] = nodeinfo[2]; - /* PortGUID == NodeGUID for us */ - nodeinfo[5] = nodeinfo[3]; - /* PortGUID == NodeGUID for us */ - nodeinfo[6] = nodeinfo[4]; - nodeinfo[7] = (4 16) /* we support 4 pkeys */ - | (dd-ipath_deviceid 0); - /* our chip version as 16 bits major, 16 bits minor */ - nodeinfo[8] = dd-ipath_minrev | (dd-ipath_majrev 16); - nodeinfo[9] = (dd-ipath_unit 24) | (dd-ipath_vendorid 0); - - return simple_read_from_buffer(buf, count, ppos, nodeinfo, - sizeof nodeinfo); -} - -static const struct file_operations atomic_node_info_ops = { - .read = atomic_node_info_read, -}; - -static ssize_t atomic_port_info_read(struct file *file, char __user *buf, -size_t count, loff_t *ppos) -{ - u32 portinfo[13]; - u32 tmp, tmp2; - 
struct ipath_devdata *dd; - - dd = file-f_path.dentry-d_inode-i_private; - - /* so we only initialize non-zero fields. */ - memset(portinfo, 0, sizeof portinfo); - - /* -* Notimpl yet M_Key (64) -* Notimpl yet GID (64) -*/ - - portinfo[4] = (dd-ipath_lid 16); - - /* -* Notimpl yet SMLID. -* CapabilityMask is 0, we don't support any of these -* DiagCode is 0; we don't store any diag info for now Notimpl yet -* M_KeyLeasePeriod (we don't support M_Key) -*/ - - /* LocalPortNum is whichever port number they ask for */ - portinfo[7] = (dd-ipath_unit 24) - /* LinkWidthEnabled */ - | (2 16) - /* LinkWidthSupported (really 2, but not IB valid) */ - | (3 8) - /* LinkWidthActive */ - | (2 0); - tmp = dd-ipath_lastibcstat IPATH_IBSTATE_MASK; - tmp2 = 5; - if (tmp == IPATH_IBSTATE_INIT) - tmp = 2; - else if (tmp == IPATH_IBSTATE_ARM) - tmp = 3; - else if (tmp == IPATH_IBSTATE_ACTIVE) - tmp = 4; - else { - tmp = 0;/* down */ - tmp2 = tmp 0xf; - } - - portinfo[8] = (1 28) /* LinkSpeedSupported */ - | (tmp 24) /* PortState */ - | (tmp2 20) /* PortPhysicalState */ - | (2 16) - - /* LinkDownDefaultState */ - /* M_KeyProtectBits == 0 */ - /* NotImpl yet LMC == 0 (we can support all values) */ - | (1 4) /* LinkSpeedActive */ - | (1 0); /* LinkSpeedEnabled */ - switch (dd-ipath_ibmtu) { - case 4096: - tmp = 5; - break; - case 2048: - tmp = 4; - break; - case 1024: - tmp = 3; - break; - case 512: - tmp = 2; - break; - case 256: - tmp = 1; - break; - default:/* oops, something is wrong */ - ipath_dbg(Problem, ipath_ibmtu 0x%x not a valid IB MTU, - treat as 2048\n, dd-ipath_ibmtu); - tmp = 4; - break; - } - portinfo[9] = (tmp 28) - /* NeighborMTU */ - /* Notimpl MasterSMSL */ - | (1 20) - - /* VLCap */ - /* Notimpl InitType (actually, an SMA decision) */
[ofa-general] [PATCH 06/23] IB/ipath - correctly describe workaround for TID write chip bug
From: Dave Olson [EMAIL PROTECTED]

This is a comment change only, correcting the comment to match the
implemented workaround rather than the original workaround, and
clarifying why it's needed.

Signed-off-by: Dave Olson [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_iba6120.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c
index a324c6f..d43f0b3 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c
@@ -1143,11 +1143,14 @@ static void ipath_pe_put_tid(struct ipath_devdata *dd, u64 __iomem *tidptr,
 		pa |= 2 << 29;
 	}

-	/* workaround chip bug 9437 by writing each TID twice
-	 * and holding a spinlock around the writes, so they don't
-	 * intermix with other TID (eager or expected) writes
-	 * Unfortunately, this call can be done from interrupt level
-	 * for the port 0 eager TIDs, so we have to use irqsave
+	/*
+	 * Workaround chip bug 9437 by writing the scratch register
+	 * before and after the TID, and with an io write barrier.
+	 * We use a spinlock around the writes, so they can't intermix
+	 * with other TID (eager or expected) writes (the chip bug
+	 * is triggered by back to back TID writes).  Unfortunately, this
+	 * call can be done from interrupt level for the port 0 eager TIDs,
+	 * so we have to use irqsave locks.
 	 */
 	spin_lock_irqsave(&dd->ipath_tid_lock, flags);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_scratch, 0xfeeddeaf);
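The bracketing discipline the corrected comment describes -- never letting the chip see two back-to-back TID writes, by inserting a scratch-register write before and after each one -- can be modeled with a toy write log. The register names and log here are hypothetical; the real code uses ipath_write_kreg()/writeq() with an I/O barrier under ipath_tid_lock.

```c
/*
 * Toy model of the bug-9437 workaround: record the target of each
 * "register" write and check that no two TID writes are adjacent.
 */
enum reg { REG_SCRATCH, REG_TID };

#define LOG_MAX 16
static enum reg write_log[LOG_MAX];
static int nwrites;

static void chip_write(enum reg r)
{
	if (nwrites < LOG_MAX)
		write_log[nwrites] = r;
	nwrites++;
	/* an I/O write barrier would go here to keep device ordering */
}

/* Write one TID entry, bracketed by scratch writes. */
static void put_tid(void)
{
	chip_write(REG_SCRATCH);
	chip_write(REG_TID);
	chip_write(REG_SCRATCH);
}

/* The invariant: no two adjacent log entries are both TID writes. */
static int log_ok(void)
{
	for (int i = 1; i < nwrites && i < LOG_MAX; i++)
		if (write_log[i] == REG_TID && write_log[i - 1] == REG_TID)
			return 0;
	return 1;
}
```

Even when put_tid() is called repeatedly, the trailing scratch write of one call separates its TID write from the next call's, which is exactly why the spinlock must cover the whole bracket.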
[ofa-general] [PATCH 07/23] IB/ipath - UC RDMA WRITE with IMMEDIATE doesn't send the immediate
From: Ralph Campbell [EMAIL PROTECTED]

This patch fixes a bug in the receive processing for UC RDMA WRITE with
immediate which caused the last packet to be dropped.

Signed-off-by: Ralph Campbell [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_uc.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c
index 767beb9..2dd8de2 100644
--- a/drivers/infiniband/hw/ipath/ipath_uc.c
+++ b/drivers/infiniband/hw/ipath/ipath_uc.c
@@ -464,6 +464,16 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,

 	case OP(RDMA_WRITE_LAST_WITH_IMMEDIATE):
 	rdma_last_imm:
+		if (header_in_data) {
+			wc.imm_data = *(__be32 *) data;
+			data += sizeof(__be32);
+		} else {
+			/* Immediate data comes after BTH */
+			wc.imm_data = ohdr->u.imm_data;
+		}
+		hdrsize += 4;
+		wc.wc_flags = IB_WC_WITH_IMM;
+
 		/* Get the number of bytes the message was padded by. */
 		pad = (be32_to_cpu(ohdr->bth[0]) >> 20) & 3;
 		/* Check for invalid length. */
@@ -484,16 +494,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 			dev->n_pkt_drops++;
 			goto done;
 		}
-		if (header_in_data) {
-			wc.imm_data = *(__be32 *) data;
-			data += sizeof(__be32);
-		} else {
-			/* Immediate data comes after BTH */
-			wc.imm_data = ohdr->u.imm_data;
-		}
-		hdrsize += 4;
-		wc.wc_flags = IB_WC_WITH_IMM;
-		wc.byte_len = 0;
+		wc.byte_len = qp->r_len;
 		goto last_imm;

 	case OP(RDMA_WRITE_LAST):
[ofa-general] [PATCH 08/23] IB/ipath - future proof eeprom checksum code (contents reading)
From: Dave Olson [EMAIL PROTECTED]

In an earlier change, the amount of data read from the flash was
mistakenly limited to the size known to the current driver.  This
causes problems when the length is increased and written with the
new, longer version; the checksum would fail because not enough data
was read.  Always read the full 128 byte length to prevent this.

Signed-off-by: Dave Olson [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_eeprom.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_eeprom.c b/drivers/infiniband/hw/ipath/ipath_eeprom.c
index b4503e9..bcfa3cc 100644
--- a/drivers/infiniband/hw/ipath/ipath_eeprom.c
+++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c
@@ -596,7 +596,11 @@ void ipath_get_eeprom_info(struct ipath_devdata *dd)
 		goto bail;
 	}

-	len = offsetof(struct ipath_flash, if_future);
+	/*
+	 * read full flash, not just currently used part, since it may have
+	 * been written with a newer definition
+	 */
+	len = sizeof(struct ipath_flash);
 	buf = vmalloc(len);
 	if (!buf) {
 		ipath_dev_err(dd, "Couldn't allocate memory to read %u
@@ -737,8 +741,10 @@ int ipath_update_eeprom_log(struct ipath_devdata *dd)
 	/*
 	 * The quick-check above determined that there is something worthy
 	 * of logging, so get current contents and do a more detailed idea.
+	 * read full flash, not just currently used part, since it may have
+	 * been written with a newer definition
 	 */
-	len = offsetof(struct ipath_flash, if_future);
+	len = sizeof(struct ipath_flash);
 	buf = vmalloc(len);
 	ret = 1;
 	if (!buf) {
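Why reading only the driver-known prefix breaks the checksum can be shown with a toy 128-byte image. The checksum scheme below (one byte covering all other bytes) is a simplification for illustration, not necessarily the exact scheme ipath uses: if a newer tool wrote valid data past the fields this driver knows about, a checksum computed over only the prefix no longer matches.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Toy flash image: byte 0 holds a checksum over bytes 1..len-1.
 * Verifying against only a known prefix of a longer image fails even
 * though the image itself is intact -- the bug the patch fixes.
 */
#define FLASH_LEN 128

static uint8_t flash_csum(const uint8_t *buf, size_t len)
{
	uint8_t sum = 0;

	for (size_t i = 1; i < len; i++)	/* skip the checksum byte */
		sum += buf[i];
	return sum;
}
```

Filling the image with a known pattern and setting byte 0 from the full-length sum, the full-length verification succeeds while a 64-byte prefix verification does not -- hence the change to always read sizeof(struct ipath_flash).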
[ofa-general] [PATCH 09/23] IB/ipath - Remove redundant code
From: Ralph Campbell [EMAIL PROTECTED]

This patch removes some redundant initialization code.

Signed-off-by: Ralph Campbell [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_driver.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 8fa2bb5..e5d058a 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -381,8 +381,6 @@ static int __devinit ipath_init_one(struct pci_dev *pdev,

 	ipath_cdbg(VERBOSE, "initializing unit #%u\n", dd->ipath_unit);

-	read_bars(dd, pdev, &bar0, &bar1);
-
 	ret = pci_enable_device(pdev);
 	if (ret) {
 		/* This can happen iff:
@@ -528,9 +526,6 @@ static int __devinit ipath_init_one(struct pci_dev *pdev,
 		goto bail_regions;
 	}

-	dd->ipath_deviceid = ent->device;	/* save for later use */
-	dd->ipath_vendorid = ent->vendor;
-	dd->ipath_pcirev = pdev->revision;

 #if defined(__powerpc__)
[ofa-general] [PATCH 12/23] IB/ipath - optimize completion queue entry insertion and polling
From: Ralph Campbell [EMAIL PROTECTED] The code to add an entry to the completion queue stored the QPN which is needed for the user level verbs view of the completion queue entry but the kernel struct ib_wc contains a pointer to the QP instead of a QPN. When the kernel polled for a completion queue entry, the QPN was lookup up and the QP pointer recovered. This patch stores the CQE differently based on whether the CQ is a kernel CQ or a user CQ thus avoiding the QPN to QP lookup overhead. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_cq.c| 94 +++-- drivers/infiniband/hw/ipath/ipath_verbs.h |6 ++ 2 files changed, 53 insertions(+), 47 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index a6f04d2..645ed71 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -76,22 +76,25 @@ void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited) } return; } - wc-queue[head].wr_id = entry-wr_id; - wc-queue[head].status = entry-status; - wc-queue[head].opcode = entry-opcode; - wc-queue[head].vendor_err = entry-vendor_err; - wc-queue[head].byte_len = entry-byte_len; - wc-queue[head].imm_data = (__u32 __force)entry-imm_data; - wc-queue[head].qp_num = entry-qp-qp_num; - wc-queue[head].src_qp = entry-src_qp; - wc-queue[head].wc_flags = entry-wc_flags; - wc-queue[head].pkey_index = entry-pkey_index; - wc-queue[head].slid = entry-slid; - wc-queue[head].sl = entry-sl; - wc-queue[head].dlid_path_bits = entry-dlid_path_bits; - wc-queue[head].port_num = entry-port_num; - /* Make sure queue entry is written before the head index. 
*/ - smp_wmb(); + if (cq-ip) { + wc-uqueue[head].wr_id = entry-wr_id; + wc-uqueue[head].status = entry-status; + wc-uqueue[head].opcode = entry-opcode; + wc-uqueue[head].vendor_err = entry-vendor_err; + wc-uqueue[head].byte_len = entry-byte_len; + wc-uqueue[head].imm_data = (__u32 __force)entry-imm_data; + wc-uqueue[head].qp_num = entry-qp-qp_num; + wc-uqueue[head].src_qp = entry-src_qp; + wc-uqueue[head].wc_flags = entry-wc_flags; + wc-uqueue[head].pkey_index = entry-pkey_index; + wc-uqueue[head].slid = entry-slid; + wc-uqueue[head].sl = entry-sl; + wc-uqueue[head].dlid_path_bits = entry-dlid_path_bits; + wc-uqueue[head].port_num = entry-port_num; + /* Make sure entry is written before the head index. */ + smp_wmb(); + } else + wc-kqueue[head] = *entry; wc-head = next; if (cq-notify == IB_CQ_NEXT_COMP || @@ -130,6 +133,12 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) int npolled; u32 tail; + /* The kernel can only poll a kernel completion queue */ + if (cq-ip) { + npolled = -EINVAL; + goto bail; + } + spin_lock_irqsave(cq-lock, flags); wc = cq-queue; @@ -137,31 +146,10 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) if (tail (u32) cq-ibcq.cqe) tail = (u32) cq-ibcq.cqe; for (npolled = 0; npolled num_entries; ++npolled, ++entry) { - struct ipath_qp *qp; - if (tail == wc-head) break; - /* Make sure entry is read after head index is read. 
*/ - smp_rmb(); - qp = ipath_lookup_qpn(to_idev(cq-ibcq.device)-qp_table, - wc-queue[tail].qp_num); - entry-qp = qp-ibqp; - if (atomic_dec_and_test(qp-refcount)) - wake_up(qp-wait); - - entry-wr_id = wc-queue[tail].wr_id; - entry-status = wc-queue[tail].status; - entry-opcode = wc-queue[tail].opcode; - entry-vendor_err = wc-queue[tail].vendor_err; - entry-byte_len = wc-queue[tail].byte_len; - entry-imm_data = wc-queue[tail].imm_data; - entry-src_qp = wc-queue[tail].src_qp; - entry-wc_flags = wc-queue[tail].wc_flags; - entry-pkey_index = wc-queue[tail].pkey_index; - entry-slid = wc-queue[tail].slid; - entry-sl = wc-queue[tail].sl; - entry-dlid_path_bits = wc-queue[tail].dlid_path_bits; - entry-port_num = wc-queue[tail].port_num; + /* The kernel doesn't need a RMB since it has the lock. */ + *entry = wc-kqueue[tail]; if (tail = cq-ibcq.cqe) tail = 0; else @@
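The point of the patch above is to avoid a QPN-to-QP hash lookup (plus reference counting) on every kernel poll: when the CQ is not mapped into userspace, the full kernel work-completion entry, QP pointer included, is stored directly. A scaled-down model of the split, with hypothetical types rather than the real ipath structures:

```c
#include <assert.h>

/* Hypothetical scaled-down model of the kernel/user CQE split. */
struct user_cqe { unsigned wr_id; unsigned qp_num; }; /* user view: QPN only */
struct kern_cqe { unsigned wr_id; void *qp; };        /* kernel view: pointer */

struct cq {
	int user_mapped;            /* analogous to cq->ip being non-NULL */
	struct user_cqe uqueue[4];
	struct kern_cqe kqueue[4];
	unsigned head;
};

/* On entry, write whichever representation the consumer can use:
 * userspace cannot dereference a kernel pointer, so it gets the QPN;
 * the kernel keeps the pointer and skips the lookup at poll time. */
static void cq_enter(struct cq *cq, unsigned wr_id, unsigned qp_num, void *qp)
{
	if (cq->user_mapped) {
		cq->uqueue[cq->head].wr_id = wr_id;
		cq->uqueue[cq->head].qp_num = qp_num;
	} else {
		cq->kqueue[cq->head].wr_id = wr_id;
		cq->kqueue[cq->head].qp = qp;
	}
	cq->head = (cq->head + 1) % 4;
}
```

The cost of the extra array is paid back by removing `ipath_lookup_qpn()` and the refcount wake-up from the kernel poll path.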
[ofa-general] [PATCH 13/23] IB/ipath -- Add ability to set the LMC via the sysfs debugging interface
From: Ralph Campbell [EMAIL PROTECTED]

This patch adds the ability to set the LMC via a sysfs file as if the SM sent a SubnSet(PortInfo) MAD. It is useful for debugging when no SM is running.

Signed-off-by: Ralph Campbell [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_sysfs.c |   40 ++++++++++++++++++++++++++-
 1 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_sysfs.c b/drivers/infiniband/hw/ipath/ipath_sysfs.c
index 16238cd..e1ad7cf 100644
--- a/drivers/infiniband/hw/ipath/ipath_sysfs.c
+++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c
@@ -163,6 +163,42 @@ static ssize_t show_boardversion(struct device *dev,
 	return scnprintf(buf, PAGE_SIZE, "%s", dd->ipath_boardversion);
 }
 
+static ssize_t show_lmc(struct device *dev,
+			struct device_attribute *attr,
+			char *buf)
+{
+	struct ipath_devdata *dd = dev_get_drvdata(dev);
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_lmc);
+}
+
+static ssize_t store_lmc(struct device *dev,
+			 struct device_attribute *attr,
+			 const char *buf,
+			 size_t count)
+{
+	struct ipath_devdata *dd = dev_get_drvdata(dev);
+	u16 lmc = 0;
+	int ret;
+
+	ret = ipath_parse_ushort(buf, &lmc);
+	if (ret < 0)
+		goto invalid;
+
+	if (lmc > 7) {
+		ret = -EINVAL;
+		goto invalid;
+	}
+
+	ipath_set_lid(dd, dd->ipath_lid, lmc);
+
+	goto bail;
+invalid:
+	ipath_dev_err(dd, "attempt to set invalid LMC %u\n", lmc);
+bail:
+	return ret;
+}
+
 static ssize_t show_lid(struct device *dev,
 			struct device_attribute *attr,
 			char *buf)
@@ -190,7 +226,7 @@ static ssize_t store_lid(struct device *dev,
 		goto invalid;
 	}
 
-	ipath_set_lid(dd, lid, 0);
+	ipath_set_lid(dd, lid, dd->ipath_lmc);
 
 	goto bail;
 invalid:
@@ -648,6 +684,7 @@ static struct attribute_group driver_attr_group = {
 };
 
 static DEVICE_ATTR(guid, S_IWUSR | S_IRUGO, show_guid, store_guid);
+static DEVICE_ATTR(lmc, S_IWUSR | S_IRUGO, show_lmc, store_lmc);
 static DEVICE_ATTR(lid, S_IWUSR | S_IRUGO, show_lid, store_lid);
 static DEVICE_ATTR(link_state, S_IWUSR, NULL, store_link_state);
 static DEVICE_ATTR(mlid, S_IWUSR | S_IRUGO, show_mlid, store_mlid);
@@ -667,6 +704,7 @@ static DEVICE_ATTR(logged_errors, S_IRUGO, show_logged_errs, NULL);
 
 static struct attribute *dev_attributes[] = {
 	&dev_attr_guid.attr,
+	&dev_attr_lmc.attr,
 	&dev_attr_lid.attr,
 	&dev_attr_link_state.attr,
 	&dev_attr_mlid.attr,
[ofa-general] [PATCH 14/23] IB/ipath - remove duplicate copy of LMC
From: Ralph Campbell [EMAIL PROTECTED] The LMC value was being saved by the SMA in two places. This patch cleans it up so only one copy is kept. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_mad.c | 39 - drivers/infiniband/hw/ipath/ipath_ud.c| 10 --- drivers/infiniband/hw/ipath/ipath_verbs.c |4 +-- drivers/infiniband/hw/ipath/ipath_verbs.h |2 + 4 files changed, 29 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c index d61c030..8f15216 100644 --- a/drivers/infiniband/hw/ipath/ipath_mad.c +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -245,7 +245,7 @@ static int recv_subn_get_portinfo(struct ib_smp *smp, /* Only return the mkey if the protection field allows it. */ if (smp-method == IB_MGMT_METHOD_SET || dev-mkey == smp-mkey || - (dev-mkeyprot_resv_lmc 6) == 0) + dev-mkeyprot == 0) pip-mkey = dev-mkey; pip-gid_prefix = dev-gid_prefix; lid = dev-dd-ipath_lid; @@ -264,7 +264,7 @@ static int recv_subn_get_portinfo(struct ib_smp *smp, pip-portphysstate_linkdown = (ipath_cvt_physportstate[ibcstat 0xf] 4) | (get_linkdowndefaultstate(dev-dd) ? 
1 : 2); - pip-mkeyprot_resv_lmc = dev-mkeyprot_resv_lmc; + pip-mkeyprot_resv_lmc = (dev-mkeyprot 6) | dev-dd-ipath_lmc; pip-linkspeedactive_enabled = 0x11;/* 2.5Gbps, 2.5Gbps */ switch (dev-dd-ipath_ibmtu) { case 4096: @@ -401,6 +401,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, struct ib_port_info *pip = (struct ib_port_info *)smp-data; struct ib_event event; struct ipath_ibdev *dev; + struct ipath_devdata *dd; u32 flags; char clientrereg = 0; u16 lid, smlid; @@ -415,6 +416,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, goto err; dev = to_idev(ibdev); + dd = dev-dd; event.device = ibdev; event.element.port_num = port; @@ -423,11 +425,12 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, dev-mkey_lease_period = be16_to_cpu(pip-mkey_lease_period); lid = be16_to_cpu(pip-lid); - if (lid != dev-dd-ipath_lid) { + if (dd-ipath_lid != lid || + dd-ipath_lmc != (pip-mkeyprot_resv_lmc 7)) { /* Must be a valid unicast LID address. */ if (lid == 0 || lid = IPATH_MULTICAST_LID_BASE) goto err; - ipath_set_lid(dev-dd, lid, pip-mkeyprot_resv_lmc 7); + ipath_set_lid(dd, lid, pip-mkeyprot_resv_lmc 7); event.event = IB_EVENT_LID_CHANGE; ib_dispatch_event(event); } @@ -461,18 +464,18 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, case 0: /* NOP */ break; case 1: /* SLEEP */ - if (set_linkdowndefaultstate(dev-dd, 1)) + if (set_linkdowndefaultstate(dd, 1)) goto err; break; case 2: /* POLL */ - if (set_linkdowndefaultstate(dev-dd, 0)) + if (set_linkdowndefaultstate(dd, 0)) goto err; break; default: goto err; } - dev-mkeyprot_resv_lmc = pip-mkeyprot_resv_lmc; + dev-mkeyprot = pip-mkeyprot_resv_lmc 6; dev-vl_high_limit = pip-vl_high_limit; switch ((pip-neighbormtu_mastersmsl 4) 0xF) { @@ -495,7 +498,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, /* XXX We have already partially updated our state! 
*/ goto err; } - ipath_set_mtu(dev-dd, mtu); + ipath_set_mtu(dd, mtu); dev-sm_sl = pip-neighbormtu_mastersmsl 0xF; @@ -511,16 +514,16 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, * later. */ if (pip-pkey_violations == 0) - dev-z_pkey_violations = ipath_get_cr_errpkey(dev-dd); + dev-z_pkey_violations = ipath_get_cr_errpkey(dd); if (pip-qkey_violations == 0) dev-qkey_violations = 0; ore = pip-localphyerrors_overrunerrors; - if (set_phyerrthreshold(dev-dd, (ore 4) 0xF)) + if (set_phyerrthreshold(dd, (ore 4) 0xF)) goto err; - if (set_overrunthreshold(dev-dd, (ore 0xF))) + if (set_overrunthreshold(dd, (ore 0xF))) goto err; dev-subnet_timeout = pip-clientrereg_resv_subnetto 0x1F; @@ -538,7 +541,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, * is down or is being set to down. */ state = pip-linkspeed_portstate 0xF; - flags = dev-dd-ipath_flags; + flags = dd-ipath_flags; lstate = (pip-portphysstate_linkdown 4) 0xF; if (lstate
[ofa-general] [PATCH 20/23] IB/ipath - better handling of unexpected GPIO interrupts
From: Michael Albaugh [EMAIL PROTECTED]

The General Purpose I/O pins can be configured to cause interrupts. At the end of the interrupt code dealing with all known causes, a message is output if any bits remain un-handled. Since this is a "can't happen" scenario, it should only be triggered by bugs elsewhere. It is harmless, and potentially beneficial, to limit the damage by masking any such unexpected interrupts.

This patch adds disabling of interrupts from any pins that should not have been allowed to interrupt, in addition to emitting a message.

Signed-off-by: Michael Albaugh [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_intr.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 61eac8c..801a20d 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -1124,10 +1124,8 @@ irqreturn_t ipath_intr(int irq, void *data)
 		/*
 		 * Some unexpected bits remain. If they could have
 		 * caused the interrupt, complain and clear.
-		 * MEA: this is almost certainly non-ideal.
-		 * we should look into auto-disable of unexpected
-		 * GPIO interrupts, possibly on a "three strikes"
-		 * basis.
+		 * To avoid repetition of this condition, also clear
+		 * the mask. It is almost certainly due to error.
 		 */
 		const u32 mask = (u32) dd->ipath_gpio_mask;
 
@@ -1135,6 +1133,10 @@ irqreturn_t ipath_intr(int irq, void *data)
 			ipath_dbg("Unexpected GPIO IRQ bits %x\n",
 				  gpiostatus & mask);
 			to_clear |= (gpiostatus & mask);
+			dd->ipath_gpio_mask &= ~(gpiostatus & mask);
+			ipath_write_kreg(dd,
+					 dd->ipath_kregs->kr_gpio_mask,
+					 dd->ipath_gpio_mask);
 		}
 	}
 	if (to_clear) {
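The change above follows a common pattern for spurious interrupt sources: acknowledge the stray bits once, then clear their enable bits so they can never fire again. A small self-contained sketch of the mask arithmetic (the names are illustrative, not the ipath registers):

```c
#include <assert.h>
#include <stdint.h>

/* Given the current interrupt-enable mask and a status word that still
 * has unhandled bits set, compute the bits to acknowledge once and
 * return the new enable mask with those sources disabled for good. */
static uint32_t quiesce_unexpected(uint32_t enable_mask, uint32_t status,
				   uint32_t *to_clear)
{
	uint32_t unexpected = status & enable_mask;

	*to_clear = unexpected;           /* ack them this one time */
	return enable_mask & ~unexpected; /* never take them again */
}
```

This trades a small loss of debuggability (the pin goes silent) for protection against an interrupt storm from a "can't happen" source.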
[ofa-general] [PATCH 21/23] IB/ipath - fix IB_EVENT_PORT_ERR event
From: Ralph Campbell [EMAIL PROTECTED] The link state event calls were being generated when the SM told the SMA to change link states. This works for IB_EVENT_PORT_ACTIVE but not if the link goes down and stays down. The fix is to generate event calls from the interrupt handler when the HW link state changes. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_driver.c |2 ++ drivers/infiniband/hw/ipath/ipath_intr.c | 17 + drivers/infiniband/hw/ipath/ipath_kernel.h |2 ++ drivers/infiniband/hw/ipath/ipath_mad.c| 10 -- drivers/infiniband/hw/ipath/ipath_verbs.c | 12 ++-- 5 files changed, 31 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index e5d058a..799fac2 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -2085,6 +2085,8 @@ void ipath_shutdown_device(struct ipath_devdata *dd) INFINIPATH_IBCC_LINKINITCMD_SHIFT); ipath_cancel_sends(dd, 0); + signal_ib_event(dd, IB_EVENT_PORT_ERR); + /* disable IBC */ dd-ipath_control = ~INFINIPATH_C_LINKENABLE; ipath_write_kreg(dd, dd-ipath_kregs-kr_control, diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 801a20d..6a5dd5c 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -275,6 +275,16 @@ static char *ib_linkstate(u32 linkstate) return ret; } +void signal_ib_event(struct ipath_devdata *dd, enum ib_event_type ev) +{ + struct ib_event event; + + event.device = dd-verbs_dev-ibdev; + event.element.port_num = 1; + event.event = ev; + ib_dispatch_event(event); +} + static void handle_e_ibstatuschanged(struct ipath_devdata *dd, ipath_err_t errs, int noprint) { @@ -373,6 +383,8 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd, dd-ipath_ibpollcnt = 0;/* some state other than 2 or 3 */ ipath_stats.sps_iblink++; if (ltstate != 
INFINIPATH_IBCS_LT_STATE_LINKUP) { + if (dd-ipath_flags IPATH_LINKACTIVE) + signal_ib_event(dd, IB_EVENT_PORT_ERR); dd-ipath_flags |= IPATH_LINKDOWN; dd-ipath_flags = ~(IPATH_LINKUNK | IPATH_LINKINIT | IPATH_LINKACTIVE | @@ -405,7 +417,10 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd, *dd-ipath_statusp |= IPATH_STATUS_IB_READY | IPATH_STATUS_IB_CONF; dd-ipath_f_setextled(dd, lstate, ltstate); + signal_ib_event(dd, IB_EVENT_PORT_ACTIVE); } else if ((val IPATH_IBSTATE_MASK) == IPATH_IBSTATE_INIT) { + if (dd-ipath_flags IPATH_LINKACTIVE) + signal_ib_event(dd, IB_EVENT_PORT_ERR); /* * set INIT and DOWN. Down is checked by most of the other * code, but INIT is useful to know in a few places. @@ -418,6 +433,8 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd, | IPATH_STATUS_IB_READY); dd-ipath_f_setextled(dd, lstate, ltstate); } else if ((val IPATH_IBSTATE_MASK) == IPATH_IBSTATE_ARM) { + if (dd-ipath_flags IPATH_LINKACTIVE) + signal_ib_event(dd, IB_EVENT_PORT_ERR); dd-ipath_flags |= IPATH_LINKARMED; dd-ipath_flags = ~(IPATH_LINKUNK | IPATH_LINKDOWN | IPATH_LINKINIT | diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 872fb36..8786dd7 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -42,6 +42,7 @@ #include linux/pci.h #include linux/dma-mapping.h #include asm/io.h +#include rdma/ib_verbs.h #include ipath_common.h #include ipath_debug.h @@ -775,6 +776,7 @@ void ipath_get_eeprom_info(struct ipath_devdata *); int ipath_update_eeprom_log(struct ipath_devdata *dd); void ipath_inc_eeprom_err(struct ipath_devdata *dd, u32 eidx, u32 incr); u64 ipath_snap_cntr(struct ipath_devdata *, ipath_creg); +void signal_ib_event(struct ipath_devdata *dd, enum ib_event_type ev); /* * Set LED override, only the two LSBs have public meaning, but diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c index 
8f15216..0ae3a7c 100644 --- a/drivers/infiniband/hw/ipath/ipath_mad.c +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -570,26 +570,16 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, else goto err; ipath_set_linkstate(dd, lstate); - if (flags
[ofa-general] [PATCH 22/23] IB/ipath - remove redundant link state checks
From: Ralph Campbell [EMAIL PROTECTED]

This patch removes some redundant checks when the SMA changes the link state since the same checks are made in the lower level function that sets the state.

Signed-off-by: Ralph Campbell [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_mad.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c
index 0ae3a7c..3d1432d 100644
--- a/drivers/infiniband/hw/ipath/ipath_mad.c
+++ b/drivers/infiniband/hw/ipath/ipath_mad.c
@@ -402,7 +402,6 @@ static int recv_subn_set_portinfo(struct ib_smp *smp,
 	struct ib_event event;
 	struct ipath_ibdev *dev;
 	struct ipath_devdata *dd;
-	u32 flags;
 	char clientrereg = 0;
 	u16 lid, smlid;
 	u8 lwe;
@@ -541,7 +540,6 @@ static int recv_subn_set_portinfo(struct ib_smp *smp,
 	 * is down or is being set to down.
 	 */
 	state = pip->linkspeed_portstate & 0xF;
-	flags = dd->ipath_flags;
 	lstate = (pip->portphysstate_linkdown >> 4) & 0xF;
 	if (lstate && !(state == IB_PORT_DOWN || state == IB_PORT_NOP))
 		goto err;
@@ -572,13 +570,9 @@ static int recv_subn_set_portinfo(struct ib_smp *smp,
 		ipath_set_linkstate(dd, lstate);
 		break;
 	case IB_PORT_ARMED:
-		if (!(flags & (IPATH_LINKINIT | IPATH_LINKACTIVE)))
-			break;
 		ipath_set_linkstate(dd, IPATH_IB_LINKARM);
 		break;
 	case IB_PORT_ACTIVE:
-		if (!(flags & IPATH_LINKARMED))
-			break;
 		ipath_set_linkstate(dd, IPATH_IB_LINKACTIVE);
 		break;
 	default:
[ofa-general] [PATCH 23/23] IB/ipath -- Minor fix to ordering of freeing and zeroing of tid pages.
From: Dave Olson [EMAIL PROTECTED]

Fixed to be the same as everywhere else: copy and then zero the page pointer in the array first, and then pass the copy to the VM routines.

Signed-off-by: Dave Olson [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_file_ops.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
index 016e7c4..5de3243 100644
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -538,6 +538,9 @@ static int ipath_tid_free(struct ipath_portdata *pd, unsigned subport,
 			continue;
 		cnt++;
 		if (dd->ipath_pageshadow[porttid + tid]) {
+			struct page *p;
+			p = dd->ipath_pageshadow[porttid + tid];
+			dd->ipath_pageshadow[porttid + tid] = NULL;
 			ipath_cdbg(VERBOSE, "PID %u freeing TID %u\n",
 				   pd->port_pid, tid);
 			dd->ipath_f_put_tid(dd, &tidbase[tid],
@@ -546,9 +549,7 @@ static int ipath_tid_free(struct ipath_portdata *pd, unsigned subport,
 			pci_unmap_page(dd->pcidev,
 				dd->ipath_physshadow[porttid + tid],
 				PAGE_SIZE, PCI_DMA_FROMDEVICE);
-			ipath_release_user_pages(
-				&dd->ipath_pageshadow[porttid + tid], 1);
-			dd->ipath_pageshadow[porttid + tid] = NULL;
+			ipath_release_user_pages(&p, 1);
 			ipath_stats.sps_pageunlocks++;
 		} else
 			ipath_dbg("Unused tid %u, ignoring\n", tid);
RE: [ofa-general] SDP ?
That should work fine. You might be able to build with -D_XPG4_2 as well.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Langston
Sent: Tuesday, October 09, 2007 10:13 AM
To: general@lists.openfabrics.org
Subject: [ofa-general] SDP ?

Hi all,

I'm working on porting SDP to OpenSolaris and am looking at a compile error that I get. Essentially, I have a conflict of types on the compile:

bash-3.00$ /opt/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I.. -g -D_POSIX_PTHREAD_SEMANTICS -DSYSCONFDIR=\"/usr/local/etc\" -g -D_POSIX_PTHREAD_SEMANTICS -c port.c -KPIC -DPIC -o .libs/port.o
"port.c", line 1896: identifier redeclared: getsockname
	current : function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to unsigned int) returning int
	previous: function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to void) returning int : "/usr/include/sys/socket.h", line 436

Line 436 in /usr/include/sys/socket.h:

extern int getsockname(int, struct sockaddr *_RESTRICT_KYWD, Psocklen_t);

and Psocklen_t:

#if defined(_XPG4_2) || defined(_BOOT)
typedef socklen_t *_RESTRICT_KYWD Psocklen_t;
#else
typedef void *_RESTRICT_KYWD Psocklen_t;
#endif	/* defined(_XPG4_2) || defined(_BOOT) */

Do I need to change the getsockname declaration in port.c to take a void * ?

Thanks,

Jim
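For context on the suggested -D_XPG4_2 fix: Solaris switches the third parameter of getsockname() between `void *` and `socklen_t *` depending on the feature-test macro, while POSIX specifies `socklen_t *`. Call sites that always pass a `socklen_t *` compile under either regime; a hedged sketch (hypothetical helper, not from the SDP port.c):

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read back the port a socket is bound to. Passing a socklen_t *
 * (never an int * or a bare void *) matches both the XPG4.2
 * prototype (socklen_t *) and the default Solaris one (void *). */
static int bound_port(int fd)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);	/* socklen_t, per POSIX */

	if (getsockname(fd, (struct sockaddr *)&sin, &len) != 0)
		return -1;
	return ntohs(sin.sin_port);
}
```

Redeclaring the function inside port.c with yet another signature is what triggers the clash, so dropping the local declaration and relying on `<sys/socket.h>` is usually the cleaner route.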
[ofa-general] [PATCH 15/23] IB/ipath - use counters in ipath_poll and cleanup interrupts in ipath_close
ipath_poll() suffered from a couple subtle bugs. Under the right conditions we could leave recv interrupts enabled on an ipath user context on close, thereby taking potentially unwanted interrupts on the next open -- this is fixed by unconditionally turning off recv interrupts on close. Also, we now use counters rather than set/clear bits which allows us to make sure we catch all interrupts at the cost of changing the semantics slightly (it's now give me all events since the last time I called poll() rather than give me all events since I called _this_ poll routine). We also added some memory barriers which may help ensure we get all notifications in a timely manner. Signed-off-by: Arthur Jones [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_file_ops.c | 67 -- drivers/infiniband/hw/ipath/ipath_intr.c | 33 - drivers/infiniband/hw/ipath/ipath_kernel.h |8 ++- 3 files changed, 57 insertions(+), 51 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c index 33ab0d6..016e7c4 100644 --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c @@ -1341,6 +1341,19 @@ bail: return ret; } +static unsigned ipath_poll_hdrqfull(struct ipath_portdata *pd) +{ + unsigned pollflag = 0; + + if ((pd-poll_type IPATH_POLL_TYPE_OVERFLOW) + pd-port_hdrqfull != pd-port_hdrqfull_poll) { + pollflag |= POLLIN | POLLRDNORM; + pd-port_hdrqfull_poll = pd-port_hdrqfull; + } + + return pollflag; +} + static unsigned int ipath_poll_urgent(struct ipath_portdata *pd, struct file *fp, struct poll_table_struct *pt) @@ -1350,22 +1363,20 @@ static unsigned int ipath_poll_urgent(struct ipath_portdata *pd, dd = pd-port_dd; - if (test_bit(IPATH_PORT_WAITING_OVERFLOW, pd-int_flag)) { - pollflag |= POLLERR; - clear_bit(IPATH_PORT_WAITING_OVERFLOW, pd-int_flag); - } + /* variable access in ipath_poll_hdrqfull() needs this */ + rmb(); + pollflag = ipath_poll_hdrqfull(pd); - if 
(test_bit(IPATH_PORT_WAITING_URG, pd-int_flag)) { + if (pd-port_urgent != pd-port_urgent_poll) { pollflag |= POLLIN | POLLRDNORM; - clear_bit(IPATH_PORT_WAITING_URG, pd-int_flag); + pd-port_urgent_poll = pd-port_urgent; } if (!pollflag) { + /* this saves a spin_lock/unlock in interrupt handler... */ set_bit(IPATH_PORT_WAITING_URG, pd-port_flag); - if (pd-poll_type IPATH_POLL_TYPE_OVERFLOW) - set_bit(IPATH_PORT_WAITING_OVERFLOW, - pd-port_flag); - + /* flush waiting flag so don't miss an event... */ + wmb(); poll_wait(fp, pd-port_wait, pt); } @@ -1376,31 +1387,27 @@ static unsigned int ipath_poll_next(struct ipath_portdata *pd, struct file *fp, struct poll_table_struct *pt) { - u32 head, tail; + u32 head; + u32 tail; unsigned pollflag = 0; struct ipath_devdata *dd; dd = pd-port_dd; + /* variable access in ipath_poll_hdrqfull() needs this */ + rmb(); + pollflag = ipath_poll_hdrqfull(pd); + head = ipath_read_ureg32(dd, ur_rcvhdrhead, pd-port_port); tail = *(volatile u64 *)pd-port_rcvhdrtail_kvaddr; - if (test_bit(IPATH_PORT_WAITING_OVERFLOW, pd-int_flag)) { - pollflag |= POLLERR; - clear_bit(IPATH_PORT_WAITING_OVERFLOW, pd-int_flag); - } - - if (tail != head || - test_bit(IPATH_PORT_WAITING_RCV, pd-int_flag)) { + if (head != tail) pollflag |= POLLIN | POLLRDNORM; - clear_bit(IPATH_PORT_WAITING_RCV, pd-int_flag); - } - - if (!pollflag) { + else { + /* this saves a spin_lock/unlock in interrupt handler */ set_bit(IPATH_PORT_WAITING_RCV, pd-port_flag); - if (pd-poll_type IPATH_POLL_TYPE_OVERFLOW) - set_bit(IPATH_PORT_WAITING_OVERFLOW, - pd-port_flag); + /* flush waiting flag so we don't miss an event */ + wmb(); set_bit(pd-port_port + INFINIPATH_R_INTRAVAIL_SHIFT, dd-ipath_rcvctrl); @@ -1917,6 +1924,12 @@ static int ipath_do_user_init(struct file *fp, ipath_cdbg(VERBOSE, Wrote port%d egrhead %x from tail regs\n, pd-port_port, head32); pd-port_tidcursor = 0; /* start at beginning after open */ + + /* initialize poll variables... 
*/ + pd-port_urgent = 0; + pd-port_urgent_poll = 0; +
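The counter scheme described above ("all events since the last time I called poll()") compares a live event counter against a per-consumer snapshot, instead of a sticky bit that a racing producer could have no way to re-assert. A minimal single-threaded model, with hypothetical names:

```c
#include <assert.h>

/* Producer side (e.g. an interrupt handler) only increments. */
struct evt {
	unsigned count;	/* total events ever fired */
	unsigned snap;	/* value of count at the consumer's last poll */
};

static void evt_fire(struct evt *e)
{
	e->count++;
}

/* Consumer: events are pending iff the counter moved since the last
 * snapshot; taking a new snapshot consumes them. In this single-
 * threaded model nothing is lost, because any increment advances
 * count and will be seen by a later poll. */
static int evt_pending(struct evt *e)
{
	if (e->count != e->snap) {
		e->snap = e->count;
		return 1;
	}
	return 0;
}
```

The kernel version pairs this with memory barriers (the wmb()/rmb() calls in the patch) so that counter updates and the waiting flag are seen in order across CPUs.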
Re: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses
Did we ever get any confirmation that this fixed the problem that Olaf saw?
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Jeff Garzik [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 08:44:25 -0400 David Miller wrote: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Tue, 9 Oct 2007 16:51:14 +0530 David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM: Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it. Great, this will make life easy. Any idea how long that would take? It seems simple enough to do. I'd say we can probably try to get rid of it in 2.6.25, this is assuming we get driver authors to cooperate and do the conversions or alternatively some other motivated person. I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :) Thanks, but it's probably going to need some corrections and/or an audit. If you unconditionally take those locks in the transmit function, there is probably an ABBA deadlock elsewhere in the driver now, most likely in the TX reclaim processing, and you therefore need to handle that too. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
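The ABBA deadlock David warns about arises when the converted xmit path takes the queue lock and then a private driver lock, while the TX-reclaim path (written for LLTX) takes the same two locks in the opposite order. The conventional cure is a single global acquisition order; a deliberately simplified toy model (flag "locks", names illustrative, nothing from the bonding or batching patches):

```c
#include <assert.h>
#include <stdint.h>

/* Toy lock: 0 = free, 1 = held. acquire() asserts it is free, which
 * is how a real ABBA would manifest as a hang rather than a trip. */
static void acquire(int *l) { assert(*l == 0); *l = 1; }
static void release(int *l) { assert(*l == 1); *l = 0; }

/* Always take the lower-addressed lock first, so no two code paths
 * can ever hold the pair in opposite orders. */
static void lock_pair(int *a, int *b)
{
	if ((uintptr_t)a > (uintptr_t)b) { int *t = a; a = b; b = t; }
	acquire(a);
	acquire(b);
}

static void unlock_pair(int *a, int *b)
{
	if ((uintptr_t)a > (uintptr_t)b) { int *t = a; a = b; b = t; }
	release(b);	/* reverse order on the way out */
	release(a);
}
```

In real drivers the fix is usually not address ordering but restructuring reclaim so it never needs the TX lock while holding its private lock; the invariant (one agreed order per lock pair) is the same.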
Re: [ofa-general] [PATCH] fix some ehca limits
I didn't see a response to my earlier email about the other uses of min_t(int, x, INT_MAX), so I fixed it up myself and added this to my tree. I don't have a working setup to test yet, so please let me know if you see anything wrong with this:

commit 919225e60a1a73e3518f257f040f74e9379a61c3
Author: Roland Dreier [EMAIL PROTECTED]
Date:   Tue Oct 9 13:17:42 2007 -0700

    IB/ehca: Fix clipping of device limits to INT_MAX

    Doing min_t(int, foo, INT_MAX) doesn't work correctly, because if foo
    is bigger than INT_MAX, then when treated as a signed integer, it
    will become negative and hence such an expression is just an
    elaborate NOP.

    Fix such cases in ehca to do min_t(unsigned, foo, INT_MAX) instead.
    This fixes negative reported values for max_cqe, max_pd and max_ah:

    Before:
        max_cqe: -64
        max_pd:  -1
        max_ah:  -1

    After:
        max_cqe: 2147483647
        max_pd:  2147483647
        max_ah:  2147483647

    Based on a bug report and fix from Anton Blanchard [EMAIL PROTECTED].

    Signed-off-by: Roland Dreier [EMAIL PROTECTED]

diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index 3436c49..4aa3ffa 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -82,17 +82,17 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 	props->vendor_id = rblock->vendor_id >> 8;
 	props->vendor_part_id = rblock->vendor_part_id >> 16;
 	props->hw_ver = rblock->hw_ver;
-	props->max_qp = min_t(int, rblock->max_qp, INT_MAX);
-	props->max_qp_wr = min_t(int, rblock->max_wqes_wq, INT_MAX);
-	props->max_sge = min_t(int, rblock->max_sge, INT_MAX);
-	props->max_sge_rd = min_t(int, rblock->max_sge_rd, INT_MAX);
-	props->max_cq = min_t(int, rblock->max_cq, INT_MAX);
-	props->max_cqe = min_t(int, rblock->max_cqe, INT_MAX);
-	props->max_mr = min_t(int, rblock->max_mr, INT_MAX);
-	props->max_mw = min_t(int, rblock->max_mw, INT_MAX);
-	props->max_pd = min_t(int, rblock->max_pd, INT_MAX);
-	props->max_ah = min_t(int, rblock->max_ah, INT_MAX);
-	props->max_fmr = min_t(int, rblock->max_mr, INT_MAX);
+	props->max_qp = min_t(unsigned, rblock->max_qp, INT_MAX);
+	props->max_qp_wr = min_t(unsigned, rblock->max_wqes_wq, INT_MAX);
+	props->max_sge = min_t(unsigned, rblock->max_sge, INT_MAX);
+	props->max_sge_rd = min_t(unsigned, rblock->max_sge_rd, INT_MAX);
+	props->max_cq = min_t(unsigned, rblock->max_cq, INT_MAX);
+	props->max_cqe = min_t(unsigned, rblock->max_cqe, INT_MAX);
+	props->max_mr = min_t(unsigned, rblock->max_mr, INT_MAX);
+	props->max_mw = min_t(unsigned, rblock->max_mw, INT_MAX);
+	props->max_pd = min_t(unsigned, rblock->max_pd, INT_MAX);
+	props->max_ah = min_t(unsigned, rblock->max_ah, INT_MAX);
+	props->max_fmr = min_t(unsigned, rblock->max_mr, INT_MAX);
 
 	if (EHCA_BMASK_GET(HCA_CAP_SRQ, shca->hca_cap)) {
 		props->max_srq = props->max_qp;
@@ -104,15 +104,15 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 	props->local_ca_ack_delay = rblock->local_ca_ack_delay;
 	props->max_raw_ipv6_qp
-		= min_t(int, rblock->max_raw_ipv6_qp, INT_MAX);
+		= min_t(unsigned, rblock->max_raw_ipv6_qp, INT_MAX);
 	props->max_raw_ethy_qp
-		= min_t(int, rblock->max_raw_ethy_qp, INT_MAX);
+		= min_t(unsigned, rblock->max_raw_ethy_qp, INT_MAX);
 	props->max_mcast_grp
-		= min_t(int, rblock->max_mcast_grp, INT_MAX);
+		= min_t(unsigned, rblock->max_mcast_grp, INT_MAX);
 	props->max_mcast_qp_attach
-		= min_t(int, rblock->max_mcast_qp_attach, INT_MAX);
+		= min_t(unsigned, rblock->max_mcast_qp_attach, INT_MAX);
 	props->max_total_mcast_qp_attach
-		= min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX);
+		= min_t(unsigned, rblock->max_total_mcast_qp_attach, INT_MAX);
 
 	/* translate device capabilities */
 	props->device_cap_flags = IB_DEVICE_SYS_IMAGE_GUID |
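The bug is easy to reproduce in userspace: with a signed type, min_t() first truncates the 64-bit device limit to int, so a large value wraps negative and then trivially "wins" the comparison, exactly the elaborate NOP the commit message describes. A reduction using a userspace rendition of the kernel's min_t() macro (GCC statement-expression form):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Userspace rendition of the kernel's min_t() macro. */
#define min_t(type, x, y) ({		\
	type __x = (x);			\
	type __y = (y);			\
	__x < __y ? __x : __y; })

/* 0xFFFFFFFF truncated to a (32-bit, two's-complement) int is -1,
 * which compares as smaller than INT_MAX, so no clamping happens. */
static int clamp_signed(uint64_t v)
{
	return min_t(int, v, INT_MAX);
}

/* Comparing as unsigned actually clamps oversized values. */
static int clamp_unsigned(uint64_t v)
{
	return min_t(unsigned, v, INT_MAX);
}
```

This is why only the first type argument of min_t() changed in the patch: the comparison type, not the destination field, determines whether the clamp works.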
[ofa-general] [PATCH 16/23] IB/ipath - iba6110 rev4 no longer needs recv header overrun workaround
iba6110 rev3 and earlier had a chip bug where the chip could overrun the recv header queue. rev4 fixed this chip bug so userspace no longer needs to work around it. Now we only set the workaround flag for older chip versions.

Signed-off-by: Arthur Jones [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_iba6110.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c
index e1c5998..d4940be 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6110.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c
@@ -1599,8 +1599,10 @@ static int ipath_ht_get_base_info(struct ipath_portdata *pd, void *kbase)
 {
 	struct ipath_base_info *kinfo = kbase;
 
-	kinfo->spi_runtime_flags |= IPATH_RUNTIME_HT |
-		IPATH_RUNTIME_RCVHDR_COPY;
+	kinfo->spi_runtime_flags |= IPATH_RUNTIME_HT;
+
+	if (pd->port_dd->ipath_minrev < 4)
+		kinfo->spi_runtime_flags |= IPATH_RUNTIME_RCVHDR_COPY;
 
 	return 0;
 }
[ofa-general] [PATCH 11/23] IB/ipath - implement IB_EVENT_QP_LAST_WQE_REACHED
From: Ralph Campbell [EMAIL PROTECTED] This patch implements the IB_EVENT_QP_LAST_WQE_REACHED event which is needed by ib_ipoib to destroy the QP when used in connected mode. Signed-off-by: Ralph Campbell [EMAIL PROTECTED] --- drivers/infiniband/hw/ipath/ipath_qp.c| 20 +--- drivers/infiniband/hw/ipath/ipath_rc.c| 12 +++- drivers/infiniband/hw/ipath/ipath_verbs.h |2 +- 3 files changed, 29 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index a8c4a6b..6a41fdb 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -377,13 +377,15 @@ static void ipath_reset_qp(struct ipath_qp *qp) * @err: the receive completion error to signal if a RWQE is active * * Flushes both send and receive work queues. + * Returns true if last WQE event should be generated. * The QP s_lock should be held and interrupts disabled. */ -void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) +int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) { struct ipath_ibdev *dev = to_idev(qp-ibqp.device); struct ib_wc wc; + int ret = 0; ipath_dbg(QP%d/%d in error state\n, qp-ibqp.qp_num, qp-remote_qpn); @@ -454,7 +456,10 @@ void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) wq-tail = tail; spin_unlock(qp-r_rq.lock); - } + } else if (qp-ibqp.event_handler) + ret = 1; + + return ret; } /** @@ -473,6 +478,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, struct ipath_qp *qp = to_iqp(ibqp); enum ib_qp_state cur_state, new_state; unsigned long flags; + int lastwqe = 0; int ret; spin_lock_irqsave(qp-s_lock, flags); @@ -532,7 +538,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, break; case IB_QPS_ERR: - ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); + lastwqe = ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); break; default: @@ -591,6 +597,14 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, qp-state = new_state; 
spin_unlock_irqrestore(qp-s_lock, flags); + if (lastwqe) { + struct ib_event ev; + + ev.device = qp-ibqp.device; + ev.element.qp = qp-ibqp; + ev.event = IB_EVENT_QP_LAST_WQE_REACHED; + qp-ibqp.event_handler(ev, qp-ibqp.qp_context); + } ret = 0; goto bail; diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 53259da..5c29b2b 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -1497,11 +1497,21 @@ send_ack: static void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err) { unsigned long flags; + int lastwqe; spin_lock_irqsave(qp-s_lock, flags); qp-state = IB_QPS_ERR; - ipath_error_qp(qp, err); + lastwqe = ipath_error_qp(qp, err); spin_unlock_irqrestore(qp-s_lock, flags); + + if (lastwqe) { + struct ib_event ev; + + ev.device = qp-ibqp.device; + ev.element.qp = qp-ibqp; + ev.event = IB_EVENT_QP_LAST_WQE_REACHED; + qp-ibqp.event_handler(ev, qp-ibqp.qp_context); + } } static inline void ipath_update_ack_queue(struct ipath_qp *qp, unsigned n) diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 619ad72..a197229 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -672,7 +672,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, int ipath_destroy_qp(struct ib_qp *ibqp); -void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err); +int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_udata *udata); ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
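The event only matters to a consumer such as IPoIB connected mode, which waits for it before destroying the QP. As a hedged illustration of the decision logic in this patch — a standalone userspace model, not the real ipath code; all struct and function names below are invented stand-ins — the rule appears to be: flag the last-WQE event when the QP's receive queue is not flushed locally (e.g. it uses an SRQ) and the consumer registered an event handler, and dispatch the event only after the s_lock is dropped.

```c
#include <assert.h>

#define MODEL_EV_LAST_WQE 1   /* stand-in value, not the kernel's */

/* Simplified stand-ins for the kernel structures (hypothetical). */
struct model_qp {
	int has_recv_queue;   /* non-SRQ QP: has its own receive queue to flush */
	int has_handler;      /* consumer registered an async event handler     */
	int last_event;       /* last async event delivered (0 = none)          */
};

/* Mirrors the ipath_error_qp() change: flush the work queues, and return 1
 * when a last-WQE event should be generated (no local receive queue to
 * flush, and a handler is registered). */
static int model_error_qp(struct model_qp *qp)
{
	/* ...send/receive queue flushing happens here in the real code... */
	if (qp->has_recv_queue)
		return 0;
	return qp->has_handler;
}

/* Mirrors the caller: compute lastwqe under the lock, dispatch after. */
static void model_modify_to_error(struct model_qp *qp)
{
	int lastwqe = model_error_qp(qp);
	/* ...spin_unlock_irqrestore(&qp->s_lock, flags) in the real code... */
	if (lastwqe)
		qp->last_event = MODEL_EV_LAST_WQE;
}
```

The point of returning a flag instead of firing the event inside ipath_error_qp() is that the event handler must not be invoked with the QP's s_lock held.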
[ofa-general] [PATCH 10/23] IB/ipath - generate flush CQE when QP is in error state.
From: Ralph Campbell [EMAIL PROTECTED]

Follow the IB spec (C10-96) for post send, which states that a flushed completion event should be generated when the QP is in the error state.

Signed-off-by: Ralph Campbell [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ipath/ipath_verbs.c | 22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 3cc82b6..495194b 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -230,6 +230,18 @@ void ipath_skip_sge(struct ipath_sge_state *ss, u32 length)
 	}
 }
 
+static void ipath_flush_wqe(struct ipath_qp *qp, struct ib_send_wr *wr)
+{
+	struct ib_wc wc;
+
+	memset(&wc, 0, sizeof(wc));
+	wc.wr_id = wr->wr_id;
+	wc.status = IB_WC_WR_FLUSH_ERR;
+	wc.opcode = ib_ipath_wc_opcode[wr->opcode];
+	wc.qp = &qp->ibqp;
+	ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1);
+}
+
 /**
  * ipath_post_one_send - post one RC, UC, or UD send work request
  * @qp: the QP to post on
@@ -248,8 +260,14 @@ static int ipath_post_one_send(struct ipath_qp *qp, struct ib_send_wr *wr)
 	spin_lock_irqsave(&qp->s_lock, flags);
 
 	/* Check that state is OK to post send. */
-	if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK))
-		goto bail_inval;
+	if (unlikely(!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK))) {
+		if (qp->state != IB_QPS_SQE && qp->state != IB_QPS_ERR)
+			goto bail_inval;
+		/* C10-96 says generate a flushed completion entry. */
+		ipath_flush_wqe(qp, wr);
+		ret = 0;
+		goto bail;
+	}
 
 	/* IB spec says that num_sge == 0 is OK. */
 	if (wr->num_sge > qp->s_max_sge)
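The C10-96 behavior is easy to get backwards: posting to a QP in the SQE or ERR state must *succeed* but yield a flushed completion, while posting in other non-sendable states is an error. A hedged, self-contained userspace model of that rule (stand-in names and state values, not the real ipath types):

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's states and completion codes. */
enum qp_state { QPS_RTS, QPS_SQE, QPS_ERR, QPS_RESET };
enum wc_status { WC_SUCCESS, WC_WR_FLUSH_ERR };

struct model_cq { enum wc_status last_status; int entries; };
struct model_qp { enum qp_state state; struct model_cq *send_cq; };

/* Models ipath_flush_wqe(): enter a flushed completion on the send CQ. */
static void model_flush_wqe(struct model_qp *qp)
{
	qp->send_cq->last_status = WC_WR_FLUSH_ERR;
	qp->send_cq->entries++;
}

/* Models the patched ipath_post_one_send() state check. */
static int model_post_send(struct model_qp *qp)
{
	if (qp->state == QPS_RTS) {
		qp->send_cq->last_status = WC_SUCCESS;
		qp->send_cq->entries++;
		return 0;
	}
	if (qp->state != QPS_SQE && qp->state != QPS_ERR)
		return -1;                /* -EINVAL in the real code */
	/* C10-96: generate a flushed completion entry instead of failing. */
	model_flush_wqe(qp);
	return 0;
}
```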
Re: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses
Did we ever get any confirmation that this fixed the problem that Olaf saw? No. I haven't seen a response.
[ofa-general] [PATCHES] TX batching rev2.5
Please provide feedback on the code and/or architecture. The patches are now updated to work with the latest rebased net-2.6.24 from a few hours ago. I am in travel mode so I won't have time to do more testing for the next few days; I do consider this to be stable at this point based on what I have been testing (famous last words).

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: Get rid of dev->gso_skb

What has changed since I posted last:

1) There was some cruft left over from the prep_frame feature that I forgot to remove last time; it is now gone.
2) In the shower this AM, I realized it is plausible that a batch of packets sent to the driver may all be dropped because they are badly formatted. Current drivers return NETDEV_TX_OK in all such cases, which would cause dev->hard_end_xmit() to be invoked unnecessarily. I already had a NETDEV_TX_DROPPED within the batching drivers, so I made it global and have the batching drivers return it when they drop packets. The core calls dev->hard_end_xmit() only when at least one packet makes it through.

Things I am gonna say that nobody will see (wink): Dave, please let me know if this meets your desire to let devices which are SG-capable and able to compute checksums benefit, just in case I misunderstood. Herbert, if you can look at at least patch 3 I will appreciate it (since it kills the dev->gso_skb that you introduced).

UPCOMING PATCHES
----------------
As before, more patches will follow if I get some feedback; I didn't want to overload people by dumping too many patches. Most of the patches mentioned below are ready to go; some need re-testing and others need a little porting from an earlier kernel:
- tg3 driver
- tun driver
- pktgen
- netiron driver
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver howto that I will post later today.
PERFORMANCE TESTING

The system under test is still a 2x dual-core Opteron with a couple of tg3s. A test tool generates UDP traffic of different sizes for up to 60 seconds per run, or a total of 30M packets. I have 4 threads, each running on a specific CPU, which keep all the CPUs as busy as they can sending packets targeted at a directly connected box's UDP discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming UDP packets to the discard port; this allows me to make sure that the receiver is not the bottleneck in the testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}. Each packet-size run is repeated 10 times to ensure that there are no transients, and the average of the 10 runs is then computed and collected. I also plan to run forwarding and TCP tests in the future when the dust settles. cheers, jamal
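The NETDEV_TX_DROPPED rule described earlier in this message — the core calls dev->hard_end_xmit() only when at least one packet in the batch actually made it onto the ring — can be sketched as a standalone userspace model (names and return values here are illustrative stand-ins, not the net core):

```c
#include <assert.h>

enum { TX_OK = 0, TX_BUSY = 1, TX_DROPPED = 2 };  /* NETDEV_TX_* stand-ins */

struct model_dev {
	int end_xmit_calls;   /* how often hard_end_xmit() (IO part "d") ran */
};

/* results[] holds the per-packet return of the driver's hard_start_xmit().
 * Returns how many packets were actually queued on the ring. */
static int model_batch_xmit(struct model_dev *dev, const int *results, int n)
{
	int i, sent = 0;

	for (i = 0; i < n; i++) {
		if (results[i] == TX_BUSY)
			break;           /* ring full: stop the batch here */
		if (results[i] == TX_OK)
			sent++;          /* TX_DROPPED: bad packet, skip it */
	}

	/* Only kick the DMA engine / flush IO if something was queued. */
	if (sent > 0)
		dev->end_xmit_calls++;
	return sent;
}
```

A batch that is dropped in its entirety thus never triggers the (relatively expensive) IO completion step.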
[ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface Rev2.5
This patch introduces the netdevice interface for batching. cheers, jamal [NET_BATCH] Introduce batching interface This patch introduces the netdevice interface for batching. BACKGROUND - A driver dev-hard_start_xmit() has 4 typical parts: a) packet formating (example vlan, mss, descriptor counting etc) b) chip specific formatting c) enqueueing the packet on a DMA ring d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interupts, set last tx time, etc [For code cleanliness/readability sake, regardless of this work, one should break the dev-hard_start_xmit() into those 4 functions anyways]. INTRODUCING API --- With the api introduced in this patch, a driver which has all 4 parts and needing to support batching is advised to split its dev-hard_start_xmit() in the following manner: 1)Remove #d from dev-hard_start_xmit() and put it in dev-hard_end_xmit() method. 2)#b and #c can stay in -hard_start_xmit() (or whichever way you want to do this) 3) #a is deffered to future work to reduce confusion (since it holds on its own). Note: There are drivers which may need not support any of the two approaches (example the tun driver i patched) so the methods are optional. xmit_win variable is set by the driver to tell the core how much space it has to take on new skbs. It is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we dont requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us when it invokes netif_wake_queue how much space it has for descriptors by setting this variable. Refer to the driver howto for more details. THEORY OF OPERATION --- 1. Core dequeues from qdiscs upto dev-xmit_win packets. Fragmented and GSO packets are accounted for as well. 2. Core grabs TX_LOCK 3. Core loop for all skbs: invokes driver dev-hard_start_xmit() 4. 
Core invokes driver dev-hard_end_xmit() ACKNOWLEDGEMENT AND SOME HISTORY There's a lot of history and reasoning of why batching in a document i am writting which i may submit as a patch. Thomas Graf (who doesnt know this probably) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented in SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me. Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller, and Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and neterion folks have been outstanding in their help. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit 98d39ea7922fa2719a80eecd02cae359f3d7 tree 63822bf3040ea41846399c589c912c2be654f008 parent 7b4cd20628fe5c4e145c383fcd8d954d38f7be61 author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400 include/linux/netdevice.h |9 +- net/core/dev.c| 67 ++--- net/sched/sch_generic.c |4 +-- 3 files changed, 73 insertions(+), 7 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 91cd3f3..b0e71c9 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -86,6 +86,7 @@ struct wireless_dev; /* Driver transmit return codes */ #define NETDEV_TX_OK 0 /* driver took care of packet */ #define NETDEV_TX_BUSY 1 /* driver tx path was busy*/ +#define NETDEV_TX_DROPPED 2 /* driver tx path dropped packet*/ #define NETDEV_TX_LOCKED -1 /* driver tx lock was already taken */ /* @@ -467,6 +468,7 @@ struct net_device #define NETIF_F_NETNS_LOCAL 8192 /* Does not 
change network namespaces */ #define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ #define NETIF_F_LRO 32768 /* large receive offload */ +#define NETIF_F_BTX 65536 /* Capable of batch tx */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 @@ -595,6 +597,9 @@ struct net_device void *priv; /* pointer to private data */ int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); + void (*hard_end_xmit) (struct net_device *dev); + int xmit_win; + /* These may be needed for future network-power-down code. */ unsigned long trans_start; /* Time (in jiffies) of last Tx */ @@ -609,6 +614,7 @@ struct net_device /* delayed register/unregister */ struct list_head todo_list; + struct sk_buff_head blist; /* device index hash chain */ struct hlist_node index_hlist; @@ -1043,7 +1049,8 @@
[ofa-general] [PATCH 2/3][NET_BATCH] Rev2.5 net core use batching
This patch adds the usage of batching within the core. cheers, jamal [NET_BATCH] net core use batching This patch adds the usage of batching within the core. Performance results demonstrating improvement are provided separately. I have #if-0ed some of the old functions so the patch is more readable. A future patch will remove all if-0ed content. Patrick McHardy eyeballed a bug that will cause re-ordering in case of a requeue. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit c73d8ee8cce61a98f55fbfb2cafe813a7eca472c tree 8b9155fe15baa4c2e7adb69585c7aa275a6bc896 parent 98d39ea7922fa2719a80eecd02cae359f3d7 author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400 net/sched/sch_generic.c | 104 ++- 1 files changed, 94 insertions(+), 10 deletions(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 424c08b..d98c680 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q) return q-q.qlen; } +#if 0 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { @@ -110,6 +111,85 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, return ret; } +#endif + +static inline int handle_dev_cpu_collision(struct net_device *dev) +{ + if (unlikely(dev-xmit_lock_owner == smp_processor_id())) { + if (net_ratelimit()) + printk(KERN_WARNING +Dead loop on netdevice %s, fix it urgently!\n, +dev-name); + return 1; + } + __get_cpu_var(netdev_rx_stat).cpu_collision++; + return 0; +} + +static inline int +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + + struct sk_buff *skb; + + while ((skb = __skb_dequeue_tail(skbs)) != NULL) + q-ops-requeue(skb, q); + + netif_schedule(dev); + return 0; +} + +static inline int +xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + int ret 
= handle_dev_cpu_collision(dev); + + if (ret) { + if (!skb_queue_empty(skbs)) + skb_queue_purge(skbs); + return qdisc_qlen(q); + } + + return dev_requeue_skbs(skbs, dev, q); +} + +static int xmit_count_skbs(struct sk_buff *skb) +{ + int count = 0; + for (; skb; skb = skb-next) { + count += skb_shinfo(skb)-nr_frags; + count += 1; + } + return count; +} + +static int xmit_get_pkts(struct net_device *dev, + struct Qdisc *q, + struct sk_buff_head *pktlist) +{ + struct sk_buff *skb; + int count = dev-xmit_win; + + if (count dev-gso_skb) { + skb = dev-gso_skb; + dev-gso_skb = NULL; + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + while (count 0) { + skb = q-dequeue(q); + if (!skb) + break; + + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + return skb_queue_len(pktlist); +} /* * NOTE: Called under dev-queue_lock with locally disabled BH. @@ -133,19 +213,20 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, static inline int qdisc_restart(struct net_device *dev) { struct Qdisc *q = dev-qdisc; - struct sk_buff *skb; - int ret, xcnt = 0; + int ret = 0; - /* Dequeue packet */ - if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) - return 0; + /* Dequeue packets */ + ret = xmit_get_pkts(dev, q, dev-blist); + if (!ret) + return 0; - /* And release queue */ + /* We got em packets */ spin_unlock(dev-queue_lock); + /* bye packets */ HARD_TX_LOCK(dev, smp_processor_id()); - ret = dev_hard_start_xmit(skb, dev, xcnt); + ret = dev_batch_xmit(dev); HARD_TX_UNLOCK(dev); spin_lock(dev-queue_lock); @@ -158,8 +239,8 @@ static inline int qdisc_restart(struct net_device *dev) break; case NETDEV_TX_LOCKED: - /* Driver try lock failed */ - ret = handle_dev_cpu_collision(skb, dev, q); + /* Driver lock failed */ + ret = xmit_islocked(dev-blist, dev, q); break; default: @@ -168,7 +249,7 @@ static inline int qdisc_restart(struct net_device *dev) printk(KERN_WARNING BUG %s code %d qlen %d\n, dev-name, ret, q-q.qlen); - ret = 
dev_requeue_skb(skb, dev, q); + ret = dev_requeue_skbs(dev-blist, dev, q); break; } @@ -564,6 +645,9 @@ void dev_deactivate(struct net_device *dev) skb = dev-gso_skb; dev-gso_skb = NULL; + if (!skb_queue_empty(dev-blist)) + skb_queue_purge(dev-blist); + dev-xmit_win = 1; spin_unlock_bh(dev-queue_lock); kfree_skb(skb); ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
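The core of this patch is the window accounting in xmit_count_skbs()/xmit_get_pkts(): each skb is charged one descriptor plus one per fragment against dev->xmit_win. As a hedged, self-contained userspace model of that accounting (stand-in types, not real skbs — the real code also folds dev->gso_skb into the first iteration):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal skb stand-in: singly linked queue with a fragment count. */
struct model_skb {
	struct model_skb *next;
	int nr_frags;
};

/* Mirrors xmit_count_skbs(): 1 descriptor per skb plus one per fragment. */
static int model_count_skbs(struct model_skb *skb)
{
	int count = 0;

	for (; skb; skb = skb->next)
		count += 1 + skb->nr_frags;
	return count;
}

/* Mirrors xmit_get_pkts(): pull packets from the queue until the device's
 * advertised window worth of descriptors is consumed (the last packet may
 * overshoot, just as `count` can go negative in the patch). */
static int model_get_pkts(struct model_skb **queue, int xmit_win,
			  struct model_skb **batch, int batch_max)
{
	int n = 0, count = xmit_win;

	while (count > 0 && *queue && n < batch_max) {
		struct model_skb *skb = *queue;

		*queue = skb->next;
		skb->next = NULL;
		count -= model_count_skbs(skb);
		batch[n++] = skb;
	}
	return n;
}
```

Because the driver promised at least xmit_win descriptors, everything dequeued here can be handed to hard_start_xmit() without needing a requeue on the common path.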
[ofa-general] [PATCH 3/3][NET_BATCH] Rev2.5 kill dev-gso_skb
This patch removes dev-gso_skb as it is no longer necessary with batching code. cheers, jamal [NET_BATCH] kill dev-gso_skb The batching code does what gso used to batch at the drivers. There is no more need for gso_skb. If for whatever reason the requeueing is a bad idea we are going to leave packets in dev-blist (and still not need dev-gso_skb) Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit fac8a4147548f314d4edb74634e78e5b06e0e135 tree 72114acb327bc7e3eb219275df6b3aab7459795c parent c73d8ee8cce61a98f55fbfb2cafe813a7eca472c author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:22:43 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:22:43 -0400 include/linux/netdevice.h |3 --- net/sched/sch_generic.c | 12 2 files changed, 0 insertions(+), 15 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b0e71c9..7592a56 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -578,9 +578,6 @@ struct net_device struct list_head qdisc_list; unsigned long tx_queue_len; /* Max frames per queue allowed */ - /* Partially transmitted GSO packet. 
*/ - struct sk_buff *gso_skb; - /* ingress path synchronizer */ spinlock_t ingress_lock; struct Qdisc *qdisc_ingress; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index d98c680..36b6972 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev, struct sk_buff *skb; int count = dev-xmit_win; - if (count dev-gso_skb) { - skb = dev-gso_skb; - dev-gso_skb = NULL; - count -= xmit_count_skbs(skb); - __skb_queue_tail(pktlist, skb); - } - while (count 0) { skb = q-dequeue(q); if (!skb) @@ -635,7 +628,6 @@ void dev_activate(struct net_device *dev) void dev_deactivate(struct net_device *dev) { struct Qdisc *qdisc; - struct sk_buff *skb; spin_lock_bh(dev-queue_lock); qdisc = dev-qdisc; @@ -643,15 +635,11 @@ void dev_deactivate(struct net_device *dev) qdisc_reset(qdisc); - skb = dev-gso_skb; - dev-gso_skb = NULL; if (!skb_queue_empty(dev-blist)) skb_queue_purge(dev-blist); dev-xmit_win = 1; spin_unlock_bh(dev-queue_lock); - kfree_skb(skb); - dev_watchdog_down(dev); /* Wait for outstanding dev_queue_xmit calls. */ ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCHES] TX batching rev2.5
On Tue, 2007-09-10 at 18:07 -0400, jamal wrote:
> Please provide feedback on the code and/or architecture. They are now
> updated to work with the latest rebased net-2.6.24 from a few hours ago.

I should have added that I have tested this with just the batching changes, and it is within the performance realm of the changes from yesterday. If anyone wants exact numbers, I can send them.

cheers, jamal
[ofa-general] [DOC][NET_BATCH]Rev2.5 Driver Howto
I updated this doc to match the latest patch. cheers, jamal

Here's the beginning of a howto for driver authors. The intended audience for this howto is people already familiar with netdevices.

1.0 Netdevice Prerequisites
---------------------------

For hardware-based netdevices, you must have at least hardware that is capable of doing DMA with many descriptors; i.e., having hardware with a queue length of 3 (as in some fscked ethernet hardware) is not very useful in this case.

2.0 What is new in the driver API
---------------------------------

There is 1 new method and one new variable introduced that the driver author needs to be aware of. These are:
1) dev->hard_end_xmit()
2) dev->xmit_win

2.1 Using core driver changes
-----------------------------

To provide context, let's look at a typical driver abstraction for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interrupts, etc.

[For code cleanliness/readability sake, regardless of this work, one should break dev->hard_start_xmit() into those 4 functional blocks anyways.]

A driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_end_xmit() method to achieve #d
2) use dev->xmit_win to tell the core how much space you have.

#b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this). Section 3 shows more details on the suggested usage.

2.1.1 Theory of operation
-------------------------

1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented and GSO packets are accounted for as well.
2. Core grabs the device's TX_LOCK.
3. Core loops over all skbs, invoking the driver's dev->hard_start_xmit().
4. Core invokes the driver's dev->hard_end_xmit() if packets were xmitted.

2.1.1.1 The slippery LLTX
-------------------------

Since these types of drivers are being phased out and they require extra code, they will not be supported anymore. So as of Oct 2007 the code that supports them has been removed.

2.1.1.2 xmit_win
----------------

The dev->xmit_win variable is set by the driver to tell us how much space it has in its rings/queues. This detail is then used to figure out how many packets are retrieved from the qdisc queues (in order to send to the driver). dev->xmit_win is introduced to ensure that when we pass the driver a list of packets it will swallow all of them -- which is useful because we don't requeue to the qdisc (and avoids burning unnecessary CPU cycles or introducing any strange re-ordering). Essentially the driver signals us how much space it has for descriptors by setting this variable.

2.1.1.2.1 Setting xmit_win
--------------------------

This variable should be set during xmit path shutdown (netif_stop), wakeup (netif_wake) and ->hard_end_xmit(). In the first case the value is set to 1, and in the other two it is set to whatever the driver deems to be available space on the ring.

3.0 Driver Essentials
---------------------

The typical driver tx state machine is:

-1- +Core sends packets
    +-- Driver puts packet onto hardware queue
    +   if hardware queue is full, netif_stop_queue(dev)
-2- +core stops sending because of netif_stop_queue(dev)
    ..
    .. time passes ...
    ..
-3- +--- driver has transmitted packets, opens up tx path by
    invoking netif_wake_queue(dev)
-1- +Cycle repeats and core sends more packets (step 1).

3.1 Driver prerequisite
-----------------------

This is _a very important_ requirement in making batching useful. The prerequisite for the batching changes is that the driver should provide a low threshold to open up the tx path. Drivers such as tg3 and e1000 already do this. Before you invoke netif_wake_queue(dev) you check if a threshold of space has been reached to insert new packets.

Here's an example of how I added it to the tun driver. Observe the setting of dev->xmit_win.

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Here's how the batching e1000 driver does it:

---
	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
		if (netif_queue_stopped(netdev)) {
			int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
			netdev->xmit_win = rspace;
			netif_wake_queue(netdev);
		}
	}
---

The equivalent in the tg3 code (with no batching changes) looks like:

---
	if (netif_queue_stopped(tp->dev) &&
	    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
		netif_wake_queue(tp->dev);
---
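The low-threshold wake logic from section 3.1 and the xmit_win setting from section 2.1.1.2.1 can be combined into one self-contained userspace sketch. The constants and names here are illustrative (not from any real driver); the shape follows the e1000-style example above:

```c
#include <assert.h>

/* Illustrative sizing, not from a real driver. */
#define RING_SIZE   256
#define WAKE_THRESH  32  /* low threshold that reopens the tx path early */

struct model_dev {
	int ring_used;   /* descriptors currently in flight                */
	int stopped;     /* netif_queue_stopped() stand-in                 */
	int xmit_win;    /* space advertised to the batching core          */
};

static int ring_unused(const struct model_dev *d)
{
	return RING_SIZE - d->ring_used;
}

/* Models the tx-completion path: once enough descriptors have drained,
 * advertise the free ring space in xmit_win and wake the queue. */
static void model_tx_complete(struct model_dev *d, int cleaned)
{
	d->ring_used -= cleaned;
	if (d->stopped && ring_unused(d) >= WAKE_THRESH) {
		d->xmit_win = ring_unused(d);
		d->stopped = 0;          /* netif_wake_queue() stand-in */
	}
}
```

Waking early with an accurate xmit_win is what lets the core hand over a whole batch that the ring is guaranteed to accept.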
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Before you add new entries to your list, how is that ibm driver NAPI conversion coming along? :-) OK, thanks for the kick in the pants, I have a couple of patches for net-2.6.24 coming (including an unrelated trivial warning fix for IPoIB). - R.
[ofa-general] [PATCH 1/1] IB/SDP - Zero copy bcopy support
This patch adds zero copy send support to SDP. Below 2K transfer size, it is better to bcopy. With larger transfers, this is a net win on bandwidth. Latency testing is yet to be done. BCOPYBZCOPY 1K TCP_STREAM 3555 Mb/sec 2250 Mb/sec 2K TCP_STREAM 3650 Mb/sec 3785 Mb/sec 4K TCP_STREAM 3560 Mb/sec 6220 Mb/sec 8K TCP_STREAM 3555 Mb/sec 6190 Mb/sec 16K TCP_STREAM 5100 Mb/sec 6155 Mb/sec 1M TCP_STREAM 4630 Mb/sec 6210 Mb/sec Performance work still remains. Open issues include correct setsockopt defines (use previous SDP values?), code cleanup, performance tuning, rigorous regression testing, and multi-OS build+test. Simple testing to date includes netperf and iperf, ^C recovery, unload/load, and checking for gross memory leaks on Rhat4u4. Signed-off-by: Jim Mott [EMAIL PROTECTED] --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp.h === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp.h2007-10-08 08:20:57.0 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp.h 2007-10-08 08:31:41.0 -0500 @@ -50,6 +50,9 @@ extern int sdp_data_debug_level; #define SDP_HEAD_SIZE (PAGE_SIZE / 2 + sizeof(struct sdp_bsdh)) #define SDP_NUM_WC 4 +#define SDP_MIN_ZCOPY_THRESH1024 +#define SDP_MAX_ZCOPY_THRESH 1048576 + #define SDP_OP_RECV 0x8LL enum sdp_mid { @@ -70,6 +73,13 @@ enum { SDP_MIN_BUFS = 2 }; +enum { + SDP_ERR_ERROR = -4, + SDP_ERR_FAULT = -3, + SDP_NEW_SEG = -2, + SDP_DO_WAIT_MEM = -1 +}; + struct rdma_cm_id; struct rdma_cm_event; @@ -148,6 +158,9 @@ struct sdp_sock { int recv_frags; int send_frags; + /* ZCOPY data */ + int zcopy_thresh; + struct ib_sge ibsge[SDP_MAX_SEND_SKB_FRAGS + 1]; struct ib_wc ibwc[SDP_NUM_WC]; }; Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_main.c === --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_main.c 2007-10-08 08:21:05.0 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_main.c2007-10-09 16:52:34.0 -0500 @@ -65,6 +65,16 @@ unsigned int csum_partial_copy_from_user #include sdp.h #include 
linux/delay.h +struct bzcopy_state { + unsigned char __user *u_base; + intu_len; + int left; + intpage_cnt; + intcur_page; + intcur_offset; + struct page **pages; +}; + MODULE_AUTHOR(Michael S. Tsirkin); MODULE_DESCRIPTION(InfiniBand SDP module); MODULE_LICENSE(Dual BSD/GPL); @@ -117,6 +127,10 @@ static int send_poll_thresh = 8192; module_param_named(send_poll_thresh, send_poll_thresh, int, 0644); MODULE_PARM_DESC(send_poll_thresh, Send message size thresh hold over which to start polling.); +static int sdp_zcopy_thresh = 0; +module_param_named(sdp_zcopy_thresh, sdp_zcopy_thresh, int, 0644); +MODULE_PARM_DESC(sdp_zcopy_thresh, Zero copy send threshold; 0=0ff.); + struct workqueue_struct *sdp_workqueue; static struct list_head sock_list; @@ -867,6 +881,12 @@ static int sdp_setsockopt(struct sock *s sdp_push_pending_frames(sk); } break; + case SDP_ZCOPY_THRESH: + if (val SDP_MIN_ZCOPY_THRESH || val SDP_MAX_ZCOPY_THRESH) + err = -EINVAL; + else + ssk-zcopy_thresh = val; + break; default: err = -ENOPROTOOPT; break; @@ -904,6 +924,9 @@ static int sdp_getsockopt(struct sock *s case TCP_CORK: val = !!(ssk-nonagleTCP_NAGLE_CORK); break; + case SDP_ZCOPY_THRESH: + val = ssk-zcopy_thresh ? ssk-zcopy_thresh : sdp_zcopy_thresh; + break; default: return -ENOPROTOOPT; } @@ -1051,10 +1074,252 @@ void sdp_push_one(struct sock *sk, unsig { } -/* Like tcp_sendmsg */ -/* TODO: check locking */ +static struct bzcopy_state *sdp_bz_cleanup(struct bzcopy_state *bz) +{ + int i; + + if (bz-pages) { + for (i = bz-cur_page; i bz-page_cnt; i++) + put_page(bz-pages[i]); + + kfree(bz-pages); + } + + kfree(bz); + + return NULL; +} + + +static struct bzcopy_state *sdp_bz_setup(struct sdp_sock *ssk, +unsigned char __user *base, +int len, +int size_goal) +{ + struct bzcopy_state *bz; + unsigned long addr; + unsigned long locked, locked_limit; + int done_pages; + int thresh; + + thresh = ssk-zcopy_thresh ? : sdp_zcopy_thresh; + if
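The thresholding in this patch has two pieces: setsockopt() validates SDP_ZCOPY_THRESH against the 1K–1M bounds, and the send path falls back to the module-wide sdp_zcopy_thresh when the socket has none (`thresh = ssk->zcopy_thresh ?: sdp_zcopy_thresh`). A hedged userspace model of that selection — the `>=` comparison and names are my reading of the patch, not a verified copy of the SDP code:

```c
#include <assert.h>

#define SDP_MIN_ZCOPY_THRESH 1024
#define SDP_MAX_ZCOPY_THRESH 1048576

static int sdp_zcopy_thresh_mod = 0;  /* module parameter stand-in; 0 = off */

/* Models the setsockopt(SDP_ZCOPY_THRESH) bounds check. */
static int set_zcopy_thresh(int *sock_thresh, int val)
{
	if (val < SDP_MIN_ZCOPY_THRESH || val > SDP_MAX_ZCOPY_THRESH)
		return -1;               /* -EINVAL in the real code */
	*sock_thresh = val;
	return 0;
}

/* Per-socket value wins; otherwise fall back to the module parameter. */
static int effective_thresh(int sock_thresh)
{
	return sock_thresh ? sock_thresh : sdp_zcopy_thresh_mod;
}

/* Zero-copy only when a threshold is configured and the transfer is at
 * least that large; below it, bcopy wins (see the numbers above). */
static int use_zcopy(int sock_thresh, int len)
{
	int thresh = effective_thresh(sock_thresh);

	return thresh != 0 && len >= thresh;
}
```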
[ofa-general] [PATCH 1/4] IPoIB: Fix unused variable warning
The conversion to use netdevice internal stats left an unused variable in ipoib_neigh_free(), since there's no longer any reason to get netdev_priv() in order to increment dropped packets. Delete the unused priv variable.

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 6b1b4b2..855c9de 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -854,7 +854,6 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
 void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct sk_buff *skb;
 	*to_ipoib_neigh(neigh->neighbour) = NULL;
 	while ((skb = __skb_dequeue(&neigh->queue))) {
[ofa-general] [PATCH 2/4] ibm_emac: Convert to use napi_struct independent of struct net_device
Commit da3dedd9 ([NET]: Make NAPI polling independent of struct net_device objects.) changed the interface to NAPI polling. Fix up the ibm_emac driver so that it works with this new interface. This is actually a nice cleanup because ibm_emac is one of the drivers that wants to have multiple NAPI structures for a single net_device. Tested with the internal MAC of a PowerPC 440SPe SoC with an AMCC 'Yucca' evaluation board. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/net/ibm_emac/ibm_emac_mal.c | 48 -- drivers/net/ibm_emac/ibm_emac_mal.h |2 +- include/linux/netdevice.h | 10 +++ 3 files changed, 28 insertions(+), 32 deletions(-) diff --git a/drivers/net/ibm_emac/ibm_emac_mal.c b/drivers/net/ibm_emac/ibm_emac_mal.c index cabd984..cc3ddc9 100644 --- a/drivers/net/ibm_emac/ibm_emac_mal.c +++ b/drivers/net/ibm_emac/ibm_emac_mal.c @@ -207,10 +207,10 @@ static irqreturn_t mal_serr(int irq, void *dev_instance) static inline void mal_schedule_poll(struct ibm_ocp_mal *mal) { - if (likely(netif_rx_schedule_prep(mal-poll_dev))) { + if (likely(napi_schedule_prep(mal-napi))) { MAL_DBG2(%d: schedule_poll NL, mal-def-index); mal_disable_eob_irq(mal); - __netif_rx_schedule(mal-poll_dev); + __napi_schedule(mal-napi); } else MAL_DBG2(%d: already in poll NL, mal-def-index); } @@ -273,11 +273,11 @@ static irqreturn_t mal_rxde(int irq, void *dev_instance) return IRQ_HANDLED; } -static int mal_poll(struct net_device *ndev, int *budget) +static int mal_poll(struct napi_struct *napi, int budget) { - struct ibm_ocp_mal *mal = ndev-priv; + struct ibm_ocp_mal *mal = container_of(napi, struct ibm_ocp_mal, napi); struct list_head *l; - int rx_work_limit = min(ndev-quota, *budget), received = 0, done; + int received = 0; MAL_DBG2(%d: poll(%d) %d - NL, mal-def-index, *budget, rx_work_limit); @@ -295,38 +295,34 @@ static int mal_poll(struct net_device *ndev, int *budget) list_for_each(l, mal-poll_list) { struct mal_commac *mc = list_entry(l, struct mal_commac, poll_list); - int n = 
mc-ops-poll_rx(mc-dev, rx_work_limit); + int n = mc-ops-poll_rx(mc-dev, budget); if (n) { received += n; - rx_work_limit -= n; - if (rx_work_limit = 0) { - done = 0; + budget -= n; + if (budget = 0) goto more_work; // XXX What if this is the last one ? - } } } /* We need to disable IRQs to protect from RXDE IRQ here */ local_irq_disable(); - __netif_rx_complete(ndev); + __napi_complete(napi); mal_enable_eob_irq(mal); local_irq_enable(); - done = 1; - /* Check for rotting packet(s) */ list_for_each(l, mal-poll_list) { struct mal_commac *mc = list_entry(l, struct mal_commac, poll_list); if (unlikely(mc-ops-peek_rx(mc-dev) || mc-rx_stopped)) { MAL_DBG2(%d: rotting packet NL, mal-def-index); - if (netif_rx_reschedule(ndev, received)) + if (napi_reschedule(napi)) mal_disable_eob_irq(mal); else MAL_DBG2(%d: already in poll list NL, mal-def-index); - if (rx_work_limit 0) + if (budget 0) goto again; else goto more_work; @@ -335,12 +331,8 @@ static int mal_poll(struct net_device *ndev, int *budget) } more_work: - ndev-quota -= received; - *budget -= received; - - MAL_DBG2(%d: poll() %d - %d NL, mal-def-index, *budget, -done ? 0 : 1); - return done ? 0 : 1; + MAL_DBG2(%d: poll() %d - %d NL, mal-def-index, budget, received); + return received; } static void mal_reset(struct ibm_ocp_mal *mal) @@ -425,11 +417,8 @@ static int __init mal_probe(struct ocp_device *ocpdev) mal-def = ocpdev-def; INIT_LIST_HEAD(mal-poll_list); - set_bit(__LINK_STATE_START, mal-poll_dev.state); - mal-poll_dev.weight = CONFIG_IBM_EMAC_POLL_WEIGHT; - mal-poll_dev.poll = mal_poll; - mal-poll_dev.priv = mal; - atomic_set(mal-poll_dev.refcnt, 1); + mal-napi.weight = CONFIG_IBM_EMAC_POLL_WEIGHT; + mal-napi.poll = mal_poll; INIT_LIST_HEAD(mal-list); @@ -520,11 +509,8 @@ static void __exit mal_remove(struct ocp_device *ocpdev) MAL_DBG(%d: remove NL,
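The heart of this conversion is the new NAPI poll contract: mal_poll() now receives a plain `budget`, spreads it across its channels (mal_commac entries), and returns the number of packets received; the caller keeps polling while the budget gets exhausted. A simplified, self-contained userspace model of that budget accounting (stand-in types; the real code also handles rotting packets and IRQ re-enable, omitted here):

```c
#include <assert.h>

/* Stand-in for one mal_commac receive source. */
struct model_chan { int pending; };

/* Models the new-style poll: consume up to `budget` packets across all
 * channels and return how many were received. If the budget is not
 * exhausted, the real code would __napi_complete() and re-enable the
 * end-of-buffer interrupt before returning. */
static int model_poll(struct model_chan *chans, int nchans, int budget)
{
	int i, received = 0;

	for (i = 0; i < nchans && budget > 0; i++) {
		int n = chans[i].pending < budget ? chans[i].pending : budget;

		chans[i].pending -= n;
		received += n;
		budget -= n;
	}
	return received;
}
```

Returning `received` (rather than the old done/not-done flag plus quota bookkeeping) is exactly what lets one napi_struct serve several channels independently of any net_device.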
[ofa-general] [PATCH 3/4] ibm_new_emac: Nuke SET_MODULE_OWNER() use
Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/net/ibm_newemac/core.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/drivers/net/ibm_newemac/core.c b/drivers/net/ibm_newemac/core.c index ce127b9..8ea5009 100644 --- a/drivers/net/ibm_newemac/core.c +++ b/drivers/net/ibm_newemac/core.c @@ -2549,7 +2549,6 @@ static int __devinit emac_probe(struct of_device *ofdev, dev->ndev = ndev; dev->ofdev = ofdev; dev->blist = blist; - SET_MODULE_OWNER(ndev); SET_NETDEV_DEV(ndev, &ofdev->dev); /* Initialize some embedded data structures */ ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH 4/4] ibm_emac: Convert to use napi_struct independent of struct net_device
Commit da3dedd9 ([NET]: Make NAPI polling independent of struct net_device objects.) changed the interface to NAPI polling. Fix up the ibm_newemac driver so that it works with this new interface. This is actually a nice cleanup because ibm_newemac is one of the drivers that wants to have multiple NAPI structures for a single net_device. Compile-tested only as I don't have a system that uses the ibm_newemac driver. This conversion the conversion for the ibm_emac driver that was tested on real PowerPC 440SPe hardware. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/net/ibm_newemac/mal.c | 55 ++-- drivers/net/ibm_newemac/mal.h |2 +- 2 files changed, 20 insertions(+), 37 deletions(-) diff --git a/drivers/net/ibm_newemac/mal.c b/drivers/net/ibm_newemac/mal.c index c4335b7..5885411 100644 --- a/drivers/net/ibm_newemac/mal.c +++ b/drivers/net/ibm_newemac/mal.c @@ -235,10 +235,10 @@ static irqreturn_t mal_serr(int irq, void *dev_instance) static inline void mal_schedule_poll(struct mal_instance *mal) { - if (likely(netif_rx_schedule_prep(mal-poll_dev))) { + if (likely(napi_schedule_prep(mal-napi))) { MAL_DBG2(mal, schedule_poll NL); mal_disable_eob_irq(mal); - __netif_rx_schedule(mal-poll_dev); + __napi_schedule(mal-napi); } else MAL_DBG2(mal, already in poll NL); } @@ -318,8 +318,7 @@ void mal_poll_disable(struct mal_instance *mal, struct mal_commac *commac) msleep(1); /* Synchronize with the MAL NAPI poller. */ - while (test_bit(__LINK_STATE_RX_SCHED, mal-poll_dev.state)) - msleep(1); + napi_disable(mal-napi); } void mal_poll_enable(struct mal_instance *mal, struct mal_commac *commac) @@ -330,11 +329,11 @@ void mal_poll_enable(struct mal_instance *mal, struct mal_commac *commac) // XXX might want to kick a poll now... 
} -static int mal_poll(struct net_device *ndev, int *budget) +static int mal_poll(struct napi_struct *napi, int budget) { - struct mal_instance *mal = netdev_priv(ndev); + struct mal_instance *mal = container_of(napi, struct mal_instance, napi); struct list_head *l; - int rx_work_limit = min(ndev-quota, *budget), received = 0, done; + int received = 0; unsigned long flags; MAL_DBG2(mal, poll(%d) %d - NL, *budget, @@ -358,26 +357,21 @@ static int mal_poll(struct net_device *ndev, int *budget) int n; if (unlikely(test_bit(MAL_COMMAC_POLL_DISABLED, mc-flags))) continue; - n = mc-ops-poll_rx(mc-dev, rx_work_limit); + n = mc-ops-poll_rx(mc-dev, budget); if (n) { received += n; - rx_work_limit -= n; - if (rx_work_limit = 0) { - done = 0; - // XXX What if this is the last one ? - goto more_work; - } + budget -= n; + if (budget = 0) + goto more_work; // XXX What if this is the last one ? } } /* We need to disable IRQs to protect from RXDE IRQ here */ spin_lock_irqsave(mal-lock, flags); - __netif_rx_complete(ndev); + __napi_complete(napi); mal_enable_eob_irq(mal); spin_unlock_irqrestore(mal-lock, flags); - done = 1; - /* Check for rotting packet(s) */ list_for_each(l, mal-poll_list) { struct mal_commac *mc = @@ -387,12 +381,12 @@ static int mal_poll(struct net_device *ndev, int *budget) if (unlikely(mc-ops-peek_rx(mc-dev) || test_bit(MAL_COMMAC_RX_STOPPED, mc-flags))) { MAL_DBG2(mal, rotting packet NL); - if (netif_rx_reschedule(ndev, received)) + if (napi_reschedule(napi)) mal_disable_eob_irq(mal); else MAL_DBG2(mal, already in poll list NL); - if (rx_work_limit 0) + if (budget 0) goto again; else goto more_work; @@ -401,13 +395,8 @@ static int mal_poll(struct net_device *ndev, int *budget) } more_work: - ndev-quota -= received; - *budget -= received; - - MAL_DBG2(mal, poll() %d - %d NL, *budget, -done ? 0 : 1); - - return done ? 
0 : 1; + MAL_DBG2(mal, poll() %d - %d NL, budget, received); + return received; } static void mal_reset(struct mal_instance *mal) @@ -538,11 +527,8 @@ static int __devinit mal_probe(struct of_device *ofdev, }
Re: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch
I skipped over most of the code restructuring comments and focus mainly on design or issues. (Although code restructuring patches tend not to be written or easily accepted unless they fix a bug, and I would personally like to see at least some of the ones previously mentioned addressed before this code is merged. The ones listed below should be trivial to incorporate before merging.) This version incorporates some of Sean's comments, especially relating to locking. Sean's comments regarding module parameters, code restructure, ipoib_cm_rx state and the like will require more discussion and subsequent testing. They will be addressed with additional set of patches later on. This patch has been tested with linux-2.6.23-rc5 derived from Roland's for-2.6.24 git tree on ppc64 machines using IBM HCA. Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED] --- --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-31 12:14:30.0 -0500 +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 14:31:07.0 -0500 @@ -95,11 +95,13 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, }; +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) #defineIPOIB_OP_RECV (1ul 31) + #ifdef CONFIG_INFINIBAND_IPOIB_CM -#defineIPOIB_CM_OP_SRQ (1ul 30) +#defineIPOIB_CM_OP_RECV (1ul 30) #else -#defineIPOIB_CM_OP_SRQ (0) +#defineIPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -166,11 +168,14 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp*qp; - struct list_head list; - struct net_device *dev; - unsigned longjiffies; + struct ib_cm_id *id; + struct ib_qp*qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by no srq only */ + struct list_head list; + struct net_device *dev; + unsigned longjiffies; + u32 index; /* wr_ids are distinguished by index +* to identify the QP -no srq only */ enum ipoib_cm_state state; }; @@ -215,6 +220,8 @@ struct ipoib_cm_dev_priv { struct ib_wcibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + 
struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* @@ -438,6 +445,7 @@ void ipoib_drain_cq(struct net_device *d /* We don't support UC connections at the moment */ #define IPOIB_CM_SUPPORTED(ha) (ha[0] (IPOIB_FLAGS_RC)) +extern int max_rc_qp; static inline int ipoib_cm_admin_enabled(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-31 12:14:30.0 -0500 +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-18 17:04:06.0 -0500 @@ -49,6 +49,18 @@ MODULE_PARM_DESC(cm_data_debug_level, #include ipoib.h +int max_rc_qp = 128; +static int max_recv_buf = 1024; /* Default is 1024 MB */ + +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0444); +MODULE_PARM_DESC(nosrq_max_rc_qp, Max number of no srq RC QPs supported; must be a power of 2); I thought you were going to remove the power of 2 restriction. And to re-start this discussion, I think we should separate the maximum number of QPs from whether we use SRQ, and let the QP type (UD, UC, RC) be controllable. Smaller clusters may perform better without using SRQ, even if it is available. And supporting UC versus RC seems like it should only take a few additional lines of code. +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); +MODULE_PARM_DESC(max_receive_buffer, Max Receive Buffer Size in MB); Based on your response to my feedback, it sounds like the only reason we're keeping this parameter around is in case the admin sets some of the other values (max QPs, message size, RQ size) incorrectly. I agree with Roland that we need to come up with the correct user interface here, and I'm not convinced that what we have is the most adaptable for where the code could go. What about replacing the 2 proposed parameters with these 3? 
qp_type - ud, uc, rc use_srq - yes/no (default if available) max_conn_qp - uc or rc limit + +static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for no srq */ + +#define NOSRQ_INDEX_MASK (max_rc_qp -1) Just reserve lower bits of the wr_id for the rx_table to avoid the power of 2 restriction. #define IPOIB_CM_IETF_ID 0x1000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -81,20 +93,21 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv-ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int
Re: [ofa-general] [PATCH 4/4] ibm_emac: Convert to use napi_struct independent of struct net_device
Sorry... wrong subject here; it should have been "ibm_newemac: ..." - R.
[ofa-general] Re: [PATCH 1/4] IPoIB: Fix unused variable warning
From: Roland Dreier [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 15:46:13 -0700 The conversion to use netdevice internal stats left an unused variable in ipoib_neigh_free(), since there's no longer any reason to get netdev_priv() in order to increment dropped packets. Delete the unused priv variable. Signed-off-by: Roland Dreier [EMAIL PROTECTED] Jeff, do you want to merge in Roland's 4 patches to your tree then do a sync with me so I can pull it all in from you? Alternatively, I can merge in Roland's work directly if that's easier for you. Just let me know.
Re: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch
Sean, Roland, I looked through Sean's latest comments. Yes, they are fairly easy to fix and I will fix them. The only one that might need some debate is the one associated with module parameters. In previous communications with Roland I got the impression that he wants to keep them (module parameters) at a minimum. So, how do we address that now? Last time around (after Sean's comments) I just addressed the bugs and skipped the rest since I had no idea as to how much time I had for the merge. These days I do not have exclusive access to the machines with IB adapters, limiting the work I can do at a stretch. How much time do I have before this gets merged into the 2.6.24 tree? Other than the module parameters one I should be able to address the rest either by this evening (west coast US) or maybe in the morning/afternoon tomorrow. Will that be acceptable? Pradeep
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: jamal [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 17:56:46 -0400 if the h/ware queues are full because of link pressure etc, you drop. We drop today when the s/ware queues are full. The driver txmit lock takes place of the qdisc queue lock etc. I am assuming there is still need for that locking. The filter/classification scheme still works as is and select classes which map to rings. tc still works as is etc. I understand your suggestion. We have to keep in mind, however, that the sw queue right now is 1000 packets. I heavily discourage any driver author to try and use any single TX queue of that size. Which means that just dropping on back pressure might not work so well. Or it might be perfect and signal TCP to backoff, who knows! :-) While working out this issue in my mind, it occurred to me that we can put the sw queue into the driver as well. The idea is that the network stack, as in the pure hw queue scheme, unconditionally always submits new packets to the driver. Therefore even if the hw TX queue is full, the driver can still queue to an internal sw queue with some limit (say 1000 for ethernet, as is used now). When the hw TX queue gains space, the driver self-batches packets from the sw queue to the hw queue. It sort of obviates the need for mid-level queue batching in the generic networking. Compared to letting the driver self-batch, the mid-level batching approach is pure overhead. We seem to be sort of all mentioning similar ideas. For example, you can get the above kind of scheme today by using a mid-level queue length of zero, and I believe this idea was mentioned by Stephen Hemminger earlier. I may experiment with this in the NIU driver.
Re: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch
And to re-start this discussion, I think we should separate the maximum number of QPs from whether we use SRQ, and let the QP type (UD, UC, RC) be controllable. Smaller clusters may perform better without using SRQ, even if it is available. And supporting UC versus RC seems like it should only take a few additional lines of code. Actually supporting UC is trickier than it seems, at least for the SRQ case, since attaching UC QPs to an SRQ requires that the IB spec be extended to allow that (and also define some semantics for how to handle messages that encounter an error in the middle of being received, after a work request has been taken from the SRQ). I agree with Roland that we need to come up with the correct user interface here, and I'm not convinced that what we have is the most adaptable for where the code could go. What about replacing the 2 proposed parameters with these 3? qp_type - ud, uc, rc use_srq - yes/no (default if available) max_conn_qp - uc or rc limit I don't think we want the qp_type to be a module parameter -- it seems we already have ud vs. rc handled via the parameter that enables connected mode, and if we want to enable uc we should do that in a similar per-interface way. Similarly if there's any point to making use_srq something that can be controlled, ideally it should be per-interface. But this could be tricky because it may be hard to change at runtime. (Ideally max_conn_qp would be per-interface too but that seems too hard as well) I do agree that the memory limit just seems arbitrary and we can probably do away with that. - R.
[ofa-general] Re: [PATCH 1/4] IPoIB: Fix unused variable warning
David Miller wrote: From: Roland Dreier [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 15:46:13 -0700 The conversion to use netdevice internal stats left an unused variable in ipoib_neigh_free(), since there's no longer any reason to get netdev_priv() in order to increment dropped packets. Delete the unused priv variable. Signed-off-by: Roland Dreier [EMAIL PROTECTED] Jeff, do you want to merge in Roland's 4 patches to your tree then do a sync with me so I can pull it all in from you? Grabbing them now...
[ofa-general] Re: [PATCH 1/4] IPoIB: Fix unused variable warning
Roland Dreier wrote: The conversion to use netdevice internal stats left an unused variable in ipoib_neigh_free(), since there's no longer any reason to get netdev_priv() in order to increment dropped packets. Delete the unused priv variable. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) applied 1-4
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Andi Kleen [EMAIL PROTECTED] Date: Wed, 10 Oct 2007 02:37:16 +0200 On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote: We have to keep in mind, however, that the sw queue right now is 1000 packets. I heavily discourage any driver author to try and use any single TX queue of that size. Why would you discourage them? If 1000 is ok for a software queue why would it not be ok for a hardware queue? Because with the software queue, you aren't accessing 1000 slots shared with the hardware device which does shared-ownership transactions on those L2 cache lines with the cpu. Long ago I did a test on gigabit on a cpu with only 256K of L2 cache. Using a smaller TX queue make things go faster, and it's exactly because of these L2 cache effects. 1000 packets is a lot. I don't have hard data, but gut feeling is less would also do. I'll try to see how backlogged my 10Gb tests get when a strong sender is sending to a weak receiver. And if the hw queues are not enough a better scheme might be to just manage this in the sockets in sendmsg. e.g. provide a wait queue that drivers can wake up and let them block on more queue. TCP does this already, but it operates in a lossy manner. I don't really see the advantage over the qdisc in that scheme. It's certainly not simpler and probably more code and would likely also not require less locks (e.g. a currently lockless driver would need a new lock for its sw queue). Also it is unclear to me it would be really any faster. You still need a lock to guard hw TX enqueue from hw TX reclaim. A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you increase the size much more performance starts to go down due to L2 cache thrashing.
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote: We have to keep in mind, however, that the sw queue right now is 1000 packets. I heavily discourage any driver author to try and use any single TX queue of that size. Why would you discourage them? If 1000 is ok for a software queue why would it not be ok for a hardware queue? Which means that just dropping on back pressure might not work so well. Or it might be perfect and signal TCP to backoff, who knows! :-) 1000 packets is a lot. I don't have hard data, but gut feeling is less would also do. And if the hw queues are not enough a better scheme might be to just manage this in the sockets in sendmsg. e.g. provide a wait queue that drivers can wake up and let them block on more queue. The idea is that the network stack, as in the pure hw queue scheme, unconditionally always submits new packets to the driver. Therefore even if the hw TX queue is full, the driver can still queue to an internal sw queue with some limit (say 1000 for ethernet, as is used now). When the hw TX queue gains space, the driver self-batches packets from the sw queue to the hw queue. I don't really see the advantage over the qdisc in that scheme. It's certainly not simpler and probably more code and would likely also not require less locks (e.g. a currently lockless driver would need a new lock for its sw queue). Also it is unclear to me it would be really any faster. -Andi
Re: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch
I do agree that the memory limit just seems arbitrary and we can probably do away with that. We discussed this previously and had agreed upon limiting the memory footprint to 1GB by default. This module parameter was for larger systems that had plenty of memory and could afford to use more. This way the sys admin could increase the limit. Hence I am not really in favour of removing this. Pradeep
Re: [ofa-general] SDP ?
Hi Jim, Thanks, tried early on with -D_XPG4_2, things went from bad to worse, I'll look at switching from int to void. Jim // Jim Mott wrote: That should work fine. You might be able to build with -D_XPG4_2 as well. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Langston Sent: Tuesday, October 09, 2007 10:13 AM To: general@lists.openfabrics.org Subject: [ofa-general] SDP ? Hi all, I'm working on porting SDP to OpenSolaris and am looking at a compile error that I get. Essentially, I have a conflict of types on the compile: bash-3.00$ /opt/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I.. -g -D_POSIX_PTHREAD_SEMANTICS -DSYSCONFDIR=\/usr/local/etc\ -g -D_POSIX_PTHREAD_SEMANTICS -c port.c -KPIC -DPIC -o .libs/port.o port.c, line 1896: identifier redeclared: getsockname current : function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to unsigned int) returning int previous: function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to void) returning int : /usr/include/sys/socket.h, line 436 Line 436 in /usr/include/sys/socket.h extern int getsockname(int, struct sockaddr *_RESTRICT_KYWD, Psocklen_t); and Psocklen_t #if defined(_XPG4_2) || defined(_BOOT) typedef socklen_t *_RESTRICT_KYWD Psocklen_t; #else typedef void*_RESTRICT_KYWD Psocklen_t; #endif /* defined(_XPG4_2) || defined(_BOOT) */ Do I need to change port.c getsockname to type void * ? Thanks, Jim ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- / Jim Langston Sun Microsystems, Inc. 
(877) 854-5583 (AccessLine) AIM: jl9594 [EMAIL PROTECTED]
[ofa-general] Re: [PATCH 01/23] IB/ipath -- iba6110 rev4 GPIO counters support
hi roland, i didn't realize it was such a PITA for you to take so many at once. i'll make sure to do them in smaller chunks from now on. thanks for taking these... arthur On Tue, Oct 09, 2007 at 02:55:31PM -0700, Roland Dreier wrote: OK, I'll grudgingly merge these patch, even though they all arrived on the exact day that Linus released 2.6.23... but you guys really need to fix your development process so you don't accumulate a huge bolus of patches that you then vomit out. In the future I'm not going to accept giant merges like this -- please send your patches as soon as you've accumulated say 5 or 10. - R.
[ofa-general] Re: [PATCH 01/23] IB/ipath -- iba6110 rev4 GPIO counters support
hi roland, i didn't realize it was such a PITA for you to take so many at once. i'll make sure to do them in smaller chunks from now on. Thanks. The reason it's a pain is that it's a lot harder to review a ton of patches when they come late like this. Just send the patches as you write them and you have less of a queue to worry about and I can manage my queue a lot better. - R.
[ofa-general] IPoIB CM (NOSRQ) [Patch V9] revised
This revised version incorporates Sean's comments. The module parameters are unchanged except the restriction on max_rc_qp (that it should be power of 2) has been removed. This patch has been tested with linux-2.6.23-rc7 (derived from Roland's for-2.6.24 git tree) on ppc64 machines using IBM HCA. Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED] --- --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-03 12:01:58.0 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-09 19:42:51.0 -0500 @@ -69,6 +69,7 @@ enum { IPOIB_TX_RING_SIZE= 64, IPOIB_MAX_QUEUE_SIZE = 8192, IPOIB_MIN_QUEUE_SIZE = 2, + IPOIB_MAX_RC_QP = 4096, IPOIB_NUM_WC = 4, @@ -95,11 +96,13 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, }; +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) #defineIPOIB_OP_RECV (1ul 31) + #ifdef CONFIG_INFINIBAND_IPOIB_CM -#defineIPOIB_CM_OP_SRQ (1ul 30) +#defineIPOIB_CM_OP_RECV (1ul 30) #else -#defineIPOIB_CM_OP_SRQ (0) +#defineIPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -186,11 +189,14 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp*qp; - struct list_head list; - struct net_device *dev; - unsigned longjiffies; + struct ib_cm_id *id; + struct ib_qp*qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by no srq only */ + struct list_head list; + struct net_device *dev; + unsigned longjiffies; + u32 index; /* wr_ids are distinguished by index +* to identify the QP -no srq only */ enum ipoib_cm_state state; }; @@ -235,6 +241,8 @@ struct ipoib_cm_dev_priv { struct ib_wcibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* @@ -458,6 +466,7 @@ void ipoib_drain_cq(struct net_device *d /* We don't support UC connections at the moment */ #define IPOIB_CM_SUPPORTED(ha) (ha[0] (IPOIB_FLAGS_RC)) +extern int max_rc_qp; static inline int ipoib_cm_admin_enabled(struct net_device *dev) 
{ struct ipoib_dev_priv *priv = netdev_priv(dev); --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-31 12:14:30.0 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-09 21:15:25.0 -0500 @@ -49,6 +49,18 @@ MODULE_PARM_DESC(cm_data_debug_level, #include ipoib.h +int max_rc_qp = 128; +static int max_recv_buf = 1024; /* Default is 1024 MB */ + +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0444); +MODULE_PARM_DESC(nosrq_max_rc_qp, Max number of no srq RC QPs supported); + +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); +MODULE_PARM_DESC(max_receive_buffer, Max Receive Buffer Size in MB); + +static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for no srq */ + +#define NOSRQ_INDEX_MASK (0xfff) /* This corresponds to a max of 4096 QPs for no srq */ #define IPOIB_CM_IETF_ID 0x1000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -81,20 +93,21 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv-ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int post_receive_srq(struct net_device *dev, u64 id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; int i, ret; - priv-cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv-cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; for (i = 0; i IPOIB_CM_RX_SG; ++i) priv-cm.rx_sge[i].addr = priv-cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv-cm.srq, priv-cm.rx_wr, bad_wr); if (unlikely(ret)) { - ipoib_warn(priv, post srq failed for buf %d (%d)\n, id, ret); + ipoib_warn(priv, post srq failed for buf %lld (%d)\n, + (unsigned long long)id, ret); ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, priv-cm.srq_ring[id].mapping); dev_kfree_skb_any(priv-cm.srq_ring[id].skb); @@ -104,12 +117,47 @@ static int ipoib_cm_post_receive(struct return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static int 
post_receive_nosrq(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + u32 index;
[ofa-general] nightly osm_sim report 2007-10-10:normal completion
OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-09 OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: