Re: [openib-general] [RFC] IB/ipoib: Asynchronous events delivered without port parameter.
Quoting Moni Levy [EMAIL PROTECTED]: Subject: [RFC] IB/ipoib: Asynchronous events delivered without port parameter. Hello, I did a short code review of the ipoib code concentrating on partitioning support and I mentioned that the asynchronous events handler in the ipoib code does not take the port number reported in the event record into consideration. The effect of that is that all of the ib# devices related to that specific HCA are flushed when it seems to me that only the relevant port one should be. Is that done on purpose, or am I missing something ? Thanks, Moni p.s. I'm working on a patch that should solve another issue caused by PKEY reordering ipoib behavior and the above issue further complicates things for me. If true, why is this a problem? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [RFC] [PATCH v2] IB/ipoib: Add bonding support to IPoIB
Quoting Moni Shoua [EMAIL PROTECTED]: Subject: Re: [RFC] [PATCH v2] IB/ipoib: Add bonding support to IPoIB Thanks for the comments To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one. It seems that in this design, if multiple ipoib interfaces are present, we might get an skb such that skb-dev will be different from the new dev field in struct ipoib_neigh. It seems that the result will be that the packet will be sent on a wrong interface. Right? I don't see how. The field dev in ipoib_neigh doesn't take part in interface selection. As I see it, skb travels this path: 1. Passed to bond_dev-hard_start_xmit 2. bond_dev-hard_start_xmit chooses the current active interface, changes skb-dev and enqueues it back for xmittig. ipoib_neigh ah field includes struct ib_ah *. This selects important parameters which depend on both packet source and destination interfaces. I think the right thing might be to compare ipoib_neigh dev and skb-dev, and destroy ipoib_neigh if these do not match. In addition, if an IPoIB device is removed before bonding is unloaded it may cause bond0 neighbours (neighbours that point to bond0) to exist after the IPoIB device no longer exist. This is why a neighbour cleanup is required during device cleanup. This cleanup scans the arp cache and the ndisc cache to find there neighbours of bond0 which refer also to the relevant ibX. Also, when ib_ipoib module is unloaded, the neighbour destructor must be set to NULL because the neighbour function is in ib_ipoib. For this neigh table cleanup, it is required to export the symbol nd_tbl just like the symbol arp_tbl is. I wonder about this: is it really true that any allocated neighbour is always in either arp_tbl or nd_tbl? For example, could some code have called neigh_hold and retained a neighbour that is not in either one of these tables? I got the assumption about neighbours living in one of these 2 tables from observation and code reading. I preferred that that on keeping track of all ipoib_neighs and putting them in a list. However, I could do that instead of neigh_table scanning. Do you think it's better? If some neighbours are not on any tables, it seems using our own lists (e.g. lists we have in ipoib_path) is the only option, no? For the example... I didn't understand it. Could you please explain? grep for neigh_hold. neighbour is only destroyed when ref count goes to 0. If some code does neigh_hold, it seems neighbour could be removed from table but destructor not yet called. During my tests I found that when running 1. modprobe -r ib_mthca (to delete IPoIB interfaces) 2. ping somewhere on the subnet of bond0 I get this stack dump (which ends with kernel death) [8037ff32] skb_under_panic+0x5c/0x60 [882e00c2] :ib_ipoib:ipoib_hard_header+0xa6/0xc0 [803c3c98] arp_create+0x120/0x226 [803c3dc3] arp_send+0x25/0x3b [803c466a] arp_solicit+0x186/0x195 [8038c0ac] neigh_timer_handler+0x2b5/0x309 [8038bdf7] neigh_timer_handler+0x0/0x309 [80239599] run_timer_softirq+0x130/0x19e [80235fcc] __do_softirq+0x55/0xc3 [8020acac] call_softirq+0x1c/0x28 [8020c02b] do_softirq+0x2c/0x7d [8021864a] smp_apic_timer_interrupt+0x57/0x6a [80208e19] mwait_idle+0x0/0x45 [8020a756] apic_timer_interrupt+0x66/0x70 EOI [80208e5b] mwait_idle+0x42/0x45 [80208db1] cpu_idle+0x8b/0xae [80217d60] start_secondary+0x47f/0x48f The only way I found to avoid this (for now) is to check skb headroom in ipoib_hard_header. I guess that this safety check doesn't harm regular IPoIB operation and it seems to solve my problem. However, I would be happy to hear what others think of this last issue. As I said, this seems to indicate a problem in the bonding code. But what will happen after you error out in ipoib_hard_header? Is the packet dropped? What might break as a result? I will check the hard_header_len issue in the bonding code more carefully. From first look it seems that bonding does borrow the hard_header_len. So where does a shorter message come from? Also, my checks show that it is safe to return with error from hard_header(). For example, in neigh_connected_output: err = dev-hard_header(skb, dev, ntohs(skb-protocol), neigh-ha, NULL, skb-len); read_unlock_bh(neigh-lock); if (err = 0) err = neigh-ops-queue_xmit(skb); else { err = -EINVAL; kfree_skb(skb); I would really appreciate comments. thanks -MoniS -- MST ___ openib-general
Re: [openib-general] [PATCHv2] IB/ipoib: Fix ipoib handling for pkey reordering
I just gave this a cursory glance. A suggestion: would it not be much simpler to modify the QP from RTS to RTS on pkey change? diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index f2aa923..b0287c1 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -415,21 +415,22 @@ int ipoib_ib_dev_open(struct net_device ret = ipoib_init_qp(dev); if (ret) { - ipoib_warn(priv, ipoib_init_qp returned %d\n, ret); + if (ret != -ENOENT) + ipoib_warn(priv, ipoib_init_qp returned %d\n, ret); return -1; } What's the reason for this? @@ -993,6 +993,7 @@ static void ipoib_setup(struct net_devic INIT_DELAYED_WORK(priv-pkey_task,ipoib_pkey_poll); INIT_DELAYED_WORK(priv-mcast_task, ipoib_mcast_join_task); INIT_WORK(priv-flush_task, ipoib_ib_dev_flush); + INIT_WORK(priv-flush_restart_qp_task, ipoib_ib_dev_flush_restart_qp); INIT_WORK(priv-restart_task, ipoib_mcast_restart_task); INIT_DELAYED_WORK(priv-ah_reap_task, ipoib_reap_ah); } Shorter name? diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index b303ce6..27d6fd4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -232,9 +232,10 @@ static int ipoib_mcast_join_finish(struc ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast-mcmember.mlid), mcast-mcmember.mgid); if (ret 0) { - ipoib_warn(priv, couldn't attach QP to multicast group -IPOIB_GID_FMT \n, -IPOIB_GID_ARG(mcast-mcmember.mgid)); + if (ret != -ENXIO) /* No PKEY found */ + ipoib_warn(priv, couldn't attach QP to multicast group +IPOIB_GID_FMT \n, +IPOIB_GID_ARG(mcast-mcmember.mgid)); clear_bit(IPOIB_MCAST_FLAG_ATTACHED, mcast-flags); return ret; @@ -312,7 +313,7 @@ ipoib_mcast_sendonly_join_complete(int s status = ipoib_mcast_join_finish(mcast, multicast-rec); if (status) { - if (mcast-logcount++ 20) + if (mcast-logcount++ 20 status != -ENXIO) ipoib_dbg_mcast(netdev_priv(dev), multicast join failed for IPOIB_GID_FMT , status %d\n, IPOIB_GID_ARG(mcast-mcmember.mgid), status); @@ -416,7 +417,7 @@ static int ipoib_mcast_join_complete(int , status %d\n, IPOIB_GID_ARG(mcast-mcmember.mgid), status); - } else { + } else if (status != -ENXIO) { ipoib_warn(priv, multicast join failed for IPOIB_GID_FMT , status %d\n, IPOIB_GID_ARG(mcast-mcmember.mgid), diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 3cb551b..d0384ea 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -52,8 +52,10 @@ int ipoib_mcast_attach(struct net_device if (ib_find_cached_pkey(priv-ca, priv-port, priv-pkey, pkey_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, priv-flags); ret = -ENXIO; + ipoib_dbg(priv, PKEY %X not found\n, priv-pkey); goto out; } + ipoib_dbg(priv, PKEY %X found at index %d\n, priv-pkey, pkey_index); set_bit(IPOIB_PKEY_ASSIGNED, priv-flags); /* set correct QKey for QP */ Make it PKey or pkey: no text in uppercase in log messages please. @@ -105,9 +107,11 @@ int ipoib_init_qp(struct net_device *dev */ ret = ib_find_cached_pkey(priv-ca, priv-port, priv-pkey, pkey_index); if (ret) { + ipoib_dbg(priv, PKEY %X not found.\n, priv-pkey); clear_bit(IPOIB_PKEY_ASSIGNED, priv-flags); return ret; } + ipoib_dbg(priv, PKEY %X found at index %d.\n, priv-pkey, pkey_index); set_bit(IPOIB_PKEY_ASSIGNED, priv-flags); qp_attr.qp_state = IB_QPS_INIT; going a bit overboard on the number of debug messages here. @@ -260,12 +264,14 @@ void ipoib_event(struct ib_event_handler container_of(handler, struct ipoib_dev_priv, event_handler); if (record-event == IB_EVENT_PORT_ERR|| - record-event == IB_EVENT_PKEY_CHANGE || record-event == IB_EVENT_PORT_ACTIVE || record-event == IB_EVENT_LID_CHANGE ||
Re: [openib-general] [PATCHv2] IB/ipoib: Fix ipoib handling for pkey reordering
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCHv2] IB/ipoib: Fix ipoib handling for pkey reordering I just gave this a cursory glance. I haven't really read it except to think why is this so complicated? A suggestion: would it not be much simpler to modify the QP from RTS to RTS on pkey change? Changing the P_Key index is not allowed for RTS-RTS. You would have to modify the QP RTS-SQD, wait for the SQ to drain, then modify the P_Key index with SQD-SQD, and finally go SQD-RTS. True, I misread the spec. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [RFC] IB/ipoib: Asynchronous events delivered without port parameter.
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [RFC] IB/ipoib: Asynchronous events delivered without port parameter. I did a short code review of the ipoib code concentrating on partitioning support and I mentioned that the asynchronous events handler in the ipoib code does not take the port number reported in the event record into consideration. The effect of that is that all of the ib# devices related to that specific HCA are flushed when it seems to me that only the relevant port one should be. Is that done on purpose, or am I missing something ? I don't think there's any particular reason the code is that way except for the oversight never being corrected. But it looks trivial to fix, like the patch below. Does that look right to you? p.s. I'm working on a patch that should solve another issue caused by PKEY reordering ipoib behavior and the above issue further complicates things for me. Why not fix the issue first then? commit a27cbe878203076247c1b5287f5ab59ed143b560 Author: Roland Dreier [EMAIL PROTECTED] Date: Tue Feb 27 07:37:49 2007 -0800 IPoIB: Only handle async events for one port An asynchronous event carries the port number that the event occurred on, so there's no reason for an IPoIB interface to process an event associated with a different local HCA port. Signed-off-by: Roland Dreier [EMAIL PROTECTED] diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 3cb551b..7f3ec20 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -259,12 +259,13 @@ void ipoib_event(struct ib_event_handler *handler, struct ipoib_dev_priv *priv = container_of(handler, struct ipoib_dev_priv, event_handler); - if (record-event == IB_EVENT_PORT_ERR|| - record-event == IB_EVENT_PKEY_CHANGE || - record-event == IB_EVENT_PORT_ACTIVE || - record-event == IB_EVENT_LID_CHANGE || - record-event == IB_EVENT_SM_CHANGE || - record-event == IB_EVENT_CLIENT_REREGISTER) { + if ((record-event == IB_EVENT_PORT_ERR|| + record-event == IB_EVENT_PKEY_CHANGE || + record-event == IB_EVENT_PORT_ACTIVE || + record-event == IB_EVENT_LID_CHANGE || + record-event == IB_EVENT_SM_CHANGE || + record-event == IB_EVENT_CLIENT_REREGISTER) + record-element.port_num == priv-port) { ipoib_dbg(priv, Port state change event\n); queue_work(ipoib_workqueue, priv-flush_task); } Looks good. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
Quoting Sean Hefty [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 Please send patches that will be added to kernel_patches/fixes. Please update your git tree from git://git.openfabrics.org/~vlad/ofed_1_2/.git ofed_1_2 You want me to create a patch that adds a file that contains the actual patches? Why not apply the patches directly? That's the ofed structure, this was discussed multiple times already. The point is to keep all changes to upstream components separate, to make updating to upstream kernel trivial in the future. Worked quite well for OFED 1.1 - 1.2 transition. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
Quoting Sean Hefty [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 Yes, actual patches should be created under kernel_patches/fixes. Please update your git tree because the following patch fails: Can you explain how the patch fails? I don't see how putting the patch into a file helps. Try applying it? Why not apply the patches directly? To be consistent with 2.6.20 kernel. You can check out stock 2.6.20 using a tag. Why maintain the ofed code in git if you don't use it to track patches? Basically so that conflicts in future merges from upstream are easy to resolve. If you like, let's reopen this for 1.3. We are after freeze in OFED 1.2. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [openib-general] [PATCH] for OFED 1.2 On Tue, 2007-02-27 at 18:55 +0200, Michael S. Tsirkin wrote: Quoting Sean Hefty [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 Please send patches that will be added to kernel_patches/fixes. Please update your git tree from git://git.openfabrics.org/~vlad/ofed_1_2/.git ofed_1_2 You want me to create a patch that adds a file that contains the actual patches? Why not apply the patches directly? That's the ofed structure, this was discussed multiple times already. The point is to keep all changes to upstream components separate, to make updating to upstream kernel trivial in the future. Worked quite well for OFED 1.1 - 1.2 transition. Having these patches as files is painful for every developer because they cannot create a patch against ofed_1_2/drivers/infiniband/* nor the kernel.org upstream tree. Did you try using quilt which makes managing patch stacks quite easy? If you have quilt installed, OFED scripts actually use it to apply patches, so things are easy. They need to apply all the current patches and then create a patch on top of that. Or hope the patch applies fuzzily. One point I can't stress enough: whatever way you create a patch, developers are expected to build and test it in OFED environment before posting. I think with stacked git or just git and rebasing at key times, you could keep an ofed_1_2 tree that folks can easily apply patches to... Its too late to change this for 1.2, but you might want to reconsider the design for 1.3. Well, I experimented with git rebase and it is unfortunately still fragile at this point. I agree using stacked git might be a good idea, I just did not have the chance to experiment with it enough. I had an impression that publishing stg managed branch creates problems for whoever attempts to track it, but I might be wrong. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
Quoting Sean Hefty [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 I think with stacked git or just git and rebasing at key times, you could keep an ofed_1_2 tree that folks can easily apply patches to... Its too late to change this for 1.2, but you might want to reconsider the design for 1.3. Can't we just create a new branch (ofed_1_2_patched) with these patches already applied and in the correct order? Then what we do when we want to update to new upstream? Throw this branch away? As it is, I just pull then build and remove patches that conflict. By the way, there are backport patches, etc - it is still incorrect to say that you would be able to generate a patch out of git and know it's a good one without test-build. Maybe I'm just not understanding the work flow here... Sean, please install quilt and try using it for working with the system. Adding new patch is usually done in this way quilt new patch quilt add files edit quilt refresh cp patches/patch kernel_patches/fixes/ git add kernel_patches/fixes/patch git commit kernel_patches/fixes/patch -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
Lot's of stuff *is* in wiki already - did you look at pages Vlad created? Things can always be improved, you can add stuff too. Quoting Jeff Squyres [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 It would be great if all of this knowledge is posted to the wiki to avoid repeating this conversation in the future (or one of countless variations of this conversation). For example, I admit to not paying close attention to many of the threads on this list, but this was the first time I'd head of quilt. Specifically: if there are tools and methods that are helpful for OFA/ OFED development, they should be detailed on the wiki. The wiki is where all permanent knowledge should be posted. This is just my $0.01... On Feb 27, 2007, at 12:31 PM, Michael S. Tsirkin wrote: Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [openib-general] [PATCH] for OFED 1.2 On Tue, 2007-02-27 at 18:55 +0200, Michael S. Tsirkin wrote: Quoting Sean Hefty [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 Please send patches that will be added to kernel_patches/fixes. Please update your git tree from git://git.openfabrics.org/~vlad/ofed_1_2/.git ofed_1_2 You want me to create a patch that adds a file that contains the actual patches? Why not apply the patches directly? That's the ofed structure, this was discussed multiple times already. The point is to keep all changes to upstream components separate, to make updating to upstream kernel trivial in the future. Worked quite well for OFED 1.1 - 1.2 transition. Having these patches as files is painful for every developer because they cannot create a patch against ofed_1_2/drivers/infiniband/* nor the kernel.org upstream tree. Did you try using quilt which makes managing patch stacks quite easy? If you have quilt installed, OFED scripts actually use it to apply patches, so things are easy. They need to apply all the current patches and then create a patch on top of that. Or hope the patch applies fuzzily. One point I can't stress enough: whatever way you create a patch, developers are expected to build and test it in OFED environment before posting. I think with stacked git or just git and rebasing at key times, you could keep an ofed_1_2 tree that folks can easily apply patches to... Its too late to change this for 1.2, but you might want to reconsider the design for 1.3. Well, I experimented with git rebase and it is unfortunately still fragile at this point. I agree using stacked git might be a good idea, I just did not have the chance to experiment with it enough. I had an impression that publishing stg managed branch creates problems for whoever attempts to track it, but I might be wrong. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/ openib-general -- Jeff Squyres Server Virtualization Business Unit Cisco Systems ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
This is just my $0.01... Thanks for the suggestions, but what does $0.01 buy one in US today? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 On Tue, 2007-02-27 at 19:44 +0200, Michael S. Tsirkin wrote: Quoting Sean Hefty [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 I think with stacked git or just git and rebasing at key times, you could keep an ofed_1_2 tree that folks can easily apply patches to... Its too late to change this for 1.2, but you might want to reconsider the design for 1.3. Can't we just create a new branch (ofed_1_2_patched) with these patches already applied and in the correct order? Then what we do when we want to update to new upstream? Throw this branch away? As it is, I just pull then build and remove patches that conflict. By the way, there are backport patches, etc - it is still incorrect to say that you would be able to generate a patch out of git and know it's a good one without test-build. Maybe I'm just not understanding the work flow here... Sean, please install quilt and try using it for working with the system. Adding new patch is usually done in this way quilt new patch quilt add files edit quilt refresh cp patches/patch kernel_patches/fixes/ git add kernel_patches/fixes/patch git commit kernel_patches/fixes/patch NOTE: The key to the above process is the assumption that the developer maintains _all_ of the existing patches from kernel_patches/ on top of the ofed_1_2 tree using quilt or stg. Otherwise quilt/stg isn't buying you anything. OFED will do this automatically. And this doesn't take into account backports. The process works with backport patches too: you just have to do this quilt pop -a quilt new patch quilt add files edit quilt refresh quilt push -a -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
This is just my $0.01... It buys very little, if anything. In fact, a whole $0.02 also buys very little, if anything. So take my comments for what they're worth. Oh, good, I thought deflation is getting out of hand ... -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
Lot's of stuff *is* in wiki already - did you look at pages Vlad created? A search for quilt on the wiki turns up nothing (I checked before I posted :-) ). And yes, I have [thoroughly] read the pages Vlad created. But the very fact that this conversation is occurring is because either the information is not on the wiki or what is on the wiki is not clear. Otherwise, I suspect that you simply would have pointed Steve to the wiki and said Please read the fine manual at http://;. You are right in that, I don't disclaim it. Thanks for the suggestion, I'll try to find the time to add this to wiki. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 Sean, please install quilt and try using it for working with the system. Adding new patch is usually done in this way quilt new patch quilt add files edit quilt refresh cp patches/patch kernel_patches/fixes/ git add kernel_patches/fixes/patch git commit kernel_patches/fixes/patch NOTE: The key to the above process is the assumption that the developer maintains _all_ of the existing patches from kernel_patches/ on top of the ofed_1_2 tree using quilt or stg. Otherwise quilt/stg isn't buying you anything. OFED will do this automatically. uh, can you explain this? Given I have a freshly cloned ofed_1_2 git tree, and I want to change cma.c (a good one cuz there are patches). What do I do? There's no quilt stack at all at this point. Right? Try running the configure script. After this, quilt applied will show what patches are applied. And this doesn't take into account backports. The process works with backport patches too: you just have to do this quilt pop -a quilt new patch quilt add files edit quilt refresh quilt push -a But you cannot keep a stack for more than one backport pushed, right? So you still need to be slapping the stacks of patches around for each backport. Or maybe I'm confused? Yes. Fortunately it's not too hard: you can do quilt pop -a and re-run configure for another kernel. Of course for testing the patch, it is easier to commit the change in your tree and then to use openfabrics cross-build functionality that will clone this tree and build for multiple arches/kernels. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
Quoting Sean Hefty [EMAIL PROTECTED]: Subject: RE: [PATCH] for OFED 1.2 But you cannot keep a stack for more than one backport pushed, right? So you still need to be slapping the stacks of patches around for each backport. Why not have separate branches for each kernels too? I think it'll be much more work to maintain all these branches. And again, there will be conflicts, and it's too easy to get confused when resolving a conflict. With patches we have scripts to automate this. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] for OFED 1.2
Quoting Sasha Khapyorsky [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 On 19:44 Tue 27 Feb , Michael S. Tsirkin wrote: Quoting Sean Hefty [EMAIL PROTECTED]: Subject: Re: [PATCH] for OFED 1.2 I think with stacked git or just git and rebasing at key times, you could keep an ofed_1_2 tree that folks can easily apply patches to... Its too late to change this for 1.2, but you might want to reconsider the design for 1.3. Can't we just create a new branch (ofed_1_2_patched) with these patches already applied and in the correct order? Then what we do when we want to update to new upstream? Throw this branch away? As it is, I just pull then build and remove patches that conflict. You can save this branch as branch-name-upstream-name (or better) and to rebase branch-name to the new upstream. rebase does not seem to be too robust when run on such a large repository as the linux kernel. Maybe stacked git will work. By the way, there are backport patches, etc - it is still incorrect to say that you would be able to generate a patch out of git and know it's a good one without test-build. In similar way you can track backport patch sets as branches. At the moment it seems like a lot of work. Again, maybe stg makes it easy, I know it's hard with plain git. And I think lots of people (including me) will be confused if we have a ton of branches. Maybe I'm just not understanding the work flow here... Sean, please install quilt and try using it for working with the system. Adding new patch is usually done in this way quilt new patch quilt add files edit quilt refresh cp patches/patch kernel_patches/fixes/ git add kernel_patches/fixes/patch git commit kernel_patches/fixes/patch This looks strange for me to track patches against patches... One gets used to it :) Seriously, we have these patches, and we want to version them together with source they are intended to apply to. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] Fwd: [ANNOUNCE] GIT 1.5.0.2
FYI. - Forwarded message from Junio C Hamano [EMAIL PROTECTED] - Subject: [ANNOUNCE] GIT 1.5.0.2 Date: Tue, 27 Feb 2007 10:58:22 +0200 In-Reply-To: [EMAIL PROTECTED] (Junio C. Hamano'smessage of Sun, 18 Feb 2007 18:07:42 -0800) References: [EMAIL PROTECTED] From: Junio C Hamano [EMAIL PROTECTED] The latest maintenance release GIT 1.5.0.2 is available at the usual places: http://www.kernel.org/pub/software/scm/git/ git-1.5.0.2.tar.{gz,bz2} (tarball) git-htmldocs-1.5.0.2.tar.{gz,bz2} (preformatted docs) git-manpages-1.5.0.2.tar.{gz,bz2} (preformatted docs) RPMS/$arch/git-*-1.5.0.2-1.$arch.rpm (RPM) GIT v1.5.0.2 Release Notes == Fixes since v1.5.0.1 * Bugfixes - Automated merge conflict handling when changes to symbolic links conflicted were completely broken. The merge-resolve strategy created a regular file with conflict markers in it in place of the symbolic link. The default strategy, merge-recursive was even more broken. It removed the path that was pointed at by the symbolic link. Both of these problems have been fixed. - 'git diff maint master next' did not correctly give combined diff across three trees. - 'git fast-import' portability fix for Solaris. - 'git show-ref --verify' without arguments did not error out but segfaulted. - 'git diff :tracked-file `pwd`/an-untracked-file' gave an extra slashes after a/ and b/. - 'git format-patch' produced too long filenames if the commit message had too long line at the beginning. - Running 'make all' and then without changing anything running 'make install' still rebuilt some files. This was inconvenient when building as yourself and then installing as root (especially problematic when the source directory is on NFS and root is mapped to nobody). - 'git-rerere' failed to deal with two unconflicted paths that sorted next to each other. - 'git-rerere' attempted to open(2) a symlink and failed if there was a conflict. Since a conflicting change to a symlink would not benefit from rerere anyway, the command now ignores conflicting changes to symlinks. - 'git-repack' did not like to pass more than 64 arguments internally to underlying 'rev-list' logic, which made it impossible to repack after accumulating many (small) packs in the repository. - 'git-diff' to review the combined diff during a conflicted merge were not reading the working tree version correctly when changes to a symbolic link conflicted. It should have read the data using readlink(2) but read from the regular file the symbolic link pointed at. - 'git-remote' did not like period in a remote's name. * Documentation updates - added and clarified core.bare, core.legacyheaders configurations. - updated git-clone --depth documentation. * Assorted git-gui fixes. - To unsubscribe from this list: send the line unsubscribe git in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - End forwarded message - -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IPOIB NAPI
Quoting Shirley Ma [EMAIL PROTECTED]: Subject: Re: [openib-general] IPOIB NAPI Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 02:41:44 PM: So the IBV_CQ_REPORT_MISSED_EVENTS has been part of OFED-1.2 already? I can generate the patch for all ULPs to use this for review. Do you need me to do that? No, it's not in OFED 1.2 or the upstream kernel. And no one has implemented it for userspace (and I'm somewhat reluctant to break the ABI at this point without some performance numbers to motivate making this API change). Have the NAPI performance problems with ehca been resolved? We could probably merge IPoIB NAPI for 2.6.22 then, which would pull in the kernel changes at least. - R. We have addressed the NAPI performance issues with ehca driver. I believe the patches have been upper stream. However the test results show that it's better to delay poll again to next NAPI interval, something like this: poll-cq notify-cq, if missed_event netif_rx_reschedule() return 1 vs. poll-cq, notify-cq, if missed_event netif_rx_reschedule() poll again return 0 It seems ehca delivering packet much faster than other HCAs. So poll again would stay in the loop for many many times. So the above changes doesn't impact other HCAs, I would recommand it. I saw same implementations on other ethernet drivers. I'm confused. Which one is faster? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [Bug 400] OFED 1.2 alpha1 IPoIB HA failover gets QP warnings
ib1: dev_queue_xmit failed to requeue packet ib_mthca :04:00.0: QP 000405 not found in MGM ib1: ib_detach_mcast failed (result = -22) ib1: ipoib_mcast_detach failed (result = -22) Looks like this is related to the multicast change that recently went upstream. So this likely affects upstream IPoIB as well. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[OFA General] Re: IPOIB NAPI
Quoting Shirley Ma [EMAIL PROTECTED]: Subject: Re: IPOIB NAPI oland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 02:41:44 PM: So the IBV_CQ_REPORT_MISSED_EVENTS has been part of OFED-1.2 already? I can generate the patch for all ULPs to use this for review. Do you need me to do that? No, it's not in OFED 1.2 or the upstream kernel. And no one has implemented it for userspace (and I'm somewhat reluctant to break the ABI at this point without some performance numbers to motivate making this API change). Have the NAPI performance problems with ehca been resolved? We could probably merge IPoIB NAPI for 2.6.22 then, which would pull in the kernel changes at least. - R. We have addressed the NAPI performance issues with ehca driver. I believe the patches have been upper stream. However the test results show that it's better to delay poll again to next NAPI interval, something like this: poll-cq notify-cq, if missed_event netif_rx_reschedule() return 1 vs. poll-cq, notify-cq, if missed_event netif_rx_reschedule() poll again return 0 It seems ehca delivering packet much faster than other HCAs. So poll again would stay in the loop for many many times. So the above changes doesn't impact other HCAs, I would recommand it. I saw same implementations on other ethernet drivers. I have not benchmarked this, but actually the return 1 version makes sense to me too: since a new completion was observed after notify-cq, we likely currently have HCA writing new completions into the CQ at a high rate, so it makes sense to delay polling by a few cycles, and reduce the number of interrupts in this way. Right? -- MST ___ general mailing list [EMAIL PROTECTED] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [OFA General] List Address Change Completed
Quoting Lee, Michael Paichi [EMAIL PROTECTED]: Subject: [OFA General] List Address Change Completed This list has been migrated to the new server, lists.openfabrics.org. Please update any address book or filter settings to reflect the new mailing list address. Future messages and replies should be sent to this address: [EMAIL PROTECTED] The new web address for this list is: http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general If you have any questions, please contact me at [EMAIL PROTECTED] Can the subject prefix be made all lower-case, with dash, please? OFA General - ofa-general? Upper case words look like shouting to me, and e.g. exchange rules are limited in coping with spaces. -- MST ___ general mailing list [EMAIL PROTECTED] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [RFC] [PATCH v2] IB/ipoib: Add bonding support to IPoIB
During my tests I found that when running 1. modprobe -r ib_mthca (to delete IPoIB interfaces) 2. ping somewhere on the subnet of bond0 I get this stack dump (which ends with kernel death) [8037ff32] skb_under_panic+0x5c/0x60 [882e00c2] :ib_ipoib:ipoib_hard_header+0xa6/0xc0 [803c3c98] arp_create+0x120/0x226 [803c3dc3] arp_send+0x25/0x3b [803c466a] arp_solicit+0x186/0x195 [8038c0ac] neigh_timer_handler+0x2b5/0x309 [8038bdf7] neigh_timer_handler+0x0/0x309 [80239599] run_timer_softirq+0x130/0x19e [80235fcc] __do_softirq+0x55/0xc3 [8020acac] call_softirq+0x1c/0x28 [8020c02b] do_softirq+0x2c/0x7d [8021864a] smp_apic_timer_interrupt+0x57/0x6a [80208e19] mwait_idle+0x0/0x45 [8020a756] apic_timer_interrupt+0x66/0x70 EOI [80208e5b] mwait_idle+0x42/0x45 [80208db1] cpu_idle+0x8b/0xae [80217d60] start_secondary+0x47f/0x48f The only way I found to avoid this (for now) is to check skb headroom in ipoib_hard_header. I guess that this safety check doesn't harm regular IPoIB operation and it seems to solve my problem. However, I would be happy to hear what others think of this last issue. This seems to mean that hard_header_len is not copied from slave to master device. Right? Maybe that's what needs to be fixed. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [RFC] [PATCH v2] IB/ipoib: Add bonding support to IPoIB
When using the bonding driver, neighbours are created by the net stack on behalf of the bonding (master) device. On the tx flow the bonding code gets an skb such that skb-dev points to the master device, it changes this skb to point on the slave device and calls the slave hard_start_xmit function. Combing these two flows, there is a hole if some code at ipoib (ipoib_neigh_destructor) assumes that for each struct neighbour it gets, n-dev is an ipoib device so for example netdev_priv(n-dev) would be of type struct ipoib_dev_priv. To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one. It seems that in this design, if multiple ipoib interfaces are present, we might get an skb such that skb-dev will be different from the new dev field in struct ipoib_neigh. It seems that the result will be that the packet will be sent on a wrong interface. Right? In addition, if an IPoIB device is removed before bonding is unloaded it may cause bond0 neighbours (neighbours that point to bond0) to exist after the IPoIB device no longer exist. This is why a neighbour cleanup is required during device cleanup. This cleanup scans the arp cache and the ndisc cache to find there neighbours of bond0 which refer also to the relevant ibX. Also, when ib_ipoib module is unloaded, the neighbour destructor must be set to NULL because the neighbour function is in ib_ipoib. For this neigh table cleanup, it is required to export the symbol nd_tbl just like the symbol arp_tbl is. I wonder about this: is it really true that any allocated neighbour is always in either arp_tbl or nd_tbl? For example, could some code have called neigh_hold and retained a neighbour that is not in either one of these tables? During my tests I found that when running 1. modprobe -r ib_mthca (to delete IPoIB interfaces) 2. ping somewhere on the subnet of bond0 I get this stack dump (which ends with kernel death) [8037ff32] skb_under_panic+0x5c/0x60 [882e00c2] :ib_ipoib:ipoib_hard_header+0xa6/0xc0 [803c3c98] arp_create+0x120/0x226 [803c3dc3] arp_send+0x25/0x3b [803c466a] arp_solicit+0x186/0x195 [8038c0ac] neigh_timer_handler+0x2b5/0x309 [8038bdf7] neigh_timer_handler+0x0/0x309 [80239599] run_timer_softirq+0x130/0x19e [80235fcc] __do_softirq+0x55/0xc3 [8020acac] call_softirq+0x1c/0x28 [8020c02b] do_softirq+0x2c/0x7d [8021864a] smp_apic_timer_interrupt+0x57/0x6a [80208e19] mwait_idle+0x0/0x45 [8020a756] apic_timer_interrupt+0x66/0x70 EOI [80208e5b] mwait_idle+0x42/0x45 [80208db1] cpu_idle+0x8b/0xae [80217d60] start_secondary+0x47f/0x48f The only way I found to avoid this (for now) is to check skb headroom in ipoib_hard_header. I guess that this safety check doesn't harm regular IPoIB operation and it seems to solve my problem. However, I would be happy to hear what others think of this last issue. As I said, this seems to indicate a problem in the bonding code. But what will happen after you error out in ipoib_hard_header? Is the packet dropped? What might break as a result? I would really appreciate comments. thanks -MoniS -- diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 07deee8..31bc6d8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -216,6 +216,7 @@ struct ipoib_neigh { struct sk_buff_head queue; struct neighbour *neighbour; + struct net_device *dev; struct list_headlist; }; @@ -232,7 +233,8 @@ static inline struct ipoib_neigh **to_ip INFINIBAND_ALEN, sizeof(void *)); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh, + struct net_device *dev); void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh); extern struct workqueue_struct *ipoib_workqueue; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 705eb1d..0e3953e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -48,6 +48,8 @@ #include linux/ip.h #include linux/in.h #include net/dst.h +#include net/arp.h +#include net/ndisc.h #define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) 0xff) @@ -70,6 +72,7 @@ module_param_named(debug_level, ipoib_de MODULE_PARM_DESC(debug_level, Enable debug tracing if 0); #endif +static int ipoib_at_exit = 0; struct ipoib_path_iter { struct net_device
Re: [openib-general] [PATCH for-2.6.21] IPoIB/cm: improve small message bandwidth
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [openib-general] [PATCH for-2.6.21] IPoIB/cm: improve small message bandwidth OK, I applied the following patch (I had to change one line of your patch to get it to apply because the small-message changed the context so one chunk didn't apply). Anyway I don't see any difference in small message latency or large message throughput. (Actually latency seems slightly worse but I think the change is within my normal variability so I'm don't think the difference is significant) OK. I wonder whether unrolling the loop in skb_put_frags might be helpful. Could you please try the following? Does this affect latency for you? (I don't see any difference in between UD and CM either with or without this patch). Try to improve small message latency some more by unrolling more loops. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index a389854..a8895b4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -311,38 +311,6 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, return 0; } } -/* Adjust length of skb with fragments to match received data */ -static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, - unsigned int length, struct sk_buff *toskb) -{ - int i, num_frags; - unsigned int size; - - /* put header into skb */ - size = min(length, hdr_space); - skb-tail += size; - skb-len += size; - length -= size; - - num_frags = skb_shinfo(skb)-nr_frags; - for (i = 0; i num_frags; i++) { - skb_frag_t *frag = skb_shinfo(skb)-frags[i]; - - if (length == 0) { - /* don't need this page */ - skb_fill_page_desc(toskb, i, frag-page, 0, PAGE_SIZE); - --skb_shinfo(skb)-nr_frags; - } else { - size = min(length, (unsigned) PAGE_SIZE); - - frag-size = size; - skb-data_len += size; - skb-truesize += size; - skb-len += size; - length -= size; - } - } -} void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) { @@ -352,7 +320,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) struct ipoib_cm_rx *p; unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; - int frags; + unsigned head_size, frag_size, frags; ipoib_dbg_data(priv, cm recv completion: id %d, op %d, status: %d\n, wr_id, wc-opcode, wc-status); @@ -388,8 +356,9 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) } } - frags = PAGE_ALIGN(wc-byte_len - min(wc-byte_len, - (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + head_size = min(wc-byte_len, (unsigned)IPOIB_CM_HEAD_SIZE); + frag_size = wc-byte_len - head_size; + frags = PAGE_ALIGN(frag_size) / PAGE_SIZE; newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); if (unlikely(!newskb)) { @@ -408,7 +377,18 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) ipoib_dbg_data(priv, received %d bytes, SLID 0x%04x\n, wc-byte_len, wc-slid); - skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc-byte_len, newskb); + memcpy(skb_shinfo(newskb)-frags[frags], skb_shinfo(skb)-frags[frags], + (IPOIB_CM_RX_SG - 1 - frags) * sizeof(skb_frag_t)); + skb_shinfo(newskb)-nr_frags = IPOIB_CM_RX_SG - 1; + + skb_shinfo(skb)-nr_frags = frags; + skb-tail += head_size; + skb-len += wc-byte_len; + skb-data_len += frag_size; + skb-truesize += frag_size; + if (frags) + skb_shinfo(skb)-frags[frags - 1].size = + (frag_size - 1) % PAGE_SIZE + 1; skb-protocol = ((struct ipoib_header *) skb-data)-proto; skb-mac.raw = skb-data; -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [openfabrics-ewg] new OFED 1.2 package
Quoting r. Woodruff, Robert J [EMAIL PROTECTED]: Subject: Re: [openfabrics-ewg] new OFED 1.2 package I am also still seeing the issue with the rdma_cm abi_version on RedHat EL4-U3, bug number, 347. The bug report contains the patch that should fix this. OK, here's a somewhat cleaned-up patch. However, I have a question: should not the module cleanup function remove ucma_class and class_attr_abi_version that were created at module initialization? diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index e2e8d32..e9e024e 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -847,13 +847,12 @@ static struct miscdevice ucma_misc = { .fops = ucma_fops, }; -static ssize_t show_abi_version(struct device *dev, - struct device_attribute *attr, - char *buf) +static struct class *ucma_class; +static ssize_t show_abi_version(struct class *class_dev, char *buf) { return sprintf(buf, %d\n, RDMA_USER_CM_ABI_VERSION); } -static DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); +static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); static int __init ucma_init(void) { @@ -863,7 +862,13 @@ static int __init ucma_init(void) if (ret) return ret; - ret = device_create_file(ucma_misc.this_device, dev_attr_abi_version); + ucma_class = class_create(THIS_MODULE, infiniband_ucma); + if (IS_ERR(ucma_class)) { + printk(KERN_ERR rdma_ucm: couldn't create class infiniband_ucma\n); + goto err; + } + + ret = class_create_file(ucma_class, class_attr_abi_version); if (ret) { printk(KERN_ERR rdma_ucm: couldn't create abi_version attr\n); goto err; @@ -876,7 +881,6 @@ err: static void __exit ucma_cleanup(void) { - device_remove_file(ucma_misc.this_device, dev_attr_abi_version); misc_deregister(ucma_misc); idr_destroy(ctx_idr); } -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] bugs filed for problems compiling OFED 1.2 alpha1
Quoting Scott Weitzenkamp (sweitzen) [EMAIL PROTECTED]: Subject: bugs filed for problems compiling OFED 1.2 alpha1 Please fix these bugs for beta. Scott, you have assigned all bugs to [EMAIL PROTECTED] To have the bugs resolved, please assign them to maintainers of appropriate module. For example, bonding module owner is Moni Shoua [EMAIL PROTECTED], so I think bug 384 should be assigned to him. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] bugs filed for problems compiling OFED 1.2 alpha1
Quoting Scott Weitzenkamp (sweitzen) [EMAIL PROTECTED]: Subject: bugs filed for problems compiling OFED 1.2 alpha1 Please fix these bugs for beta. I've compiled for RHEL4 and SLES10 on x86_64, i686, ia64, and ppc64. I compiled all MPIs with GNU, Intel, and PGI compilers. • 380 OFED 1.2 alpha1 gcc MVAPICH won't compile on RHEL4 IA64 • 381 OFED 1.2 alpha1 MVAPICH2 won't compile on RHEL4 IA64 with Intel compiler • 382 OFED 1.2 alpha1 mpitests won't compile with Intel compiler for Open MPI (RHEL4 IA64) • 383 OFED 1.2 alpha1 core/addr.c won't compile on SLES10 IA64 • 384 OFED 1.2 alpha1 ib-bonding won't compile on RHEL4 U3 ppc64 • 386 OFED 1.2 alpha1 gcc MVAPICH2 won't compile on RHEL4 ppc64 (add -m64) • 387 OFED 1.2 alpha1 Open MPI won't compile on SLES10 ppc64 Some of these might be fixed in recent nightly builds. Specifically I know 383 was fixed yesterday. Please check this and let us know. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] bugs filed for problems compiling OFED 1.2 alpha1
Quoting Scott Weitzenkamp (sweitzen) [EMAIL PROTECTED]: Subject: RE: bugs filed for problems compiling OFED 1.2 alpha1 Scott, you have assigned all bugs to [EMAIL PROTECTED] To have the bugs resolved, please assign them to maintainers of appropriate module. Not sure what you mean by all, only 384 was not assigned to a specific person. Correct. Sorry about that. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] libmthca: optimize calls to htonl with constant parameter
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCH] libmthca: optimize calls to htonl with constant parameter Newer gccs have the -fwhole-program --combine options that address this and more. One of the things that happens is that all internal functions are made 'static' and all compilation units are optimized in one go. Good point... but is there any sane way to use that feature with automake and libtool? I know that the autotools are a pain but I really don't want to reimplement the useful stuff they give us, and I don't know of any really practical replacement... Once KDE4 is out, I expect that most systems will start shipping cmake. Maybe it'll be practical to switch to that then. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] libmthca: optimize calls to htonl with constant parameter
So what is different in your setup that causes this patch to make a difference for you? Hmm. I agree it is somewhat strange. Below is a simple test that attempts to compare htonl, CONSTANT_HTONL, and an array-driven implementation. The code line is taken directly from htonl. Could you compile and run it please? I see: $ gcc -O2 1.c $ ./a.out test1 122396.00 usec test2 10517799.00 usec test3 104099.00 usec which seems to imply CONSTANT_HTONL is much faster. Ideas? --- #include stdio.h #include sys/time.h #include time.h #include endian.h #define SIZE 255 enum ibv_send_flags { IBV_SEND_FENCE = 1 0, IBV_SEND_SIGNALED = 1 1, IBV_SEND_SOLICITED = 1 2, IBV_SEND_INLINE = 1 3 }; enum { MTHCA_NEXT_DBD = 1 7, MTHCA_NEXT_FENCE = 1 6, MTHCA_NEXT_CQ_UPDATE = 1 3, MTHCA_NEXT_EVENT_GEN = 1 2, MTHCA_NEXT_SOLICIT = 1 1, }; int ar[SIZE]; void init_ar() { ar[0]=htonl(1); ar[IBV_SEND_SIGNALED]=htonl(MTHCA_NEXT_CQ_UPDATE|1);; ar[IBV_SEND_SOLICITED]=htonl(MTHCA_NEXT_SOLICIT|1);; ar[IBV_SEND_SIGNALED|IBV_SEND_SOLICITED]=htonl(MTHCA_NEXT_CQ_UPDATE|MTHCA_NEXT_SOLICIT|1);; } int test1(int x) { return ar[x (IBV_SEND_SIGNALED | IBV_SEND_SOLICITED)]; } int test2(int x) { return ((x IBV_SEND_SIGNALED) ? htonl(MTHCA_NEXT_CQ_UPDATE) : 0) | ((x IBV_SEND_SOLICITED) ? htonl(MTHCA_NEXT_SOLICIT) : 0) | htonl(1); } #if __BYTE_ORDER == __LITTLE_ENDIAN #define CONSTANT_HTONL(x) \ ((x 24) | ((x 8) 0xff00) | ((x 8) 0xff) | (x 24)) #elif __BYTE_ORDER == __BIG_ENDIAN #define CONSTANT_HTONL(x) (x) #else #define CONSTANT_HTONL(x) htonl(x) #endif int test3(int x) { return ((x IBV_SEND_SIGNALED) ? CONSTANT_HTONL(MTHCA_NEXT_CQ_UPDATE) : 0) | ((x IBV_SEND_SOLICITED) ? CONSTANT_HTONL(MTHCA_NEXT_SOLICIT) : 0) | CONSTANT_HTONL(1); } struct timeval start, end; void timestart(void) { if (gettimeofday(start, NULL)) { perror(gettimeofday); return; } } void timeend(void) { if (gettimeofday(end, NULL)) { perror(gettimeofday); return; } { float usec = (end.tv_sec - start.tv_sec) * 100 + (end.tv_usec - start.tv_usec); printf(%.2f usec\n, usec); } } main() { int i; init_ar(); printf(test1\n); timestart(); for (i=0; i1; ++i) { (void) test1(IBV_SEND_SIGNALED); (void) test1(0); (void) test1(IBV_SEND_SIGNALED | IBV_SEND_SOLICITED); (void) test1(IBV_SEND_SOLICITED); } timeend(); printf(test2\n); timestart(); for (i=0; i1; ++i) { (void) test2(IBV_SEND_SIGNALED); (void) test2(0); (void) test2(IBV_SEND_SIGNALED | IBV_SEND_SOLICITED); (void) test2(IBV_SEND_SOLICITED); } timeend(); printf(test3\n); timestart(); for (i=0; i1; ++i) { (void) test3(IBV_SEND_SIGNALED); (void) test3(0); (void) test3(IBV_SEND_SIGNALED | IBV_SEND_SOLICITED); (void) test3(IBV_SEND_SOLICITED); } timeend(); } -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] libmthca: optimize calls to htonl with constant parameter
Quoting Michael S. Tsirkin [EMAIL PROTECTED]: Subject: Re: [PATCH] libmthca: optimize calls to htonl with constant parameter So what is different in your setup that causes this patch to make a difference for you? Hmm. I agree it is somewhat strange. Below is a simple test that attempts to compare htonl, CONSTANT_HTONL, and an array-driven implementation. The code line is taken directly from htonl. Could you compile and run it please? OK, this was stupid, the test was missing #include netinet/in.h so htonl was expanded by a gcc intrinsic which seems to work worse than the macro tricks present in netinet/in.h. I guess this include got killed on the test system somehow, and this explains why I saw a difference in libmthca. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] anyone have OFED 1.2 alpha1 compiling on ppc64
Quoting Scott Weitzenkamp (sweitzen) [EMAIL PROTECTED]: Subject: anyone have OFED 1.2 alpha1 compiling on ppc64 I tried both RHEL4 and SLES10 usinstall.sh, and get this. I filed bug 379, anyone else tried ppc64? Scott, could pls you upload the kernel sources and .config files to staging? If you do, we'll be able to add these to mightly cross-build environment. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] libibverbs: can't compile more than once due to man3 symbolic links
Quoting Jack Morgenstein [EMAIL PROTECTED]: Subject: libibverbs: can't compile more than once due to man3 symbolic links The code below was just added to libibverbs/Makefile.am install-data-hook: cd $(DESTDIR)$(mandir)/man3 \ $(LN_S) ibv_get_async_event.3 ibv_ack_async_event.3 \ This creates a problem when re-compiling/re-installing libibverbs -- the ln -s ( = $(LN_S) ) fails because the symbolic links still exist in the man/man3 directory. I rummaged around the libtool documentation, and there is no pre-defined macro which does ln -fs (which would just overwrite the current links). Any ideas on how to fix this problem cleanly (i.e., without violating spirit of libtool/automake)? Probably just add $(RM) ibv_ack_async_event.3 \ -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] librdmacm examples not working under OFED 1.2 alpha
librdmacm: couldn't read ABI version. librdmacm: assuming: 4 I think there was a kernel patch from Woody to address this. Woody? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH for-2.6.21] IPoIB/cm: improve small message bandwidth
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCH for-2.6.21] IPoIB/cm: improve small message bandwidth Thanks, queued for 2.6.21. With this patch I see small-packet latency down almost all the way back to what datagram mode gives -- on a pair of fast woodcrest systems I see latencies for netpipe tcp 1 byte messages like datagram 13.xx original CM 17.xx patched CM 14.xx so there is still a measurable difference but it is much less now. Hmm. An old system I tried here has a much higher latency, but does not seem to exhibit latency difference between datagram and CM. 1. Is there something special you do when you run the benchmark (msi, taskset, ...)? 2. On a wild guess that the issue here is higher interrupt rate with CM, is there a chance you could test the following patch posted by me earlier? http://www.mail-archive.com/openib-general@openib.org/msg29290.html Thanks, -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] anyone have OFED 1.2 alpha1 compiling on ppc64
Don't you have an account at ssh.openfabrics.org? If yes, just put kernel sources and the .config under your home directory Quoting r. Scott Weitzenkamp (sweitzen) [EMAIL PROTECTED]: Subject: Re: anyone have OFED 1.2 alpha1 compiling on ppc64 How do I upload sources? -Original Message- From: Michael S. Tsirkin [mailto:[EMAIL PROTECTED] Sent: Thursday, February 22, 2007 1:00 AM To: Scott Weitzenkamp (sweitzen) Cc: [EMAIL PROTECTED]; OPENIB Subject: Re: anyone have OFED 1.2 alpha1 compiling on ppc64 Quoting Scott Weitzenkamp (sweitzen) [EMAIL PROTECTED]: Subject: anyone have OFED 1.2 alpha1 compiling on ppc64 I tried both RHEL4 and SLES10 usinstall.sh, and get this. I filed bug 379, anyone else tried ppc64? Scott, could pls you upload the kernel sources and .config files to staging? If you do, we'll be able to add these to mightly cross-build environment. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH for-2.6.21] IPoIB/cm: improve small message bandwidth
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCH for-2.6.21] IPoIB/cm: improve small message bandwidth OK, I applied the following patch (I had to change one line of your patch to get it to apply because the small-message changed the context so one chunk didn't apply). Anyway I don't see any difference in small message latency or large message throughput. (Actually latency seems slightly worse but I think the change is within my normal variability so I'm don't think the difference is significant) OK, thanks for testing this. I need to spend more time on reproducing this issue, and profiling. I'll add this to my todo list. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] anyone have OFED 1.2 alpha1 compiling on ppc64
Quoting Scott Weitzenkamp (sweitzen) [EMAIL PROTECTED]: Subject: RE: anyone have OFED 1.2 alpha1 compiling on ppc64 Don't you have an account at ssh.openfabrics.org? Can an admin please give me an account? I'm not an admin but I think you want to post your ssh public key. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IPOIB NAPI
An API idea: how about instead testing missed_events, we add a flag: IB_CQ_TEST (or a longer name IB_CQ_REPORT_MISSED_EVENTS?) and change ib_req_notify_cq to return int which will keep the missed_events value, only if this flag is set? This has 2 advatages - Less churn updating all users to new API - they just ignore return value - and still almost no overhead for them as they don't set IB_CQ_TEST - For all users we have to push less values on stack - note compiler can't get rid of them as we are calling function through a pointer - For users that do missed_events = ib_req_notify_cq(priv-cq, IB_CQ_NEXT_COMP | IB_CQ_TEST) we get the result in register. Yes, I like this. So ib_req_notify_cq() gets a return value that is negative if an error occurred, 0 if everything is fine, or positive if a missed event might have happened. I think I prefer the longer name IB_CQ_REPORT_MISSED_EVENTS -- at least there's a chance at guessing what it means even if you don't read the documentation. By the way, how about extending the userspace API in a similiar fashion? missed_events = ibv_req_notify_cq(priv-cq, IBV_CQ_NEXT_COMP | IBV_CQ_REPORT_MISSED_EVENTS) -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IPOIB NAPI
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: IPOIB NAPI By the way, how about extending the userspace API in a similiar fashion? missed_events = ibv_req_notify_cq(priv-cq, IBV_CQ_NEXT_COMP | IBV_CQ_REPORT_MISSED_EVENTS) It would require a kernel-user ABI bump. Is it worth it? I hear some people asking for it: I imagine reasons are same as NAPI - race-free, clean API to switch from polling to event mode - rather than a minor optimization. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] [PATCH] libmthca: optimize calls to htonl with constant parameter
GCC seems to be unable to propogate constants across calls to htonl. So it turns out to be worth the while to replace htonl with a hand-written macro in case of constant parameter. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] Signed-off-by: Ishai Rabinovitz [EMAIL PROTECTED] --- Roland, I'm looking at micro-optimizing libmthca/mthca some more. The following optimization is minor, but it seems quite safe. What do you think? Tested with gcc 4.0.3. diff --git a/src/cq.c b/src/cq.c index 0aeb7a9..9428f74 100644 --- a/src/cq.c +++ b/src/cq.c @@ -275,7 +275,7 @@ static int handle_error_cqe(struct mthca_cq *cq, * doorbell count field. In that case we always free the CQE. */ if (mthca_is_memfree(cq-ibv_cq.context) || - !(new_wqe htonl(0x3f)) || (!cqe-db_cnt dbd)) + !(new_wqe CONSTANT_HTONL(0x3f)) || (!cqe-db_cnt dbd)) return 0; cqe-db_cnt = htons(ntohs(cqe-db_cnt) - dbd); diff --git a/src/mthca.h b/src/mthca.h index 1f31bc3..798029f 100644 --- a/src/mthca.h +++ b/src/mthca.h @@ -112,6 +112,20 @@ enum { MTHCA_OPCODE_INVALID= 0xff }; +/* GCC does not seem to be able to do constant propogation + * across htonl/ntohl calls */ +#if __BYTE_ORDER == __LITTLE_ENDIAN +#define CONSTANT_HTONL(x)\ + unsigned)x) 24) | \ +unsigned)x) 8) 0xff00) | \ +unsigned)x) 8) 0xff) | \ +(((unsigned)x) 24)) +#elif __BYTE_ORDER == __BIG_ENDIAN +#define CONSTANT_HTONL(x) (x) +#else +#define CONSTANT_HTONL(x) htonl(x) +#endif + struct mthca_ah_page; struct mthca_device { diff --git a/src/qp.c b/src/qp.c index f2483e9..85d3385 100644 --- a/src/qp.c +++ b/src/qp.c @@ -138,10 +138,10 @@ int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, ((struct mthca_next_seg *) wqe)-ee_nds = 0; ((struct mthca_next_seg *) wqe)-flags = ((wr-send_flags IBV_SEND_SIGNALED) ? -htonl(MTHCA_NEXT_CQ_UPDATE) : 0) | +CONSTANT_HTONL(MTHCA_NEXT_CQ_UPDATE) : 0) | ((wr-send_flags IBV_SEND_SOLICITED) ? -htonl(MTHCA_NEXT_SOLICIT) : 0) | - htonl(1); +CONSTANT_HTONL(MTHCA_NEXT_SOLICIT) : 0) | + CONSTANT_HTONL(1); if (wr-opcode == IBV_WR_SEND_WITH_IMM || wr-opcode == IBV_WR_RDMA_WRITE_WITH_IMM) ((struct mthca_next_seg *) wqe)-imm = wr-imm_data; @@ -359,9 +359,9 @@ int mthca_tavor_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, ((struct mthca_next_seg *) wqe)-nda_op = 0; ((struct mthca_next_seg *) wqe)-ee_nds = - htonl(MTHCA_NEXT_DBD); + CONSTANT_HTONL(MTHCA_NEXT_DBD); ((struct mthca_next_seg *) wqe)-flags = - htonl(MTHCA_NEXT_CQ_UPDATE); + CONSTANT_HTONL(MTHCA_NEXT_CQ_UPDATE); wqe += sizeof (struct mthca_next_seg); size = sizeof (struct mthca_next_seg) / 16; @@ -505,10 +505,10 @@ int mthca_arbel_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, ((struct mthca_next_seg *) wqe)-flags = ((wr-send_flags IBV_SEND_SIGNALED) ? -htonl(MTHCA_NEXT_CQ_UPDATE) : 0) | +CONSTANT_HTONL(MTHCA_NEXT_CQ_UPDATE) : 0) | ((wr-send_flags IBV_SEND_SOLICITED) ? -htonl(MTHCA_NEXT_SOLICIT) : 0) | - htonl(1); +CONSTANT_HTONL(MTHCA_NEXT_SOLICIT) : 0) | + CONSTANT_HTONL(1); if (wr-opcode == IBV_WR_SEND_WITH_IMM || wr-opcode == IBV_WR_RDMA_WRITE_WITH_IMM) ((struct mthca_next_seg *) wqe)-imm = wr-imm_data; @@ -750,7 +750,7 @@ int mthca_arbel_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, if (i qp-rq.max_gs) { ((struct mthca_data_seg *) wqe)-byte_count = 0; - ((struct mthca_data_seg *) wqe)-lkey = htonl(MTHCA_INVAL_LKEY); + ((struct mthca_data_seg *) wqe)-lkey = CONSTANT_HTONL(MTHCA_INVAL_LKEY); ((struct mthca_data_seg *) wqe)-addr = 0; } @@ -872,7 +872,7 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap, for (scatter = (void *) (next + 1); (void *) scatter (void *) next + (1 qp-rq.wqe_shift); ++scatter) - scatter-lkey = htonl(MTHCA_INVAL_LKEY); + scatter-lkey = CONSTANT_HTONL(MTHCA_INVAL_LKEY
Re: [openib-general] [PATCH for-2.6.21] IPoIB/cm: improve small message bandwidth
Quoting r. Or Gerlitz [EMAIL PROTECTED]: Subject: Re: [openib-general] [PATCH for-2.6.21] IPoIB/cm: improve small message bandwidth Michael S. Tsirkin wrote: Avoid overhead of freeing/reallocating and mapping/unmapping for dma for pages that have not been written to by hardware. diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 8ee6f06..a23c8e3 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -68,14 +68,14 @@ struct ipoib_cm_id { static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); -static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, +static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, int frags, u64 mapping[IPOIB_CM_RX_SG]) { int i; ib_dma_unmap_single(priv-ca, mapping[0], IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE); - for (i = 0; i IPOIB_CM_RX_SG - 1; ++i) + for (i = 0; i frags; ++i) ib_dma_unmap_single(priv-ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } I understand that in ipoib_cm_alloc_rx_skb you call dma_map_page on IPOIB_CM_RX_SG pages where here you call dma_unmap_single only $frags times, correct? No. does this means you are trashing the IOMMU etc etc of the system? I don't think so. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Fwd: Address List Change for Friday, 2/23/2007
Could an example message be please sent *today* to the new list, so that client rules can be updated? I can't access my inbox on Friday or Saturday, and this change will cause problems and message loss for me unless I can prepare beforehand. Quoting Jeff Squyres [EMAIL PROTECTED]: Subject: Fwd: Address List Change for Friday, 2/23/2007 FYI. In case you missed it the first time: THIS LIST IS CHANGING ON FRIDAY 2/23/2007 (2 days from now). Please update your addressbooks! See the notice below for the details. Begin forwarded message: From: Lee, Michael Paichi [EMAIL PROTECTED] Date: February 19, 2007 10:43:23 AM EST To: openib-general@openib.org Subject: [openib-general] Address List Change for Friday, 2/23/2007 We're in the process of migrating the maillists from the old openib.org server to the new lists.openfabrics.org machine. The list openib-general will be moved this Friday, February 23, 2007. The new address for the maillist will be [EMAIL PROTECTED] What this means is that messages will come from [EMAIL PROTECTED] Conversely, replies should be made to this address as well. Messages will also have a new subject line prefix of [OFA General]. If you have configured your e-mail client to filter based on maillist address or subject headers, you may need to make some adjustments for filtering. However, for the sake of transition, messages sent to the previous maillist address on the old server will forward to the new server. This forward will remain in place until the old server is taken offline and final DNS changes are made. We expect the old server to go offline sometime in early March. The web archives will also be migrated to the new web address shortly, http://lists.openfabrics.org. If you have any questions, please don't hesitate to contact me at [EMAIL PROTECTED] Regards, Michael Lee ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/ openib-general -- Jeff Squyres Server Virtualization Business Unit Cisco Systems ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] [PATCH for-2.6.21] IPoIB/cm: improve small message bandwidth
Avoid overhead of freeing/reallocating and mapping/unmapping for dma for pages that have not been written to by hardware. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- This gives 10% boost in BW for message sizes up to 32K. Please queue for 2.6.21. before: # ./netperf-2.4.2/src/netperf -f M -H 11.4.3.68 -c -C -- -m 32000 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo Recv SendSend Utilization Service Demand Socket Socket Message Elapsed Send Recv SendRecv Size SizeSize Time Throughput localremote local remote bytes bytes bytessecs.MBytes /s % S % S us/KB us/KB 87380 16384 3200010.00 716.23 26.2223.941.430 1.306 after: # ./netperf-2.4.2/src/netperf -f M -H 11.4.3.68 -c -C -- -m 32000 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo Recv SendSend Utilization Service Demand Socket Socket Message Elapsed Send Recv SendRecv Size SizeSize Time Throughput localremote local remote bytes bytes bytessecs.MBytes /s % S % S us/KB us/KB 87380 16384 3200010.00 888.67 24.1325.081.061 1.102 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 8ee6f06..a23c8e3 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -68,14 +68,14 @@ struct ipoib_cm_id { static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); -static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, +static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, int frags, u64 mapping[IPOIB_CM_RX_SG]) { int i; ib_dma_unmap_single(priv-ca, mapping[0], IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE); - for (i = 0; i IPOIB_CM_RX_SG - 1; ++i) + for (i = 0; i frags; ++i) ib_dma_unmap_single(priv-ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } @@ -93,7 +93,8 @@ static int ipoib_cm_post_receive(struct net_device *dev, int id) ret = ib_post_srq_recv(priv-cm.srq, priv-cm.rx_wr, bad_wr); if (unlikely(ret)) { ipoib_warn(priv, post srq failed for buf %d (%d)\n, id, ret); - ipoib_cm_dma_unmap_rx(priv, priv-cm.srq_ring[id].mapping); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + priv-cm.srq_ring[id].mapping); dev_kfree_skb_any(priv-cm.srq_ring[id].skb); priv-cm.srq_ring[id].skb = NULL; } @@ -101,8 +102,8 @@ static int ipoib_cm_post_receive(struct net_device *dev, int id) return ret; } -static int ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, -u64 mapping[IPOIB_CM_RX_SG]) +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +u64 mapping[IPOIB_CM_RX_SG]) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; @@ -110,7 +111,7 @@ static int ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); if (unlikely(!skb)) - return -ENOMEM; + return NULL; /* * IPoIB adds a 4 byte header. So we need 12 more bytes to align the @@ -122,10 +123,10 @@ static int ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, DMA_FROM_DEVICE); if (unlikely(ib_dma_mapping_error(priv-ca, mapping[0]))) { dev_kfree_skb_any(skb); - return -EIO; + return NULL; } - for (i = 0; i IPOIB_CM_RX_SG - 1; i++) { + for (i = 0; i frags; i++) { struct page *page = alloc_page(GFP_ATOMIC); if (!page) @@ -139,7 +140,7 @@ static int ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, } priv-cm.srq_ring[id].skb = skb; - return 0; + return skb; partial_error: @@ -148,8 +149,8 @@ partial_error: for (; i = 0; --i) ib_dma_unmap_single(priv-ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); - dev_kfree_skb_any(skb); - return -ENOMEM; + dev_kfree_skb_any(skb); + return NULL; } static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev, @@ -312,7 +313,7 @@ static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, } /* Adjust length of skb with fragments to match received data */ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, - unsigned int length) + unsigned int
Re: [openib-general] 32-bit build for ppc64 is required
Quoting Hoang-Nam Nguyen [EMAIL PROTECTED]: Subject: Re: 32-bit build for ppc64 is required So, what you suggest is - build 2 types of libraries, but on PPC make binaries 32 bit? That's easy - do others agree to this approach? No, for execs please create 32bit and 64bit on PPC. Another option is to build binaries with whatever type of binary gcc without extra flags generates by default. On PPC we really need the ability to build both versions. The reason is simply that there're customers who want them. Why don't offer both options, and each component owner can decide her/his default? I don't think this can be elegantly dealt with on a per-component basis. But if some component owner has an opinion on this, do speak up - note this affects binaries only, not libraries such as libehca. And the customers can pick the one(s) they like. I see your point regarding QA effort. Is it really twice? Probably more - I'm reasonably sure most scripts written so far assume stuff is installed in prefix/bin, so testing harness etc would need to be changed. And how to make sure the *correct* set of binaries was actually QA'd? My assumption might be wrong: I guess we have to assure/test the 32/64bit compatibility anyway eg. 32bit client talks to 64bit server. If we have 32bit execs only for development resp. testing, why don't we also give them to customers in order to do basic test or diagnosis of their setup? Because of the confusion this would create. For shared libraries, the 32/64 bit issues seem to be automatically figured out by ld.so, but there's no such solution for binaries. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] 32-bit build for ppc64 is required
Quoting Doug Ledford [EMAIL PROTECTED]: Subject: Re: 32-bit build for ppc64 is required On Thu, 2007-02-15 at 07:57 +0200, Michael S. Tsirkin wrote: The choice of 32/64 bit default is done on a per arch basis. With x86_64/i386, the increased number of CPU registers in 64bit mode outweighs the increased code bloat that goes along with 64bit mode. On PPC, no such register benefit exists for 64bit mode. As such, 32bit apps on PPC are faster than the equivalent 64bit apps up to the point at which a 4GB address space becomes a problem. Correspondingly, the default binaries on PPC are 32bit, and only those that *need* to be 64bit are. While a customer's application may need 4GB address space, certainly all the ibutils, diags, opensm, etc. do not. As a result, we compile all of those utilities as 32bit by default on PPC. We also ship all the libs as both 32/64bit so users can select the appropriate environment for their particular application (with the exception of dapl, which doesn't support 32bit and for which I filed a bug around the time of OFED 1.1). So, what you suggest is - build 2 types of libraries, but on PPC make binaries 32 bit? That's easy - do others agree to this approach? Yep, that's what we do. Care to post a patch to Vlad's scripts? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] [PATCH for-2.6.21] IB/ipoib: error handling thinko fix
ipoib_cm_alloc_rx_skb might be called from IRQ context, so it must use dev_kfree_skb_any, not kfree_skb Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- This one's obvious, isn't it? diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 8ee6f06..e388a41 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -148,7 +148,7 @@ partial_error: for (; i = 0; --i) ib_dma_unmap_single(priv-ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); - kfree_skb(skb); + dev_kfree_skb_any(skb); return -ENOMEM; } -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH 3 of 4] IB/mthca: fix non-cache-coherent CPUs with memfree
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCH 3 of 4] IB/mthca: fix non-cache-coherent CPUs with memfree OK, I already merged this but now I'm thinking it's somewhat buggy: Hopefully not. + if (coherent) + ret = mthca_alloc_icm_coherent(dev-pdev-dev, + chunk-mem[chunk-npages], + cur_order, gfp_mask); + else + ret = mthca_alloc_icm_pages(chunk-mem[chunk-npages], + cur_order, gfp_mask); - if (++chunk-npages == MTHCA_ICM_CHUNK_LEN) { + if (!ret) { + ++chunk-npages; + + if (!coherent chunk-npages == MTHCA_ICM_CHUNK_LEN) { chunk-nsg = pci_map_sg(dev-pdev, chunk-mem, I don't see anything that ever bumps chunk-nsg if we're allocating a coherent region and we end up needing more than one allocation to do it. Yes but this is intentional. Maybe something like this on top of the patch? diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c index 0b9d053..48f7c65 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.c +++ b/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -175,7 +175,9 @@ struct mthca_icm *mthca_alloc_icm(struct mthca_dev *dev, int npages, if (!ret) { ++chunk-npages; - if (!coherent chunk-npages == MTHCA_ICM_CHUNK_LEN) { + if (coherent) + ++chunk-nsg; + else if (chunk-npages == MTHCA_ICM_CHUNK_LEN) { chunk-nsg = pci_map_sg(dev-pdev, chunk-mem, chunk-npages, PCI_DMA_BIDIRECTIONAL); No, I think the code is fine and this patch will break things: chunk-nsg is needed only for non-coherent memory to call pci_unmap_sg: if (chunk-nsg 0) pci_unmap_sg(dev-pdev, chunk-mem, chunk-npages, PCI_DMA_BIDIRECTIONAL); and we do *not* want to call pci_unmap_sg on consistent memory. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [GIT PULL] please pull infiniband.git
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [GIT PULL] please pull infiniband.git What about the patch that i sent on Allow the following QP state transition : reset -- reset? OK, I'll merge that in the next patch. It's the kind of patch I'm not happy about merging, since it bloats the code to handle a corner case no one is likely to hit in practice, but it is technically correct so I guess we're forced to merge it. I actually think this is a spec bug: IB spec requires the following transitions in modify qp: Any State to Error Any state to Reset If we are in reset state, both of these do not make any sense. To Error from reset is especially hard to implement, and I think this will actually create bugs in application (if it did to error I think its expecting completions with error). I'll try speaking with our representatives at the IBTA to remove this requirement. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH 3 of 4] IB/mthca: fix non-cache-coherent CPUs with memfree
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCH 3 of 4] IB/mthca: fix non-cache-coherent CPUs with memfree Queued for 2.6.21, although I think a further cleanup would be: mdev-mr_table.mpt_table = mthca_alloc_icm_table(mdev, init_hca-mpt_base, dev_lim-mpt_entry_sz, mdev-limits.num_mpts, - mdev-limits.reserved_mrws, 1); + mdev-limits.reserved_mrws, + 1, 1); instead of having use_lowmem and use_coherent be separate parameters, we should probably convert it to a type parameter, and have MTHCA_ICM_TABLE_HIGHMEM, _LOWMEM and _COHERENT. That would make these calls a lot easier to read and get correct. Good idea. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH 3 of 4] IB/mthca: fix non-cache-coherent CPUs with memfree
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCH 3 of 4] IB/mthca: fix non-cache-coherent CPUs with memfree + sg_set_buf(mem, buf, PAGE_SIZE order); + BUG_ON(mem-offset); + sg_dma_len(mem) = PAGE_SIZE order; What am I missing? Any reason to set sg_dma_len() again after sg_set_buf()? How do you mean, again? Does sg_set_buf set dma_length? In 2.6.20, I see this in include/linux/scatterlist.h: static inline void sg_set_buf(struct scatterlist *sg, const void *buf, unsigned int buflen) { sg-page = virt_to_page(buf); sg-offset = offset_in_page(buf); sg-length = buflen; } -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] 32-bit build for ppc64 is required
Quoting Stefan Roscher [EMAIL PROTECTED]: Subject: 32-bit build for ppc64 is required Hi, after building the latest ofed build package we recognized that on PPC64 only 64-bit libaries were build. Because we have customers using older userpace apllications which are certified for 32-bit we think additional 32bit support is a requirement for 64bit builds. If OFED 1.2 supports 32 bit on ppc64, we have to change the install directory.I would suggest to install 32-bit binaries into /usr/local/ofed/bin32 directory. So no changes on current naming conventions has to be done.The libaries are installed in the /usr/local/ofed/lib directory. The standard practice is to install 64 bit libraries under prefix/lib64 and 32 bit libraries under prefix/lib. Why would PPC64 be any different? I do not think we need 32 bit binaries at all, and there's no other package I'm aware of that uses bin32. Comments? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] 32-bit build for ppc64 is required
Quoting Stefan Roscher [EMAIL PROTECTED]: Subject: Re: 32-bit build for ppc64 is required On Wednesday 14 February 2007 14:29, Michael S. Tsirkin wrote: Quoting Stefan Roscher [EMAIL PROTECTED]: Subject: 32-bit build for ppc64 is required Hi, after building the latest ofed build package we recognized that on PPC64 only 64-bit libaries were build. Because we have customers using older userpace apllications which are certified for 32-bit we think additional 32bit support is a requirement for 64bit builds. If OFED 1.2 supports 32 bit on ppc64, we have to change the install directory.I would suggest to install 32-bit binaries into /usr/local/ofed/bin32 directory. So no changes on current naming conventions has to be done.The libaries are installed in the /usr/local/ofed/lib directory. The standard practice is to install 64 bit libraries under prefix/lib64 and 32 bit libraries under prefix/lib. Why would PPC64 be any different? I think you missunderstand my post. The directory for 32/64bit libaries shouldbe prefix/lib and prefix/lib64 respectively. But current ofed1.2 I saw only prefix/lib64 directory, ie 64bit libs only. Well, this is not by design: AFAIK on x86_64 both types of libraries are installed. I do not think we need 32 bit binaries at all, and there's no other package I'm aware of that uses bin32. We have customers that still use 32-bit userspace applications. It would be beneficial for them if they can obtain 32bit libs and execs from ofed1.2 in order to run their applications without recompiling them, because for some 32-bit applications recompiling is not an option. 32 bit libraries are needed for users to run 32 applications. But I still do not see how installing 32 bit binaries alongside the 64 bit ones is useful, and I do not think other packages provide this option, so maybe we shouldn't, either. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [Bug 325] RDMA_CM and address translation broken on sles9sp3
Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [Bug 325] RDMA_CM and address translation broken on sles9sp3 Tziporet, I didn't think we were going to apply this patch until Michael tested it with SDP/IPoIB on various distros. Michael, did you get a chance to test it (I'm guessing not since you were out sick)? Right, I'm not at the lab. I assume Vlad tested this before applying. The reason I'm concerned is that it changes the behavior of xxx_ip_dev_find() and _all_ backports, and we needed to test it out and make sure it doesn't regress anything. If it causes problems on other backports, the plan was to just fix the sles9sp3 backport and leave the others alone. With the test build vlad published yesterday which has this patch, rhel4u4 kernel wasn't working for me with iWARP and I'm afraid it might be due to this patch. I'm investigating this now. In actual fact, xxx_ip_dev_find is not even *needed* on anything except 2.6.14, 2.6.15, 2.6.16 and 2.6.17: these are the kernels which do not export ip_dev_find. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH 3 of 4] IB/mthca: fix non-cache-coherent CPUs with memfree
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCH 3 of 4] IB/mthca: fix non-cache-coherent CPUs with memfree I don't see anything that ever bumps chunk-nsg if we're allocating a coherent region and we end up needing more than one allocation to do it. Yes but this is intentional. No, I think the code is fine and this patch will break things: chunk-nsg is needed only for non-coherent memory to call pci_unmap_sg: what about this code in mthca_memfree.h? static inline void mthca_icm_next(struct mthca_icm_iter *iter) { if (++iter-page_idx = iter-chunk-nsg) { Correct. Good catch. the call to pci_unmap_sg you're worried about is in mthca_free_icm_pages(), which can't be called for coherent memory anyway, so I don't see a problem with that. So I think my patch is correct and needed. Yes, I agree. I'll also put it in OFED. Thanks! -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] 32-bit build for ppc64 is required
Still, using non-standard hacks like bin32 does not sound like a good idea. I think the actual issue is there is no common approach for this on various platforms. On platforms I've seen, there are 2 sets of libraries but only 64 bit executables provided. This is what we had for OFED 1.0, OFED 1.1, and I don't see the reason to change that - adding more executables to install on production systems will double QA work. Maybe an option to *only* make 32 bit userspace might make sense though. Something like --disable-32bit, --disable-64bit. This would solve your problem, would it not? Does that mean if I don't specify one of them, I'll get 32- and 64bit execs? If yes, that's fine. No, by default we build 2 sets of libraries, and only 64 bit execs. But for your development purposes (I think you mentioned testing user/kernel context switch) I think we could have --disable-32bit flag to configure to get only 32 bit userspace. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] 32-bit build for ppc64 is required
Quoting Doug Ledford [EMAIL PROTECTED]: Subject: Re: [openib-general] 32-bit build for ppc64 is required On Wed, 2007-02-14 at 16:29 +0200, Michael S. Tsirkin wrote: Quoting Stefan Roscher [EMAIL PROTECTED]: Subject: Re: 32-bit build for ppc64 is required On Wednesday 14 February 2007 14:29, Michael S. Tsirkin wrote: Quoting Stefan Roscher [EMAIL PROTECTED]: Subject: 32-bit build for ppc64 is required Hi, after building the latest ofed build package we recognized that on PPC64 only 64-bit libaries were build. Because we have customers using older userpace apllications which are certified for 32-bit we think additional 32bit support is a requirement for 64bit builds. If OFED 1.2 supports 32 bit on ppc64, we have to change the install directory.I would suggest to install 32-bit binaries into /usr/local/ofed/bin32 directory. So no changes on current naming conventions has to be done.The libaries are installed in the /usr/local/ofed/lib directory. The standard practice is to install 64 bit libraries under prefix/lib64 and 32 bit libraries under prefix/lib. Why would PPC64 be any different? I think you missunderstand my post. The directory for 32/64bit libaries shouldbe prefix/lib and prefix/lib64 respectively. But current ofed1.2 I saw only prefix/lib64 directory, ie 64bit libs only. Well, this is not by design: AFAIK on x86_64 both types of libraries are installed. I do not think we need 32 bit binaries at all, and there's no other package I'm aware of that uses bin32. We have customers that still use 32-bit userspace applications. It would be beneficial for them if they can obtain 32bit libs and execs from ofed1.2 in order to run their applications without recompiling them, because for some 32-bit applications recompiling is not an option. 32 bit libraries are needed for users to run 32 applications. But I still do not see how installing 32 bit binaries alongside the 64 bit ones is useful, and I do not think other packages provide this option, so maybe we shouldn't, either. The choice of 32/64 bit default is done on a per arch basis. With x86_64/i386, the increased number of CPU registers in 64bit mode outweighs the increased code bloat that goes along with 64bit mode. On PPC, no such register benefit exists for 64bit mode. As such, 32bit apps on PPC are faster than the equivalent 64bit apps up to the point at which a 4GB address space becomes a problem. Correspondingly, the default binaries on PPC are 32bit, and only those that *need* to be 64bit are. While a customer's application may need 4GB address space, certainly all the ibutils, diags, opensm, etc. do not. As a result, we compile all of those utilities as 32bit by default on PPC. We also ship all the libs as both 32/64bit so users can select the appropriate environment for their particular application (with the exception of dapl, which doesn't support 32bit and for which I filed a bug around the time of OFED 1.1). So, what you suggest is - build 2 types of libraries, but on PPC make binaries 32 bit? That's easy - do others agree to this approach? Another option is to build binaries with whatever type of binary gcc without extra flags generates by default. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] IPoIB: Only allow root to change between datagram and connected mode
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: [PATCH] IPoIB: Only allow root to change between datagram and connected mode Change the permissions of the mode sysfs attribute to be S_IWUSR instead of S_IWUGO. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- FYI -- I'm planning to merge this for 2.6.21. It doesn't seem appropriate to allow ordinary users to mess with this sort of config. Acked-by: Michael S. Tsirkin [EMAIL PROTECTED] -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] Fwd: [ANNOUNCE] GIT 1.5.0
FYI. I suggest we update git on the openfabrics server to 1.5.0: Detached HEAD feature will be useful for nightly build scripts. Sasha? - Forwarded message from Junio C Hamano [EMAIL PROTECTED] - Subject: [ANNOUNCE] GIT 1.5.0 Date: Wed, 14 Feb 2007 05:14:16 +0200 From: Junio C Hamano [EMAIL PROTECTED] The latest feature release GIT 1.5.0 is available at the usual places: http://www.kernel.org/pub/software/scm/git/ git-1.5.0.tar.{gz,bz2}(tarball) git-htmldocs-1.5.0.tar.{gz,bz2} (preformatted docs) git-manpages-1.5.0.tar.{gz,bz2} (preformatted docs) RPMS/$arch/git-*-1.5.0-1.$arch.rpm(RPM) GIT v1.5.0 Release Notes Old news This section is for people who are upgrading from ancient versions of git. Although all of the changes in this section happened before the current v1.4.4 release, they are summarized here in the v1.5.0 release notes for people who skipped earlier versions. As of git v1.5.0 there are some optional features that changes the repository to allow data to be stored and transferred more efficiently. These features are not enabled by default, as they will make the repository unusable with older versions of git. Specifically, the available options are: - There is a configuration variable core.legacyheaders that changes the format of loose objects so that they are more efficient to pack and to send out of the repository over git native protocol, since v1.4.2. However, loose objects written in the new format cannot be read by git older than that version; people fetching from your repository using older clients over dumb transports (e.g. http) using older versions of git will also be affected. - Since v1.4.3, configuration repack.usedeltabaseoffset allows packfile to be created in more space efficient format, which cannot be read by git older than that version. The above two are not enabled by default and you explicitly have to ask for them, because these two features make repositories unreadable by older versions of git, and in v1.5.0 we still do not enable them by default for the same reason. We will change this default probably 1 year after 1.4.2's release, when it is reasonable to expect everybody to have new enough version of git. - 'git pack-refs' appeared in v1.4.4; this command allows tags to be accessed much more efficiently than the traditional 'one-file-per-tag' format. Older git-native clients can still fetch from a repository that packed and pruned refs (the server side needs to run the up-to-date version of git), but older dumb transports cannot. Packing of refs is done by an explicit user action, either by use of git pack-refs --prune command or by use of git gc command. - 'git -p' to paginate anything -- many commands do pagination by default on a tty. Introduced between v1.4.1 and v1.4.2; this may surprise old timers. - 'git archive' superseded 'git tar-tree' in v1.4.3; - 'git cvsserver' was new invention in v1.3.0; - 'git repo-config', 'git grep', 'git rebase' and 'gitk' were seriously enhanced during v1.4.0 timeperiod. - 'gitweb' became part of git.git during v1.4.0 timeperiod and seriously modified since then. - reflog is an v1.4.0 invention. This allows you to name a revision that a branch used to be at (e.g. git diff [EMAIL PROTECTED] master allows you to see changes since yesterday's tip of the branch). Updates in v1.5.0 since v1.4.4 series - * Index manipulation - git-add is to add contents to the index (aka staging area for the next commit), whether the file the contents happen to be is an existing one or a newly created one. - git-add without any argument does not add everything anymore. Use 'git-add .' instead. Also you can add otherwise ignored files with an -f option. - git-add tries to be more friendly to users by offering an interactive mode (git-add -i). - git-commit path used to refuse to commit if path was different between HEAD and the index (i.e. update-index was used on it earlier). This check was removed. - git-rm is much saner and safer. It is used to remove paths from both the index file and the working tree, and makes sure you are not losing any local modification before doing so. - git-reset tree paths... can be used to revert index entries for selected paths. - git-update-index is much less visible. Many suggestions to use the command in git output and documentation have now been replaced by simpler commands such as git add or git rm. * Repository layout and objects transfer - The data for origin repository is stored in the configuration file $GIT_DIR/config, not in $GIT_DIR/remotes/, for newly created clones. The latter is still supported and there is no need to convert your
Re: [openib-general] [PATCH RFC] use common cq for ipoib cm send side
Quoting Michael S. Tsirkin [EMAIL PROTECTED]: Subject: [PATCH RFC] use common cq for ipoib cm send side The following untested patch moves all TX processing in IPoIB CM to common CQ. This should help reduce the number of interrupts for bi-directional traffic (such as TCP). Is this a good idea? What do others think? Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] FYI, this was just thinking aloud. The version below works fine here but the performance gain seems to be very small (about 1%). The gain with NAPI might be bigger but this is yet to be tested. I'll continue looking into this. Feedback wellcome. ipoib.h | 10 +-- ipoib_cm.c | 78 +++ ipoib_ib.c | 28 - ipoib_main.c |2 - 4 files changed, 45 insertions(+), 73 deletions(-) Use common CQ for all TX QPs: keep a per-device counter out outstanding tx WRs, and stop the interface when this counter reaches the send queue size, to avoid CQ overruns. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index eb885ee..ef703c7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -99,9 +99,9 @@ enum { #defineIPOIB_OP_RECV (1ul 31) #ifdef CONFIG_INFINIBAND_IPOIB_CM -#defineIPOIB_CM_OP_SRQ (1ul 30) +#defineIPOIB_OP_CM (1ul 30) #else -#defineIPOIB_CM_OP_SRQ (0) +#defineIPOIB_OP_CM (0) #endif /* structs */ @@ -144,7 +144,6 @@ struct ipoib_cm_rx { struct ipoib_cm_tx { struct ib_cm_id *id; - struct ib_cq*cq; struct ib_qp*qp; struct list_head list; struct net_device *dev; @@ -233,6 +232,7 @@ struct ipoib_dev_priv { unsigned tx_tail; struct ib_sgetx_sge; struct ib_send_wrtx_wr; + unsigned tx_outstanding; struct ib_wc ibwc[IPOIB_NUM_WC]; @@ -439,6 +439,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx); void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb, unsigned int mtu); void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc); +void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc); #else struct ipoib_cm_tx; @@ -527,6 +528,9 @@ static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *w { } +static inline void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) +{ +} #endif #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 8ee6f06..af36562 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -85,7 +85,7 @@ static int ipoib_cm_post_receive(struct net_device *dev, int id) struct ib_recv_wr *bad_wr; int i, ret; - priv-cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv-cm.rx_wr.wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV; for (i = 0; i IPOIB_CM_RX_SG; ++i) priv-cm.rx_sge[i].addr = priv-cm.srq_ring[id].mapping[i]; @@ -346,7 +346,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc-wr_id ~IPOIB_CM_OP_SRQ; + unsigned int wr_id = wc-wr_id ~(IPOIB_OP_CM | IPOIB_OP_RECV); struct sk_buff *skb; struct ipoib_cm_rx *p; unsigned long flags; @@ -433,7 +433,7 @@ static inline int post_send(struct ipoib_dev_priv *priv, priv-tx_sge.addr = addr; priv-tx_sge.length = len; - priv-tx_wr.wr_id = wr_id; + priv-tx_wr.wr_id = wr_id | IPOIB_OP_CM; return ib_post_send(tx-qp, priv-tx_wr, bad_wr); } @@ -484,20 +484,19 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ dev-trans_start = jiffies; ++tx-tx_head; - if (tx-tx_head - tx-tx_tail == ipoib_sendq_size) { + if (++priv-tx_outstanding == ipoib_sendq_size) { ipoib_dbg(priv, TX ring 0x%x full, stopping kernel net queue\n, tx-qp-qp_num); netif_stop_queue(dev); - set_bit(IPOIB_FLAG_NETIF_STOPPED, tx-flags); } } } -static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx, - struct ib_wc *wc) +void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc-wr_id; + struct ipoib_cm_tx *tx = wc
Re: [openib-general] please pull for 2.6.21: fix + add IB multicast support
+ It seems same goes for + mc = kzalloc(sizeof(*mc), GFP_KERNEL); + if (!mc) + return NULL; We would need to set events_reported. IMO, probably worth it to init just this one field rather than use up initialized memory - and I think it's clearer. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH 0/5] iw_cxgb3 - misc cleanup and fixes
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCH 0/5] iw_cxgb3 - misc cleanup and fixes Michael What about the mthca memory registration patches? I Michael thought they are on their way. Should I repost? Sorry, I forgot about that. Yes, please resend the latest state. OK, coming up. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] [PATCH 1 of 4] IB/mthca: merge MR and FMR space on 64 bit
For Tavor, we currently reserve separate MPT and MTT space for FMRs to avoid abusing the vmalloc space on 32 bit kernels. No such problem exists on 64 bit kernels so let's not do it there. This way we have a shared pool for MR and FMR resources, used on demand. This will also make it possible to write MTTs for regular regions directly from driver. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c === --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_mr.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c @@ -761,7 +761,7 @@ void mthca_arbel_fmr_unmap(struct mthca_ int mthca_init_mr_table(struct mthca_dev *dev) { unsigned long addr; - int err, i; + int mpts, mtts, err, i; err = mthca_alloc_init(dev-mr_table.mpt_alloc, dev-limits.num_mpts, @@ -795,13 +795,21 @@ int mthca_init_mr_table(struct mthca_dev err = -EINVAL; goto err_fmr_mpt; } + mpts = mtts = 1 i; + } else { + mpts = dev-limits.num_mtt_segs; + mtts = dev-limits.num_mpts; + } + + if (!mthca_is_memfree(dev) + (dev-mthca_flags MTHCA_FLAG_FMR)) { addr = pci_resource_start(dev-pdev, 4) + ((pci_resource_len(dev-pdev, 4) - 1) dev-mr_table.mpt_base); dev-mr_table.tavor_fmr.mpt_base = - ioremap(addr, (1 i) * sizeof(struct mthca_mpt_entry)); + ioremap(addr, mpts * sizeof(struct mthca_mpt_entry)); if (!dev-mr_table.tavor_fmr.mpt_base) { mthca_warn(dev, MPT ioremap for FMR failed.\n); @@ -814,19 +822,21 @@ int mthca_init_mr_table(struct mthca_dev dev-mr_table.mtt_base); dev-mr_table.tavor_fmr.mtt_base = - ioremap(addr, (1 i) * MTHCA_MTT_SEG_SIZE); + ioremap(addr, mtts * MTHCA_MTT_SEG_SIZE); if (!dev-mr_table.tavor_fmr.mtt_base) { mthca_warn(dev, MTT ioremap for FMR failed.\n); err = -ENOMEM; goto err_fmr_mtt; } + } - err = mthca_buddy_init(dev-mr_table.tavor_fmr.mtt_buddy, i); + if (dev-limits.fmr_reserved_mtts) { + err = mthca_buddy_init(dev-mr_table.tavor_fmr.mtt_buddy, fls(mtts - 1)); if (err) goto err_fmr_mtt_buddy; /* Prevent regular MRs from using FMR keys */ - err = mthca_buddy_alloc(dev-mr_table.mtt_buddy, i); + err = mthca_buddy_alloc(dev-mr_table.mtt_buddy, fls(mtts - 1)); if (err) goto err_reserve_fmr; Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_profile.c === --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_profile.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_profile.c @@ -277,7 +277,7 @@ u64 mthca_make_profile(struct mthca_dev * out of the MR pool. They don't use additional memory, but * we assign them as part of the HCA profile anyway. */ - if (mthca_is_memfree(dev)) + if (mthca_is_memfree(dev) || BITS_PER_LONG == 64) dev-limits.fmr_reserved_mtts = 0; else dev-limits.fmr_reserved_mtts = request-fmr_reserved_mtts; -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] [PATCH 2 of 4] IB/mthca: always fill MTTs from CPU
Speed up memory registration by filling in MTTs directly. This reduces the number of FW commands needed to register an MR by at least a factor of 2. This applies to all memfree cards, and to tavor mode on 64 bit systems with the patch I posted earlier. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h === --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h @@ -464,6 +464,8 @@ void mthca_uar_free(struct mthca_dev *de int mthca_pd_alloc(struct mthca_dev *dev, int privileged, struct mthca_pd *pd); void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); +int mthca_write_mtt_size(struct mthca_dev *dev); + struct mthca_mtt *mthca_alloc_mtt(struct mthca_dev *dev, int size); void mthca_free_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt); int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c === --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_mr.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c @@ -243,8 +243,8 @@ void mthca_free_mtt(struct mthca_dev *de kfree(mtt); } -int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, - int start_index, u64 *buffer_list, int list_len) +static int __mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, +int start_index, u64 *buffer_list, int list_len) { struct mthca_mailbox *mailbox; __be64 *mtt_entry; @@ -295,6 +295,84 @@ out: return err; } +void mthca_tavor_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + u64 __iomem *mtts; + u32 mtt_seg; + int i; + + mtt_seg = mtt-first_seg * MTHCA_MTT_SEG_SIZE; + mtts = dev-mr_table.tavor_fmr.mtt_base + mtt_seg + start_index * sizeof (u64); + for (i = 0; i list_len; ++i) { + __be64 mtt_entry = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + mthca_write64_raw(mtt_entry, mtts + i); + } +} + +void mthca_arbel_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + __be64 *mtts; + int i; + int s = start_index * sizeof (u64); + + /* For Arbel, all MTTs must fit in the same page. */ + BUG_ON(s / PAGE_SIZE != (s + list_len * sizeof(u64) - 1) / PAGE_SIZE); + /* Require full segments */ + BUG_ON(s % MTHCA_MTT_SEG_SIZE); + + mtts = mthca_table_find(dev-mr_table.mtt_table, mtt-first_seg + + s / MTHCA_MTT_SEG_SIZE); + + BUG_ON(!mtts); + + for (i = 0; i list_len; ++i) + mtts[i] = cpu_to_be64(buffer_list[i] | MTHCA_MTT_FLAG_PRESENT); +} + +int mthca_write_mtt_size(struct mthca_dev *dev) +{ + if (dev-mr_table.fmr_mtt_buddy != dev-mr_table.mtt_buddy) + /* +* Be friendly to WRITE_MTT command +* and leave two empty slots for the +* index and reserved fields of the +* mailbox. +*/ + return PAGE_SIZE / sizeof (u64) - 2; + + /* For Arbel, all MTTs must fit in the same page. */ + return mthca_is_memfree(dev) ? (PAGE_SIZE / sizeof (u64)) : 0x7ff; +} + +int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + int size = mthca_write_mtt_size(dev); + int chunk; + + if (dev-mr_table.fmr_mtt_buddy != dev-mr_table.mtt_buddy) + return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len); + + while (list_len 0) { + chunk = min(size, list_len); + if (mthca_is_memfree(dev)) + mthca_arbel_write_mtt_seg(dev, mtt, start_index, + buffer_list, chunk); + else + mthca_tavor_write_mtt_seg(dev, mtt, start_index, + buffer_list, chunk); + + list_len-= chunk; + start_index += chunk; + buffer_list += chunk; + } + + return 0; +} + static inline u32 tavor_hw_index_to_key(u32 ind) { return ind; Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c === --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1015,6 +1015,7 @@ static struct ib_mr *mthca_reg_user_mr(s int
[openib-general] [PATCH 3 of 4] IB/mthca: fix non-cache-coherent CPUs with memfree
Fix non-cache-coherent CPUs with memfree HCAs. We allocate the MTT table with alloc_pages() and then do pci_map_sg(), so we must call pci_dma_sync_sg after the CPU writes to the MTT table (this works since device never writes the MTTs on memfree). For MPTs, both the device and CPU might write there, so we must allocate dma coherent memory for these. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_memfree.c === --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_memfree.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -35,6 +35,8 @@ */ #include linux/mm.h +#include linux/scatterlist.h +#include asm/page.h #include mthca_memfree.h #include mthca_dev.h @@ -58,22 +60,31 @@ struct mthca_user_db_table { }page[0]; }; -void mthca_free_icm(struct mthca_dev *dev, struct mthca_icm *icm) +void mthca_free_icm(struct mthca_dev *dev, struct mthca_icm *icm, int coherent) { struct mthca_icm_chunk *chunk, *tmp; + void *buf; int i; if (!icm) return; list_for_each_entry_safe(chunk, tmp, icm-chunk_list, list) { - if (chunk-nsg 0) - pci_unmap_sg(dev-pdev, chunk-mem, chunk-npages, -PCI_DMA_BIDIRECTIONAL); - - for (i = 0; i chunk-npages; ++i) - __free_pages(chunk-mem[i].page, -get_order(chunk-mem[i].length)); + if (coherent) + for (i = 0; i chunk-npages; ++i) { + buf = lowmem_page_address(chunk-mem[i].page); + dma_free_coherent(dev-pdev-dev, chunk-mem[i].length, + buf, sg_dma_address(chunk-mem[i])); + } + else { + if (chunk-nsg 0) + pci_unmap_sg(dev-pdev, chunk-mem, chunk-npages, +PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i chunk-npages; ++i) + __free_pages(chunk-mem[i].page, +get_order(chunk-mem[i].length)); + } kfree(chunk); } @@ -81,12 +92,41 @@ void mthca_free_icm(struct mthca_dev *de kfree(icm); } +static int mthca_alloc_icm_pages(struct scatterlist *mem, int order, gfp_t gfp_mask) +{ + mem-page = alloc_pages(gfp_mask, order); + if (!mem-page) + return -ENOMEM; + + mem-length = PAGE_SIZE order; + mem-offset = 0; + return 0; +} + +static int mthca_alloc_icm_coherent(struct device *dev, struct scatterlist *mem, + int order, gfp_t gfp_mask) +{ + void *buf = dma_alloc_coherent(dev, PAGE_SIZE order, sg_dma_address(mem), + gfp_mask); + if (!buf) + return -ENOMEM; + + sg_set_buf(mem, buf, PAGE_SIZE order); + BUG_ON(mem-offset); + sg_dma_len(mem) = PAGE_SIZE order; + return 0; +} + struct mthca_icm *mthca_alloc_icm(struct mthca_dev *dev, int npages, - gfp_t gfp_mask) + gfp_t gfp_mask, int coherent) { struct mthca_icm *icm; struct mthca_icm_chunk *chunk = NULL; int cur_order; + int ret; + + /* We use sg_set_buf for coherent allocs, which assumes low memory */ + BUG_ON(coherent (gfp_mask __GFP_HIGHMEM)); icm = kmalloc(sizeof *icm, gfp_mask ~(__GFP_HIGHMEM | __GFP_NOWARN)); if (!icm) @@ -112,21 +152,28 @@ struct mthca_icm *mthca_alloc_icm(struct while (1 cur_order npages) --cur_order; - chunk-mem[chunk-npages].page = alloc_pages(gfp_mask, cur_order); - if (chunk-mem[chunk-npages].page) { - chunk-mem[chunk-npages].length = PAGE_SIZE cur_order; - chunk-mem[chunk-npages].offset = 0; + if (coherent) + ret = mthca_alloc_icm_coherent(dev-pdev-dev, + chunk-mem[chunk-npages], + cur_order, gfp_mask); + else + ret = mthca_alloc_icm_pages(chunk-mem[chunk-npages], + cur_order, gfp_mask); - if (++chunk-npages == MTHCA_ICM_CHUNK_LEN) { + if (!ret) { + ++chunk-npages; + + if (!coherent chunk-npages == MTHCA_ICM_CHUNK_LEN) { chunk-nsg = pci_map_sg(dev-pdev, chunk-mem
[openib-general] [PATCH 4 of 4] IB/mthca: give reserved MTTs a separate cache line
This fixes several issues related to reserved MTTs and memory alignment. 1. MTTs are allocated in non-cache-coherent memory, so we must give reserved MTTs their own cache line, to prevent both device and CPU from writing into the same cache line at the same time. 2. reserved_mtts field has different meaning in Tavor and Arbel, so we are wasting mtt entries on memfree. Fix the Arbel case to match Tavor semantics. Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_main.c === --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_main.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_main.c @@ -464,6 +464,10 @@ static int mthca_init_icm(struct mthca_d goto err_unmap_aux; } + /* CPU writes to non-reserved MTTs, while HCA might DMA to reserved mtts */ + mdev-limits.reserved_mtts = ALIGN(mdev-limits.reserved_mtts * MTHCA_MTT_SEG_SIZE, + dma_get_cache_alignment()) / MTHCA_MTT_SEG_SIZE; + mdev-mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca-mtt_base, MTHCA_MTT_SEG_SIZE, mdev-limits.num_mtt_segs, Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_cmd.c === --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_cmd.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -1051,7 +1051,11 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); dev_lim-max_eqs = 1 (field 0x7); MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); - dev_lim-reserved_mtts = 1 (field 4); + if (mthca_is_memfree(dev)) + dev_lim-reserved_mtts = ALIGN((1 (field 4)) * sizeof(u64), + MTHCA_MTT_SEG_SIZE) / MTHCA_MTT_SEG_SIZE; + else + dev_lim-reserved_mtts = 1 (field 4); MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); dev_lim-max_mrw_sz = 1 field; MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] integer overflow
Roland, the following code in ipoib: while ((int) priv-tx_tail - (int) priv-tx_head 0) { seems to rely on integer overflow which seems to be undefined behaviour. Should we care? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] integer overflow
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: integer overflow while ((int) priv-tx_tail - (int) priv-tx_head 0) { seems to rely on integer overflow which seems to be undefined behaviour. tx_tail and tx_head are unsigned, and overflow is defined for unsigned integers. Yes but we cast them to signed int here - no? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] [PATCH RFC] use common cq for ipoib cm send side
The following untested patch moves all TX processing in IPoIB CM to common CQ. This should help reduce the number of interrupts for bi-directional traffic (such as TCP). Is this a good idea? What do others think? Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index eb885ee..ef703c7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -99,9 +99,9 @@ enum { #defineIPOIB_OP_RECV (1ul 31) #ifdef CONFIG_INFINIBAND_IPOIB_CM -#defineIPOIB_CM_OP_SRQ (1ul 30) +#defineIPOIB_OP_CM (1ul 30) #else -#defineIPOIB_CM_OP_SRQ (0) +#defineIPOIB_OP_CM (0) #endif /* structs */ @@ -144,7 +144,6 @@ struct ipoib_cm_rx { struct ipoib_cm_tx { struct ib_cm_id *id; - struct ib_cq*cq; struct ib_qp*qp; struct list_head list; struct net_device *dev; @@ -233,6 +232,7 @@ struct ipoib_dev_priv { unsigned tx_tail; struct ib_sgetx_sge; struct ib_send_wrtx_wr; + unsigned tx_outstanding; struct ib_wc ibwc[IPOIB_NUM_WC]; @@ -439,6 +439,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx); void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb, unsigned int mtu); void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc); +void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc); #else struct ipoib_cm_tx; @@ -527,6 +528,9 @@ static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *w { } +static inline void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) +{ +} #endif #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 8ee6f06..47c868c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -85,7 +85,7 @@ static int ipoib_cm_post_receive(struct net_device *dev, int id) struct ib_recv_wr *bad_wr; int i, ret; - priv-cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv-cm.rx_wr.wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV; for (i = 0; i IPOIB_CM_RX_SG; ++i) priv-cm.rx_sge[i].addr = priv-cm.srq_ring[id].mapping[i]; @@ -346,7 +346,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc-wr_id ~IPOIB_CM_OP_SRQ; + unsigned int wr_id = wc-wr_id ~(IPOIB_OP_CM | IPOIB_OP_RECV); struct sk_buff *skb; struct ipoib_cm_rx *p; unsigned long flags; @@ -433,7 +433,7 @@ static inline int post_send(struct ipoib_dev_priv *priv, priv-tx_sge.addr = addr; priv-tx_sge.length = len; - priv-tx_wr.wr_id = wr_id; + priv-tx_wr.wr_id = wr_id | IPOIB_OP_CM; return ib_post_send(tx-qp, priv-tx_wr, bad_wr); } @@ -484,20 +484,19 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ dev-trans_start = jiffies; ++tx-tx_head; - if (tx-tx_head - tx-tx_tail == ipoib_sendq_size) { + if (++priv-tx_outstanding == ipoib_sendq_size) { ipoib_dbg(priv, TX ring 0x%x full, stopping kernel net queue\n, tx-qp-qp_num); netif_stop_queue(dev); - set_bit(IPOIB_FLAG_NETIF_STOPPED, tx-flags); } } } -static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx, - struct ib_wc *wc) +void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc-wr_id; + struct ipoib_cm_tx *tx = wc-qp-qp_context; + unsigned int wr_id = wc-wr_id ~IPOIB_OP_CM; struct ipoib_tx_buf *tx_req; unsigned long flags; @@ -522,11 +521,10 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx spin_lock_irqsave(priv-tx_lock, flags); ++tx-tx_tail; - if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, tx-flags)) - tx-tx_head - tx-tx_tail = ipoib_sendq_size 1) { - clear_bit(IPOIB_FLAG_NETIF_STOPPED, tx-flags); + if (unlikely(--priv-tx_outstanding == ipoib_sendq_size 1) + netif_queue_stopped(dev) + test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) netif_wake_queue(dev); - } if (wc-status != IB_WC_SUCCESS wc-status != IB_WC_WR_FLUSH_ERR) { @@ -551,8 +549,17 @@ static
Re: [openib-general] please pull for 2.6.21: fix + add IB multicast support
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: please pull for 2.6.21: fix + add IB multicast support I merged the increment port number and remove redundant '_wq' patches from git.openfabrics.org/~shefty/scm/rdma-dev.git for-roland I plan to review to multicast stuff next week and I hope to merge it for 2.6.21. Or, have you or anyone else at Voltaire read over the code in addition to using it? Do you see anything that should be cleaned up? I looked at the code briefly, don't have much time at the moment unfortunately. +static void join_group(struct mcast_group *group, struct mcast_member *member, + u8 join_state) +{ + member-state = MCAST_MEMBER; + adjust_membership(group, join_state, 1); + group-rec.join_state |= join_state; + member-multicast.rec = group-rec; + member-multicast.rec.join_state = join_state; + list_del(member-list); + list_add(member-list, group-active_list); +} Can be just list_move. Patch allocates everything with kzalloc, but then goes ahead and initialize everything. So just kmalloc it - no reason to waste initialized memory if non-initialized will do. List of places: + member = kzalloc(sizeof *member, gfp_mask); + if (!member) + return ERR_PTR(-ENOMEM); Same here: + group = kzalloc(sizeof *group, gfp_mask); + if (!group) + return NULL; + and same here: + iter = kzalloc(sizeof *iter + attr_size, GFP_KERNEL); + if (!iter) + return ERR_PTR(-ENOMEM); + It seems same goes for + mc = kzalloc(sizeof(*mc), GFP_KERNEL); + if (!mc) + return NULL; in ucma.c - everything gets initied by calling function - but a bit less sure, needs checking. By the way, it seems same goes for + bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL); + if (!bind_list) + return -ENOMEM; in cma_alloc_any_port in the port randomization patch that was merged and for cma_alloc_port in existing code. -- MSTYou seem to be careful to do list_del_init for member-list all over, ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] please pull for 2.6.21: fix + add IB multicast support
Or, have you or anyone else at Voltaire read over the code in addition to using it? Do you see anything that should be cleaned up? OK, I most the the review i did (and interaction with Sean to add changes) was on the rdma_cm: add multicast communication support patch, and i was less focused on the ib_sa: track multicast join/leave requests patch, however i recall that there were some discussions between Sean and Michael and they reached an agreement. Yes, we reached an agreement. These patches have also seen some limited testing in the OFED tree. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH 0/5] iw_cxgb3 - misc cleanup and fixes
Quoting r. Steve Wise [EMAIL PROTECTED]: Subject: Re: [PATCH 0/5] iw_cxgb3 - misc cleanup and fixes On Fri, 2007-02-09 at 08:51 +0200, Michael S. Tsirkin wrote: Also I agree with MST, I would like to see the core/ subdirectory die completely. ok ok...I'll kill the subdir... It's not just the directory BTW. Stuff like building completions in t3_cqe format and then reformatting to ib_wc seems to be much more confusing (and some of it is actually on datapath). The t3_cqe format is built BY THE HW. I understand, I did not get that. But for example create_read_req_cqe builds it in software. It could build ib_wc instead. ... Having to wade through 3 driver-specific layers of abstractions just because I want to for example change API in ib_verbs.h and need to update all drivers will be very taxing. I understand your design calls for 2 layers, but at least the API exposed by code in drivers/net is fairly small, while cxio_wr.h declares 27 structures which seem to just duplicate ib_verbs.h. cxio_wr.h is hw format. You want me to change ib_verbs.h to make WRs and CQEs align with Chelsio hardware? No, but it need not be part of interface. The reason I was confused is because you seem to create an extra copy e.g. for t3_cqe. cxio_poll_cq currently creates an intermediate copy of the completion on the stack, I think it could format ib_wc directly instead. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Open MPI rpmbuild fails in OFED-1.2
Quoting Jeff Squyres [EMAIL PROTECTED]: Subject: Re: Open MPI rpmbuild fails in OFED-1.2 On Feb 7, 2007, at 2:49 PM, Michael S. Tsirkin wrote: My $0.02: This is another in a growing list of issues reflecting the whole build everything in DESTDIR is a problematic approach. I don't know much about RPM, and I am not exactly sure why are our source RPMs so complicated. It's a combination of two things: 1) similar to what you said below, we have lots of software packages that are all dependent upon each other -- this leads to a conglomeration of rpath's and shared library dependencies that are incorrect 2) we're trying to *use* the software when it is installed in the DESTDIR -- this means that you have to put special-case in the software so that they look for support files in both the DESTDIR *and* the final installation directory How do you mean, use? Hmm. I guess my question is - this works fine when I run OFED's configure script, why is SRPM so much more difficult? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] [PATCH] IB/ipoib_cm: fix up issues from code review
The following lightly tested patch addresses Roland's comments on IPoIB CM. Applies on top of PATCHv6: - Randomise RQ PSN - Fix for modular IPv6 - MTU endian-ness fix for ICMPs - Cosmetics Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED] --- Roland, do you want me to report the full fixed-up patch instead? Pls let me know when IPoIB CM is in for-2.6.21, I'll switch to that for my testing. diff --git a/drivers/infiniband/ulp/ipoib/Kconfig b/drivers/infiniband/ulp/ipoib/Kconfig index 0ffca11..af78ccc 100644 --- a/drivers/infiniband/ulp/ipoib/Kconfig +++ b/drivers/infiniband/ulp/ipoib/Kconfig @@ -1,6 +1,6 @@ config INFINIBAND_IPOIB tristate IP-over-InfiniBand - depends on INFINIBAND NETDEVICES INET + depends on INFINIBAND NETDEVICES INET (IPV6 || IPV6=n) ---help--- Support for the IP-over-InfiniBand protocol (IPoIB). This transports IP packets over InfiniBand so you can use your IB diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 8082d50..eb885ee 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -127,7 +127,6 @@ struct ipoib_tx_buf { u64 mapping; }; -#ifdef CONFIG_INFINIBAND_IPOIB_CM struct ib_cm_id; struct ipoib_cm_data { @@ -181,7 +180,6 @@ struct ipoib_cm_dev_priv { struct ib_recv_wr rx_wr; }; -#endif /* * Device private locking: tx_lock protects members used in TX fast * path (and we use LLTX so upper layers don't do extra locking). diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index e7e7cc0..8ee6f06 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -37,7 +37,7 @@ #include net/dst.h #include net/icmp.h -#ifdef CONFIG_IPV6 +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include linux/icmpv6.h #endif @@ -170,7 +170,8 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev, } static int ipoib_cm_modify_rx_qp(struct net_device *dev, - struct ib_cm_id *cm_id, struct ib_qp *qp) + struct ib_cm_id *cm_id, struct ib_qp *qp, + unsigned psn) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -193,7 +194,7 @@ static int ipoib_cm_modify_rx_qp(struct net_device *dev, ipoib_warn(priv, failed to init QP attr for RTR: %d\n, ret); return ret; } - qp_attr.rq_psn = 0 /* FIXME */; + qp_attr.rq_psn = psn; ret = ib_modify_qp(qp, qp_attr, qp_attr_mask); if (ret) { ipoib_warn(priv, failed to modify QP to RTR: %d\n, ret); @@ -203,7 +204,8 @@ static int ipoib_cm_modify_rx_qp(struct net_device *dev, } static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id, -struct ib_qp *qp, struct ib_cm_req_event_param *req) +struct ib_qp *qp, struct ib_cm_req_event_param *req, +unsigned psn) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_cm_data data = {}; @@ -219,7 +221,7 @@ static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id, rep.target_ack_delay = 20; /* FIXME */ rep.srq = 1; rep.qp_num = qp-qp_num; - rep.starting_psn = 0 /* FIXME */; + rep.starting_psn = psn; return ib_send_cm_rep(cm_id, rep); } @@ -229,6 +231,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_cm_rx *p; unsigned long flags; + unsigned psn; int ret; ipoib_dbg(priv, REQ arrived\n); @@ -243,11 +246,12 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even goto err_qp; } - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p-qp); + psn = random32() 0xff; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p-qp, psn); if (ret) goto err_modify; - ret = ipoib_cm_send_rep(dev, cm_id, p-qp, event-param.req_rcvd); + ret = ipoib_cm_send_rep(dev, cm_id, p-qp, event-param.req_rcvd, psn); if (ret) { ipoib_warn(priv, failed to send REP: %d\n, ret); goto err_rep; @@ -742,7 +746,7 @@ static int ipoib_cm_send_req(struct net_device *dev, req.retry_count = 0; /* RFC draft warns against retries */ req.rnr_retry_count = 0; /* RFC draft warns against retries */ req.max_cm_retries= 15; - req.srq = 15; + req.srq = 1; return ib_send_cm_req(id, req); } @@ -1041,7 +1045,7 @@ static void ipoib_cm_skb_reap(struct work_struct *work
Re: [openib-general] more comments on cxgb3
- It seems that by passing in huge resource sizes, userspace will be able to drink up unlimited amounts of kernel memory. mthca handles this by using the mlock rlimit, should something be done here as well? Hmm. That's a good point. I'll put this on the todo as well. So mthca adds to process's rlimit value as things are allocated out of kernel memory (or maybe even anything that gets pinned)? Yes. The code is actually in uverbs core, mthca uses that. - I wonder about the names like get_mhp - what does mhp mean? static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid) { return idr_find(rhp-mmidr, mmid); } Memory Handle Pointer. hmm, what's a Handle? Maybe a better name can be found. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [Fwd: Re: win related [was: Re: [PATCH 1/2] opensm: sigusr1: syslog() fixes]]
Well, the way I see it one can take code from the Linux part under the BSD licance and use it in The windows part. The otherway around seems fine to me but some say that since the windows BSD liscance Reqires that some text will always remain there, the other way around is not possibale. As I'm not an Expert in that erea I don't know who is right. Interesting. Where does this idea come from? AFAIK BSD license is well known to be GPL-compatible, so there should be no problem moving code in either direction. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [Fwd: Re: win related [was: Re: [PATCH 1/2] opensm: sigusr1: syslog() fixes]]
Quoting r. Michael S. Tsirkin [EMAIL PROTECTED]: Subject: Re: [Fwd: Re: win related [was: Re: [PATCH 1/2] opensm: sigusr1: syslog() fixes]] Well, the way I see it one can take code from the Linux part under the BSD licance and use it in The windows part. The otherway around seems fine to me but some say that since the windows BSD liscance Reqires that some text will always remain there, the other way around is not possibale. As I'm not an Expert in that erea I don't know who is right. Interesting. Where does this idea come from? Maybe this? http://www.gnu.org/philosophy/bsd.html Note that openib license does not include the advertising clause. AFAIK BSD license is well known to be GPL-compatible, so there should be no problem moving code in either direction. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] IB/ipoib_cm: fix up issues from code review
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCH] IB/ipoib_cm: fix up issues from code review OK, I pulled this in and fixed it to build with the netdevice class_device-ectomy that just went upstream, and pushed it out on my for-2.6.21 branch like this. Thanks! -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH 0/5] iw_cxgb3 - misc cleanup and fixes
Also I agree with MST, I would like to see the core/ subdirectory die completely. ok ok...I'll kill the subdir... It's not just the directory BTW. Stuff like building completions in t3_cqe format and then reformatting to ib_wc seems to be much more confusing (and some of it is actually on datapath). Same goes for t3_wq and I suspect everything else defined in cxio_wr.h - please, use the native types from include/rdma/. Having to wade through 3 driver-specific layers of abstractions just because I want to for example change API in ib_verbs.h and need to update all drivers will be very taxing. I understand your design calls for 2 layers, but at least the API exposed by code in drivers/net is fairly small, while cxio_wr.h declares 27 structures which seem to just duplicate ib_verbs.h. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport-IWCM workaroundfor ip_dev_find() bug.
Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [openib-general] [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport-IWCM workaroundfor ip_dev_find() bug. Michael, From your email, it sounded like you would regression test this. Not yet, we had lab restructuring - hopefully next week. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH 0/5] iw_cxgb3 - misc cleanup and fixes
And here's what I have pending in for-2.6.21 so far: What about the mthca memory registration patches? I thought they are on their way. Should I repost? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] IB/ipoib get net_device from ipoib_neigh instead of linux neighbour
Quoting Moni Shoua [EMAIL PROTECTED]: Subject: Re: [PATCH] IB/ipoib get net_device from ipoib_neigh instead of linux neighbour Another concern: assume that one device goes away (e.g. hotplug). It seems that neighbours whose dev field point to another device, will not be destroyed. Correct? I agree. Therefore in your design, it seems that to_ipoib_neigh()-dev will get us a pointer to device that has been removed already. I agree that this is a problem. I think we can solve this if we track all ipoib neighbours, like we do for old kernels, and then flush ipoib neighbours on any hotplug event. Roland, does this sound too awful? It think it would be best to prevent an IPoIB device from disappearing or from ib_ipoib from being unloaded as long as IPoIB device is a slave. Unfortunately, I don't see how this can be done just by fixing something in bonding or IPoIB. So hotplug is blocked potentially forever? This does not sound good. However, any slave knows he has a master (dev-master). What do you think about a solution where IPoIB first tries to clean up the neighbours that belong to it's master before deleting the IPoIB device? How? Furthermore, bond_setup_by_slave is called only for non Ethernet devices (we consider to change the logic to called only for IPoIB devices just for safety). Why is this necessary, BTW? If we don't do that, we get a memory leak because the neigh destructor will never be called for non IPoIB devices although they carry ipoib_neigh with them. How can this happen? If it does, I think we are back to where we started: to_ipoib_neigh is broken for non-IPoIB device. I thought you said only devices of the same type can be paired? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCHv6 RFC] IPoIB CM Experimental support
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCHv6 RFC] IPoIB CM Experimental support Well, randomness is a resource after all, and since we don't have the additional security provided by PSNs in IPoIB UD, it seemed we do not need it for IPoIB CM either. So maybe the right thing is just to remove the FIXME comment. random32() doesn't use up any entropy. Random PSNs help avoid problems with stale connections, so I think we should do it. Well, stale connections don't pose any real problems for IPoIB CM - worst case a connnection is torn down and recreated. But I don't have a strong opinion anyway - that's why I put the FIXME there. So I'm OK with random32, too. I noticed some funny code in ipoib_cm_skb_reap(): __be32 mtu = cpu_to_be32(priv-mcast_mtu); // htonl(__be32)?? icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); // no htonl() here -- is this correct? icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, dev); what is the right thing? Both are right I think. These two functions seem to accept parameters in different format: include/net/icmp.h:extern void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info); include/linux/icmpv6.h:extern voidicmpv6_send(struct sk_buff *skb, include/linux/icmpv6.h- int type, int code, include/linux/icmpv6.h- __u32 info, include/linux/icmpv6.h- struct net_device *dev); BTW, I just looked at ip_gre.c and it has the same code. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Open MPI rpmbuild fails in OFED-1.2
Quoting Jeff Squyres [EMAIL PROTECTED]: Subject: Re: Open MPI rpmbuild fails in OFED-1.2 My $0.02: This is another in a growing list of issues reflecting the whole build everything in DESTDIR is a problematic approach. I don't know much about RPM, and I am not exactly sure why are our source RPMs so complicated. However, with the plan configure/make we are able to build all openfabrics components within build directory, without any chroot tricks. So let's not give up yet, IMO it is very nice to be able to build in standard environment, without being root. Note that what is biting us here is mostly the large number of modules: simple single-module packages don't have this problem - and this is really a design decision we took. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCHv6 RFC] IPoIB CM Experimental support
Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCHv6 RFC] IPoIB CM Experimental support I noticed some funny code in ipoib_cm_skb_reap(): __be32 mtu = cpu_to_be32(priv-mcast_mtu); // htonl(__be32)?? icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); // no htonl() here -- is this correct? icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, dev); what is the right thing? Both are right I think. You're right -- the mistake is making mtu __be32 and preswapping it. I'll fix it up in my tree. Let me know when you push it out, I'll start testing it. These two functions seem to accept parameters in different format: include/net/icmp.h:extern void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info); include/linux/icmpv6.h:extern voidicmpv6_send(struct sk_buff *skb, include/linux/icmpv6.h- int type, int code, include/linux/icmpv6.h- __u32 info, include/linux/icmpv6.h- struct net_device *dev); BTW, I just looked at ip_gre.c and it has the same code. no, it leaves mtu as an int rather than swapping it. You are right of course. sparse would have found it. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] RFC ofed 1 2 kernel file structure
Repost. Could everyone please look at git://git.openfabrics.org/~mst/newofed.git and tell me whether this looks acceptable? Thanks, MST Quoting r. Michael S. Tsirkin [EMAIL PROTECTED]: Subject: Re: idea for ofed 1 2 kernel file structure Quoting Michael S. Tsirkin [EMAIL PROTECTED]: It would easy to split OFED specific files In separate directory and have OFED All out of tree modules we distribute would go there too. What do others think about this? OK, I didn't quite get whether the majority likes this or not, so I created such a repository, extracted the ofed specific history and imported it there. Take a look here: git://git.openfabrics.org/~mst/newofed.git Build scripts will have to be adjusted to add necessary kernel components that we use. Another nice thing about this layout, is that users (if they so wish) will be able to use just linux kernel source tarball instead of full linux kernel git. OFED maintainers, you are the primary users of the OFED git. Please comment which layout is better for you. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] RFC ofed 1 2 kernel file structure
Quoting Sean Hefty [EMAIL PROTECTED]: Subject: Re: [openib-general] RFC ofed 1 2 kernel file structure Michael S. Tsirkin wrote: Repost. Could everyone please look at git://git.openfabrics.org/~mst/newofed.git and tell me whether this looks acceptable? I don't see anything listed for this off of the web site, and cloning it produces an empty tree. Pls try again now. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] RFC ofed 1 2 kernel file structure
Quoting r. Hoang-Nam Nguyen [EMAIL PROTECTED]: Subject: Re: [openib-general] RFC ofed 1 2 kernel file structure Hi Michael, Repost. Could everyone please look at git://git.openfabrics.org/~mst/newofed.git and tell me whether this looks acceptable? I could clone it: $git clone git://git.openfabrics.org/~mst/newofed.git fatal: Unable to look up git.openfabrics.org (Temporary failure in name resolution) fetch-pack from 'git://git.openfabrics.org/~mst/newofed.git' failed. $git clone git://git.openfabrics.org/~mst/newofed.git fatal: Unable to look up git.openfabrics.org (Temporary failure in name resolution) fetch-pack from 'git://git.openfabrics.org/~mst/newofed.git' failed. I tried to use web git pointing to http://www.openfabrics.org/git/?p=~mst/newofed.git;a=tree and got this: 403 Forbidden - Reading tree failed Is there something else I need to pay attention of? Pls try again. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] RFC ofed 1 2 kernel file structure
Quoting Bryan O'Sullivan [EMAIL PROTECTED]: Subject: Re: RFC ofed 1 2 kernel file structure Michael S. Tsirkin wrote: All, pls try now. This is similar in layout to the sort of tree we've used internally all along, so it's fine by me. One small problem: I don't like the combination of lower and upper case names of makefile and Makefile in the top-level directory. ofed_1_2 has the same. Also, it's no longer obvious to me to tell what kernel version the sources are pulled from. I used to be able to check the top-level Makefile or git history, but I no longer know what to look at. This will be part of BUILD_ID. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] more comments on cxgb3
OK, so I looked at cxgb3 some more. To summarise my previous comments, I think the cxio hal layer needs to go to make the code readable - if I understand correctly it is there for historical reasons only. I started looking at userspace/kernel interaction, and then went over to other code under cxgb3 (but not core/). - Consider a user that does e.g. create QP, but never calls mmap. Is there some code that will clean out the unclamed mmap object? I couldn't find it, and iwch_dealloc_ucontext does not seem to do anything with it. - Passing physical address to userspace and back looks suspicios. Especially this: uresp.physaddr = virt_to_phys(chp-cq.queue); Could you elaborate on the design here? What are these phy addresses and how come userspace needs to know the phy address? You are not doing DMA by this address, by any chance? - It seems that by passing in huge resource sizes, userspace will be able to drink up unlimited amounts of kernel memory. mthca handles this by using the mlock rlimit, should something be done here as well? A couple of comments on PDBG macro. - I'd like to suggest following the practice of prefixing macro names with module name (same goes for functions like get_mhp really) - unless they are local to file. - You are using __FUNCTION__ a lot - it might be to just to kill it, messages are unique so you'll be able to locate the msg source anyway, save some kernel text and logs will be shorter. In any case I think __func__ is the recommended gcc way to get the name currently. - comment near pr_debug definition in include/linux/kernel.h says: /* If you are writing a driver, please use dev_dbg instead */ so it might be a good idea for PDBG to follow this rule. - log messages do not look very informative to me. I also think they are a bit too many of them currently. For example, I do not think it is a good idea to print the kernel pointers out. For example, in code like the following: mhp = get_mhp(rhp, (sg_list[i].lkey) 8); if (!mhp) { PDBG(%s %d\n, __FUNCTION__, __LINE__); return -EIO; } might be better to say MR key XXX does not exist. Exiting.. -EIO also looks like a strange error code to return here, does it not? Maybe something like EINVAL would be more appropriate? - I wonder about the names like get_mhp - what does mhp mean? static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid) { return idr_find(rhp-mmidr, mmid); } Looks like it looks up an mr. Is that right? Maybe the name shouldbe changed to convey this meaning. - In the following code, what does missing pdid check mean? /* * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now. */ This sounds interesting. Does this mean the code does not validate the PD currently? I have the same question for: /* TBD: check perms */ in iwch_bind_mw. BTW, does TBD stand for To Be Done here? google says: Definitions of TBD on the Web: * To Be Determined, Defined, Decided. www.csr.com/ptot.htm * to be determined www.liberalsagainstterrorism.com/wiki/index.php/Counterinsurgency_Operations/Glossary * Treasury Board (Secretariat) www.psc-cfp.gc.ca/centres/definitions_and_notes_e.htm * The three letter abbreviation TBD may be/mean, depending on context: * an acronym for To Be Determined (...at a later point in time., typically)* the Douglas Devastator, a US Navy torpedo bomber of World War II en.wikipedia.org/wiki/TBD What is to be determined here? Do you mean TODO really? - iwch_sgl2pbl_map is used in several places, and seems a bit too big to be inline Well, it's time to go do my day job now :) Hope this helps, -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Backport and fix patches for ipath driver
Quoting Bryan O'Sullivan [EMAIL PROTECTED]: Subject: Backport and fix patches for ipath driver Hi, Vlad and Tziporet - Here's a round of fix and backport patches for the ipath driver, for dropping into the OFED 1.2 tree. The way in which they're organised should, I hope, be clear. Looks good, fixes look much cleaner than what we had for OFED 1.1. I think fixes can be applied already. However, I'm not sure the backports are ready to be applied as is yet. Just taking a look at random: ./backport/2.6.18/ipath-50-mad-kmem_cache-2.6.19.patch BACKPORT - kmem_cache_t disappeared after 2.6.19 diff -r a290ff6e9ae7 drivers/infiniband/core/mad.c --- a/drivers/infiniband/core/mad.c Wed Jan 31 14:47:02 2007 -0800 +++ b/drivers/infiniband/core/mad.c Wed Jan 31 14:48:00 2007 -0800 @@ -46,7 +46,7 @@ MODULE_AUTHOR(Hal Rosenstock); MODULE_AUTHOR(Hal Rosenstock); MODULE_AUTHOR(Sean Hefty); -static struct kmem_cache *ib_mad_cache; +static kmem_cache_t *ib_mad_cache; static struct list_head ib_mad_port_list; static u32 ib_mad_client_id = 0; This changes a core file, and does not seem to be related to ipath at all. What problem does this solve? I note that mad.c already seems to build fine on 2.6.18 for us - this is part of daily build. Another example that looks strange: BACKPORT - workqueues changed in 2.6.20 diff -r 8b94fcef1edd drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.cThu Feb 01 08:54:29 2007 -0800 +++ b/drivers/infiniband/hw/ipath/ipath_driver.cThu Feb 01 08:57:19 2007 -0800 @@ -241,7 +241,7 @@ static struct ipath_devdata *ipath_alloc dd-pcidev = pdev; pci_set_drvdata(pdev, dd); - INIT_DELAYED_WORK(dd-link_work, check_link_status); + INIT_WORK(dd-link_work, check_link_status); list_add(dd-ipath_list, ipath_dev_list); INIT_DELAYED_WORK is implemented in kernel_addons, so this should not be necessary. @@ -725,6 +725,7 @@ static void __devexit ipath_remove_one(s */ ipath_shutdown_device(dd); +#undef cancel_delayed_work cancel_delayed_work(dd-link_work); flush_scheduled_work(); This undef looks quite ugly. What does it do? Please go over the backport patches and check whether they are really necessary. I think you will mostly discover that the kernel_addons mechanism makes the backport patches unnecessary. If not, you should try adding things under kernel_addons as first choice so that everyone benefits. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] QoS in opensm will not be part of OFED 1.2
Quoting Michael S. Tsirkin [EMAIL PROTECTED]: Subject: Re: QoS in opensm will not be part of OFED 1.2 I had an AI to check the QoS status with OSM. Conclusions are that QoS support in OpenSM will not be part of OFED 1.2 (I updated the plan on the Wiki) The reasons for this are: 1. Code not ready at code freeze. 2. There are technical discussion in the list regarding some implementation details (e.g. XML or text syntax). 3. SPEC is not published by IBTA yet. I think this last reason also applies to the end client QoS changes as well. Yes. But the other 2 don't. Right but I think that precludes it from being included in OFED right now. Since the code is already included in OFED, moving it out would violate the feature freeze rules, unless there's an actual bug this would fix. OTOH, you are right in that without SM support we can't claim to have this feature at all. So, to avoid controversy, I have just removed the QoS patches from IB core and pushed the code out. I think that the mthca patch to encode SL in sched_queue field to improve hardware QoS guarantees for connected QPs is useful as this can be exercised by IPoIB-CM. If so, should/can this be included ? OK. Note this is still untested, and off by default. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] idea for ofed 1 2 kernel file structure
Quoting Michael S. Tsirkin [EMAIL PROTECTED]: Subject: idea for ofed 1 2 kernel file structure Hi! I looked a current ofed 1.2 kernel tree and there is 1 thing I dislike: It is hard to see changes that are specific to OFED since we have whole kernel history mixed in. It would easy to split OFED specific files In separate directory and have OFED scripts combine that with upstream kernel. All out of tree modules we distribute would go there too. What do others think about this? OK, I didn't quite get whether the majority likes this or not, so I created such a repository, extracted the ofed specific history and imported it there. Take a look here: git://git.openfabrics.org/~mst/newofed.git Build scripts will have to be adjusted to add necessary kernel components that we use. Another nice thing about this layout, is that users (if they so wish) will be able to use just linux kernel source tarball instead of full linux kernel git. OFED maintainers, you are the primary users of the OFED git. Please comment which layout is better for you. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] IB/ipoib get net_device from ipoib_neigh instead of linux neighbour
-- IPoIB uses a two layer neighboring scheme, such that for each struct neighbour whose device is an ipoib one, there is a struct ipoib_neigh buddy which is created on demand at the tx flow by an ipoib_neigh_alloc(skb-dst-neighbour) call. When using the bonding driver, neighbours are created by the net stack on behalf of the bonding (master) device. On the tx flow the bonding code gets an skb such that skb-dev points to the master device, it changes this skb to point on the slave device and calls the slave hard_start_xmit function. Combing these two flows, there is a hole if some code at ipoib (ipoib_neigh_destructor) assumes that for each struct neighbour it gets, n-dev is an ipoib device so for example netdev_priv(n-dev) would be of type struct ipoib_dev_priv. Could you plese elaborate how ipoib_neigh_destructor comes to be called at all? At what point does ipoib_neigh_setup_dev get called? To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one. What I am concerned with is - if the master is not an IPoIB device, what guarantee do we have that to_ipoib_neigh will return 0 and not part of an actual hardware address? Without bonding, the reason is that dev points to an ipoib device, so we know hw address is 20 bytes. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport - IWCM workaround for ip_dev_find() bug.
Quoting Sean Hefty [EMAIL PROTECTED]: Subject: Re: [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport - IWCM workaround for ip_dev_find() bug. Actually, yes it does. Here's one case (that I just tested :): If you rdma_bind() to an explicit address local address, it will fail. Foo! I guess I'll need to address the uses of ip_dev_find() in addr.c as well before we commit this. Can we just backport our own version of ip_dev_find()? We had this once before in svn when they removed it from being exported from the kernel. Yes, this is in kernel_addons for 2.6.19 or something like that. Just copy from there, much cleaner than the patch. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport -IWCM workaround for ip_dev_find() bug.
Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [openib-general] [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport -IWCM workaround for ip_dev_find() bug. Can we just backport our own version of ip_dev_find()? We had this once before in svn when they removed it from being exported from the kernel. Yes, this is in kernel_addons for 2.6.19 or something like that. Just copy from there, much cleaner than the patch. I just realized that ip_dev_find() is being redefined to xxx_ip_dev_find for sles9sp3. So maybe this function is causing the error. Stay tuned. xxx_ip_dev_find() is returning the wrong interface (sometimes). I added printks to xxx_ip_dev_find(). Then I ran rping -s -a local ip addr and it failed because xxx_ip_dev_find() returned loopback instead of my eth device. Here is the function with printks: static inline struct net_device *xxx_ip_dev_find(u32 addr) { struct net_device *dev; u32 ip; read_lock(dev_base_lock); printk(%s looking for dev with addr %x\n, __FUNCTION__, addr); for (dev = dev_base; dev; dev = dev-next) { ip = inet_select_addr(dev, 0, RT_SCOPE_LINK); printk(%s dev %p name %s ipaddr %x\n, __FUNCTION__, dev, dev-name, ip); if (ip == addr) { dev_hold(dev); break; } } read_unlock(dev_base_lock); return dev; } Here is the printk log showing loopback being returned: xxx_ip_dev_find looking for dev with addr 8846a8c0 xxx_ip_dev_find dev 804000e0 name lo ipaddr 8846a8c0 The address bound to eth3 is 192.168.70.136 (0xc0a84688). For some reason, this line: ip = inet_select_addr(dev, 0, RT_SCOPE_LINK); Returns the 192.168.70.136 address for device-name == lo. Riddle me that! Also, sometimes it works ok because the loopback interface gets some other ip address that is assigned to the local system as opposed to my rdma address. For example, I booted up the sles9sp3 system with a rebuilt kernel and no ofed modules installed. The system gets 10.10.0.136 via DHCP for its public interface. I then built the ofed modules and installed them. I then loaded them and configured my rnic interface with 192.168.70.136. I ran rping and bound to the local ipaddr and it worked. The log showed that inet_select_addr() returned 10.10.0.136 for loopback and thus xxx_ip_dev_find() continued walking the list and found the correct ethernet interface. I then rebooted and ran the test again and it failed. So somehow module load order affects this, I think. g. Try copying inet_select_addr source in from some upstream kernel, look at that. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport -IWCM workaround for ip_dev_find() bug.
How shall I fix this? Patch? -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport -IWCM workaroundfor ip_dev_find() bug.
Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport -IWCM workaroundfor ip_dev_find() bug. On Tue, 2007-02-06 at 23:36 +0200, Michael S. Tsirkin wrote: How shall I fix this? Patch? Riiight. I'm afraid if I use HOST instead of LINK that I'll break some strange SDP loopback feature or some such thing. And I'm not in a position to test that. But I can post a patch. Shall I just change sles9sp3 since we don't see (yet) any problems with the other distros? If you post one that updates all kernels it will be easier to test. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport -IWCM workaroundfor ip_dev_find() bug.
Quoting Steve Wise [EMAIL PROTECTED]: Subject: Re: [PATCH] [RFC] ofed_1_2 - SLES9SP3 Backport -IWCM workaroundfor ip_dev_find() bug. Here it is (only tested with rping over iWARP on sles9sp3): xxx_ip_dev_find() must use scope HOST. From: Steve Wise [EMAIL PROTECTED] Function xxx_ip_dev_find(RT_SCOPE_LINK) returns the wrong interface on some kernels. The correct scope is RT_SCOPE_HOST. Signed-off-by: Steve Wise [EMAIL PROTECTED] OK. I don't have access to the lab at the moment, but hope to test this by next week. -- MST ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general