Re: [ewg] Mellanox target workaround in SRP
I think that the patch is specific for srp initiator using Mellanox FMR. It tried to avoid indirect desc with Mellanox FMR having first-byte-offset != 0. Since the low level implementation of mlx4/mthca_map_phys_fmr() did not create + setup MPT for FMR with first_byte_offset != 0. The corruption can happen with any target. I don't think this could be right -- right now the workaround only triggers if the target has a Mellanox OUI, so if what you say is true, presumably everyone who is using the SRP initiator with mlx4 would be seeing this problem. Also, the SRP initiator code that uses ib_fmr_pool_map_phys does not pass in any non-aligned addresses -- it doesn't try to use any first byte offset, it just uses the virtual address it passes to the target to handle the offset. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Mellanox target workaround in SRP
Maybe we can use MST's current email to ask him... Michael, do you have any memory of the issue we worked around here? I have question regarding workaround introduced in commit 559ce8f1 of the mainline tree: IB/srp: Work around data corruption bug on Mellanox targets Data corruption has been seen with Mellanox SRP targets when FMRs create a memory region with I/O virtual address != 0. Add a workaround that disables FMR merging for Mellanox targets (OUI 0002c9). I don't see how this can make a difference to the target -- it sees an address and length, and there should be no visible difference to it when it gets an FMR versus a direct-mapped region of the same space, right? And how is it different than getting a direct or indirect descriptor with a similar offset? I could see there being a bug on the initiator HCA not liking such FMR mappings, but then it should be keyed off of the vendor of our HCA and not the target. I'm sure this was tested and shown to fix the problem; I'm just confused as to what the problem really was and if this is still relevant. Can someone please enlighten me? ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Mellanox target workaround in SRP
I'm sure this was tested and shown to fix the problem; I'm just confused as to what the problem really was and if this is still relevant. Can someone please enlighten me? At this point I'm afraid it's all lost in the mists of time, but the original patch seems to have come from http://lists.openfabrics.org/pipermail/general/2006-July/024322.html looking at the patch, I would guess that the corruption occurred when the target got an IO request that started at a non-page-aligned address but that spanned more than one page. I don't know if the target was ever fixed, or whether that target code has any relevance today. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] IPoIB to Ethernet routing performance
This may be due to the fact that the IB MTU is 2048. Every 1500 bytes packet is padded to 2048 bytes before being sent through the wire, so you're loosing roughly 25% bandwidth compared to an IPoIB MTU which is a multiple of 2048. This isn't true. IB packets are only padded to a multiple of 4 bytes. However there's no point in using IPoIB connected mode to pass packets smaller than the IB MTU -- you might as well use datagram mode. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH 00/10] Add fcoe, fcoib drivers for mlx4 device
Unfortunately it will probably be a while before I can look at this in any detail, but here are some obvious things to fix: - The actual FC/SCSI driver will need to be reviewed on linux-s...@vger.kernel.org of course; so please CC that list. - The ordering/split of patches seems off; in particular I notice that the patch that adds the new drivers into the build system comes before the patch that actually adds the new drivers, which is obviously wrong (since the kernel will be broken if we apply patch 9/10 but not 10/10). - None of the patch subjects contain a subject that says what part of the kernel they touch. For example, instead of [PATCH 01/10] pre-reserve MPTs for FC please write: [PATCH 01/10] mlx4_core: Pre-reserve MPTs for FC otherwise I waste time having to fix it up by hand. (And also if you want to do things exactly write, please capitalize the subject but do not put in a final '.' But I usually do not complain about that level of detail) - Every patch seems to have the changelog and diffstat twice. Please fix (and include a --- between the changelog and diffstat). If you prepare a clean git tree, then git format-patch will get everything exactly correct. - There seem to be a lot of whitespace and other formatting problems. Please run patches through checkpatch.pl and fix any real problems that it finds (0 warnings are not necessary since some warnings are false positives; but I suspect for these patches most won't be). -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs
We thought about this too, but in some special cases, we do not know the correct page size of a memory range. For example when getting a 16M chunk from a 16M huge page region which is also aligned to 16M, the first madvise() will work fine and the code will assume that the page size is 64K. I see ... yes, that does break my idea completely. OK, another half-baked idea: what if we pay attention to when madvise(DOFORK) fails as well as well madvise(DONTFORK) fails, and use that as a hit that we better check the page size? Perhaps this adds too much complexity ... in which case your idea: As this issue was not present in version 2 of the code, but there we had a big performance penalty, I suggest the following: we could go back to version 2 and introduce a new RDMAV_HUGEPAGE_SAFE env variable to let the user decide between huge page support and better performance (the same approach we use for the COW protection itself). seems like a reasonable alternative. Thanks, Roland -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs
When registering two memory regions A and B from within the same huge page, we will end up with one node in the tree which covers the whole huge page after registering A. When the second MR is registered, a node is created with the MR size rounded to the system page size (as there is no need to call madvise(), it is not noticed that MR B is part of a huge page). Now if MR A is deregistered before MR B, I see that the tree containing mem_nodes is empty afterwards, which causes problems for the deregistration of MR B, leaving the tree in a corrupted state with negative refcounts. This also breaks later registrations of other memory regions within this huge page. Good thing I didn't get around to applying the patch yet ;) I haven't thought this through fully, but it seems that maybe we could extend the madvise tracking tree to keep track of the page size used for each node in the tree. Then for the registration of MR B above, we would find the node for MR A covered MR B and we should be able to get the ref counting right. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH v4] IB Core: RAW ETH support
There is no qp type IBV_QPT_RAW_ETY in user space (at least not in the definitions coming with libibverbs). In fact, libibverbs that comes with OFED defines (in verbs.h) a qp type called IBV_QPT_RAW_ETT which equals to 7. The patch that is under discussion here adds a new qp type IB_QPT_RAW_ETH and equals it to 7 to match the definition in user space. This indeed changes the value of IB_QPT_RAW_ETY to 8 but I don't see who can be affected since 1. No user space program that uses IB_QPT_RAW_ETY exists 2. kernel is compiled as one piece of code. Why renumber the _ETY enum? Maybe it doesn't break anything serious but why risk it? -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH v4] IB Core: RAW ETH support
I tested it before on the Roland tree ( iboe branch ) and it fails, because it writen in the way suitable for OFED. If adapt the patch to the Roland tree, then appling Mellanox OFED patches will fail, because it changes the same functions in the code. Here is one example: Look at __mlx4_ib_modify_qp at the Roland tree - there is no RAW_ETY support. But in the OFED version of the same function this support is present. RAW_ETH patch modify this function and looking for RAW_ETY word and without this RAW_ETH Mellanox patch will fail. Don't take this too personally -- I picked a semi-random email in this thread to reply to; this is pretty broadly targeted. rant What the hell is the thinking behind introducing IB_QPT_RAW_ETH? You're inserting an enum value before IB_QPT_RAW_ETY, so any old userspace passing in IB_QPT_RAW_ETY will silently get different behavior depending on the kernel version. And you're creating two constands that differ in a single letter (IB_QPT_RAW_ETY vs. IB_QPT_RAW_ETH). How are you going to explain that to users? How is anyone ever going to get it right? For that matter, what exactly does IB_QPT_RAW_ETH mean? This all seems to be a symptom of how broken our development process is. Yes, unfortunately I can't spend as much time reviewing and applying patches as I might like, and I apologize for that. But if we have all the RDMA developers piling up shit in their little area and then sending it on to be merged as soon as it kind of works, without thinking about design or maintainability and without ever doing any review, then I'm always going to have an expanding review backlog. And then we have OFED compounding problems -- Oh that's a nice pile of shit you've built there. We better ship it to users while it's still steaming. How about if OFED developers take a little time to think things through? /rant In other words, can someone explain the plan for this raw QP stuff to me? - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs
Thanks, nice work. I like this approach. Alex (Vainman) any comments on this? - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] Handling busy responses from the SA
The purpose of this patch is to cause the ib_mad driver to discard busy responses from the SA, effectively causing busy responses to become time outs. I don't have a strong opinion on this but it seems a bit odd. If we're just going to drop the response anyway, why did the SA send it in the first place? On the other hand, if the SA told us it's busy, it does seem we could do something more sensible than retrying immediately. Any opinions from anyone who worked on fabric scalability? +printk(KERN_NOTICE PFX Response returned with MAD_STATUS_BUSY\n); Do we want to spam kernel logs with this? Seems it could generate a lot of messages. +#define IB_MGMT_MAD_STATUS_SUCCESS 0x +#define IB_MGMT_MAD_STATUS_BUSY 0x0001 +#define IB_MGMT_MAD_STATUS_REDIRECT_REQD0x0002 +#define IB_MGMT_MAD_STATUS_BAD_VERERSION0x0004 +#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD 0x0008 +#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD_ATTRIB0x000c +#define IB_MGMT_MAD_STATUS_INVALID_ATTRIB_VALUE 0x001c The indentation of values seems pretty crazy here. Also I'm not sure what most of these defines are for? They seem unused in this patch. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Allowing ib dignostics to be run without being logged in as root.
$ cat /etc/udev/rules.d/80-ib-umad.rules KERNEL==umad*, NAME=infiniband/%k, MODE=0666 It is not the same. Your propose to expose /dev/infiniband/umad device access to all world, which is obviously even more dangerous than SUIDing diagnostic programs. Well, different threats. Making umad files world-writable means anyone can inject whatever MADs they want to into the fabric. On the other hand, if an arbitrary code execution security hole is found in a diagnostic program, then having it SUID root means the hole becomes a local root exploit. It's hard to assess which is really more dangerous. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs
without patch: 1M memory region120usec 16M memory region 1970usec with patch v2: 1M memory region172usec 16M memory region 2030usec So if I read this correctly this patch introduces almost a 50% overhead in the 1M case... and probably much worse (as a fraction) in say the 64K or 4K case. I wonder if that's acceptable? Alex's original approach was to try the memadvise with normal page size and then fall back to huge page size if that failed. But of course that wastes some time on the failed madvise in the hugepage case. I think it would be interesting to compare timings for registering, say, 4K, 64K, 1M and 16M regions with and without huge page backing, for three possibilities: - unpatched libibverbs (will obviously fail on hugepage backed regions) - patched with this v2 - alternate patch that tries madvise and only does your /proc/pid/smaps - parsing if the first madvise fails. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Question: When should patches be submitted to EWG and when should they be submitted to linux-rdma?
The subject says it all. If I have a patch that can be applied against either the current OFED git repository or against the upstream kernel - where do I post it? What do you want to happen to the patch? If you want it applied to the upstream kernel, then send it to me and linux-rdma. If you want it applied to an OFED tree, send it to ewg. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 05/11] ib_core: IBoE UD packet packing support
I thought you were referring to the changes made by this patch 920d706. Should I re-send this patch? Yes, please work out what changes are still required and send a new patch. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 04/11] ib_core: IBoE CMA device binding
I'm having a hard time working out why the iboe case needs to schedule to a work queue here since its already in process context, right? It seems it would be really preferable to avoid all the extra pointer munging and reference counting, and just call things directly. I assume that the caller might attempt to acquire the same lock when calling join and in the callback. Specifically, ucma_join_multicast() calls rdma_join_multicast() with file-mut acquired and ucma_event_handler() does the same. I see... we can't call the consumer's callback directly since it might have locking assumptions. It would be nice if we didn't have this reference counting sometimes used and sometimes not used. I'll have to think about whether this can be made cleaner. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] mlx4_en: ping does not resume after failover
Also keep in mind that mlx4_en patches typically go through Davem and netdev, not my tree. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
Also I'm wondering why you did the count stuff to ignore IBoE-only devices in multicast.c but not sa_query.c? It seems to me the right place to put this logic as the mutlicast code registers as an IB client and the add_one implemntation is multicast.c. Right, I'm not saying it shouldn't be in multicast.c. I'm just wondering why you don't have the same thing for sa_query.c. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
One reason is that get_src_path_mask() may access ah_lock. I don't see that: static u8 get_src_path_mask(struct ib_device *device, u8 port_num) { struct ib_sa_device *sa_dev; struct ib_sa_port *port; unsigned long flags; u8 src_path_mask; sa_dev = ib_get_client_data(device, sa_client); if (!sa_dev) return 0x7f; so if we never add the sa_client structure it just returns. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 05/11] ib_core: IBoE UD packet packing support
Right, it's already applied. Doesn't seem to be all there... eg there is nothing for ethernet headers. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace
Basically, what I want to understand is why does this change make sense? @@ -1139,6 +1139,10 @@ struct ib_device { struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad); +int(*get_eth_l2_addr)(struct ib_device *device, u8 port, + union ib_gid *dgid, int sgid_idx, + u8 *mac, u16 *vlan_id, u8 *tagged); + Yes, that was pretty much my original question. Why do we have a verb for userspace to call a device-specific method to do the mapping? The layering seems wrong somewhere if we have a generic verb to do this mapping, but then put the mapping in device-specific code. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [RFC] libibverbs: ibv_fork_init() and libhugetlbfs
* added get_huge_page_size() to read the huge page size from /proc/meminfo. This is done at ibv_fork_init() time. That doesn't work on systems that have multiple huge page sizes (eg powerpc). You can compare the code to get the size in libhugetlbfs. Also I think the munging through /proc/pid/maps doesn't really work. First of all, essentially grepping for libhugetlbfs is not robust as I mentioned -- this will break in the same way for apps using huge pages directly. And while it is nice to be able to tell if a range came from libhugetlbfs, I think there may be some bad performance impact from reading /proc/pid/maps line-by-line. (And by the way, as a trivial optimization, it would make sense to me to check the address of each map before doing the strstr). - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
@@ -1017,9 +1020,12 @@ static void ib_sa_add_one(struct ib_device *device) sa_dev-end_port = e; for (i = 0; i = e - s; ++i) { +spin_lock_init(sa_dev-port[i].ah_lock); +if (rdma_port_link_layer(device, i + 1) != IB_LINK_LAYER_INFINIBAND) +continue; Not sure I understand why you move the initialization of the spinlock up here? It seems we ignore everything that might have to do with spinlock if this is an IBoE port. But the larger issue is what if someone calls one of the ib_sa_XXX_query functions on an IBoE port? Seems we just crash on uninitialized structures. I guess we're assuming that the kernel is smart enough not to do that? Also I'm wondering why you did the count stuff to ignore IBoE-only devices in multicast.c but not sa_query.c? - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 04/11] ib_core: IBoE CMA device binding
int ib_init_ah_from_path(struct ib_device *device, u8 port_num, - struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr) + struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr, + int force_grh) Rather than this change in API, could we just have this function look at the link layer, and if it's ethernet, then always set the GRH flag? Seems simpler than requiring the upper layers to do this and then pass the result in? -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 04/11] ib_core: IBoE CMA device binding
+static void iboe_mcast_work_handler(struct work_struct *work) +{ +struct iboe_mcast_work *mw = container_of(work, struct iboe_mcast_work, work); +struct cma_multicast *mc = mw-mc; +struct ib_sa_multicast *m = mc-multicast.ib; + +mc-multicast.ib-context = mc; +cma_ib_mc_handler(0, m); +kref_put(mc-mcref, release_mc); +kfree(mw); +} I'm having a hard time working out why the iboe case needs to schedule to a work queue here since its already in process context, right? It seems it would be really preferable to avoid all the extra pointer munging and reference counting, and just call things directly. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 05/11] ib_core: IBoE UD packet packing support
2. Fix wrong implementation of ib_ud_header_init(). A different patch was sent to Roland. This patch no longer applies, I guess because you already sent me this fix (which is now upstream since 2.6.34-rc1). -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 04/11] ib_core: IBoE CMA device binding
Multicast GIDs are always mapped to multicast MACs as is done in IPv6. Some helper functions are added to ib_addr.h. IPv4 multicast is enabled by translating IPv4 multicast addresses to IPv6 multicast as described in http://www.mail-archive.com/i...@sunroof.eng.sun.com/msg02134.html. I guess it's a bit unfortunate that the RoCE annex completely ignored how to map multicast GIDs to ethernet addresses (I suppose as part of the larger decision to ignore address resolution entirely). Anyway, looking at the email message you reference, it seems to be someone asking what the right way to map IPv4 multicast addresses to IPv6 addresses is. Is there a more definitive document you can point to? It seems that unfortunately the way the layering of addresses is done, there's no way to just use the standard mapping of IPv4 multicast addresses to Ethernet addresses (since IBoE is does addressing via the CMA mapping to GIDs followed by an unspecified mapping from GIDs to Ethernet addresses). - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace
Add ib_uverbs_get_eth_l2_addr() to allow ibv_create_ah() to resolve sgid, dgid to vlan, dmac for any gid type. Although user-space might bypass this call for link-local gids, it is better not to replicate the kernel resolution policy. Port link layer is also returned by ibv_query_port(). A high-level comment/question, followed by some notes about the specific patch. At the highest level, is having this very low-level command exposed as part of the kernel uverbs - userspace API the right place to split things? Making the Ethernet address resolution part of the low-level driver implies that it's not really a generic part of the verbs interface. Maybe it is generic, and we should have a generic function instead of calling into the low-level driver. I see the argument that we should keep the policy in the kernel, although I'm not sure how strong that argument is -- once we start shipping a kernel with a certain policy (and I guess OFED has in effect already done that), how could we ever change that policy? We'll have interoperability issues anyway, so it seems having userspace and kernel use different policies doesn't cause much further problems anyway. Or maybe it is device-specific, and we could wrap it up into the create AH uverbs call we already have? Low-level comments: +ssize_t ib_uverbs_get_eth_l2_addr(struct ib_uverbs_file *file, const char __user *buf, + int in_len, int out_len) +{ +struct ib_uverbs_get_eth_l2_addr cmd; +struct ib_uverbs_get_eth_l2_addr_resp resp; +int ret; +struct ib_pd*pd; + +if (out_len sizeof resp) +return -ENOSPC; + +if (copy_from_user(cmd, buf, sizeof cmd)) +return -EFAULT; + +pd = idr_read_pd(cmd.pd_handle, file-ucontext); +if (!pd) +return -EINVAL; + +ret = ib_get_eth_l2_addr(pd-device, cmd.port, (union ib_gid *)cmd.gid, + cmd.sgid_idx, resp.mac, resp.vlan_id, resp.tagged); +put_pd_read(pd); +if (!ret) { +if (copy_to_user((void __user *) (unsigned long) cmd.response, + resp, sizeof resp)) +return -EFAULT; This leaks kernel memory contents to userspace since the stack variable resp is never cleared. Also will cause problems if we ever need to use the reserved fields for anything. Also I'm not sure I understand why you pass the PD into this method? It seems you just use it to get a pointer to the device, but you already have that from the uverbs_file structure that's passed into all commands. +int ib_get_eth_l2_addr(struct ib_device *device, u8 port, union ib_gid *gid, + int sgid_idx, u8 *mac, __u16 *vlan_id, u8 *tagged) +{ +if (!device-get_eth_l2_addr) +return -ENOSYS; + +return device-get_eth_l2_addr(device, port, gid, sgid_idx, mac, vlan_id, tagged); +} +EXPORT_SYMBOL(ib_get_eth_l2_addr); I don't think we need this wrapper, since uverbs can just call the get_eth_l2_addr method directly (we already do that for quite a few other methods, eg alloc_ucontext is a uverbs-only method that has no in-kernel wrapper). Also the -ENOSYS test isn't necessary, since a device-driver shouldn't allow this method unless it actually implements it. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 08/11] mlx4: Allow interfaces to correspond to each other
+void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port) +{ +return mlx4_find_get_prot_dev(dev, proto, port); +} +EXPORT_SYMBOL(mlx4_get_prot_dev); Not sure I understand why you have a wrapper to call another function with exactly the same parameters? Can't we get rid of this and just rename mlx4_find_get_prot_dev() to mlx4_get_prot_dev()? -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 09/11] mlx4: Add support for IBoE - address resolution
+u8 mac_0_1[2]; +u8 mac_2_5[4]; This looks a bit odd. Any reason why you don't just have u8 mac[6]; in this structure? -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 06/11] ipoib: avoid ipoib over IBoE
OK, I applied this with just the first chunk. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [RFC] libibverbs: ibv_fork_init() and libhugetlbfs
When fork support is enabled in libibverbs, madvise() is called for every memory page that is registered as a memory region. Memory ranges that are passed to madvise() must be page aligned and the size must be a multiple of the page size. libibverbs uses sysconf(_SC_PAGESIZE) to find out the system page size and rounds all ranges passed to reg_mr() according to this page size. When memory from libhugetlbfs is passed to reg_mr(), this does not work as the page size for this memory range might be different (e.g. 16Mb). So libibverbs would have to use the huge page size to calculate a page aligned range for madvise. Yes, Alex Vainman reaised this same issue a while ago. The patch below demonstrates a possible solution for this. It parses the /proc/PID/maps file when registering a memory region and decides if the memory that is to be registered is part of a libhugetlbfs range or not. If so, a page size of 16Mb is used to align the memory range passed to madvise(). We see two problems with this: it is not a very elegant solution to parse the procfs file and the 16Mb are hardcoded currently. The latter point could be solved by calling gethugepagesize() from libhugetlbfs, which would add a new dependency to libibverbs. I think that we cannot assume huge pages only come from libhugetlbfs -- we should support an application directly enabling huge pages (possibly via another library too, so we can't assume that an application knows the page size for a memory range it is about to register). And also the 16 MB page size constant is of course not feasible -- with all due respect, the x86 page size of 2 MB is much more likely in practice :) (Although perhaps the much slower PowerPC TLB refill makes users more likely to try and use hugetlb pages ;) Alex suggested parsing files in the same way as libhugetlbfs does to get the page size, and that seems to be the best solution, since I don't think the libhugetlbfs license is compatible with the BSD license for libibverbs. But your trick of using /proc/*/maps looks nice. Does that only work for libhugetlbfs or can we recognize direct mmap of hugetlb pages? - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] mlx4_core: request MSIX vectors as much as there CPU cores
We found it in performance work of our EN (10G) driver What does this have to do with performance? Do you have a system where num_possible_cpus() differs significantly from num_online_cpus()? What kernel is that with? As far as I can tell, for x86 they should only be different if there genuinely are CPUs that are available for hotplug. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 03/11] IB/umad: Enable support only for IB ports
Why do we not allow umad for IBoE ports? I understand there's no QP0 but why can't userspace use QP1 just like for IB link layer ports? -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 01/11] ib core: Add link layer property to ports
Hi Eli, I'm hoping to get this IBoE stuff in for 2.6.35. I started an iboe branch in my tree (similar to the xrc branch I've been carrying for a while), and I added this patch in, except I renamed rdma_port_link_layer() to rdma_port_get_link_layer(). This seems to match rdma_node_get_transport() better. In any case as I add patches to my branch, you can stop worrying about them, which should make keeping this series updated easier. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 06/11] ipoib: avoid ipoib over IBoE
@@ -1383,6 +1385,9 @@ static void ipoib_remove_one(struct ib_device *device) dev_list = ib_get_client_data(device, ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { +if (rdma_port_link_layer(device, priv-port) != IB_LINK_LAYER_INFINIBAND) +continue; Why do we need this chunk here? How could a netdev get on our list if we never create IPoIB interfaces for IBoE ports? - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
@@ -795,11 +799,12 @@ static void mcast_add_one(struct ib_device *device) struct mcast_device *dev; struct mcast_port *port; int i; +int count = 0; if (rdma_node_get_transport(device-node_type) != RDMA_TRANSPORT_IB) return; -dev = kmalloc(sizeof *dev + device-phys_port_cnt * sizeof *port, +dev = kzalloc(sizeof *dev + device-phys_port_cnt * sizeof *port, @@ -1007,7 +1010,7 @@ static void ib_sa_add_one(struct ib_device *device) e = device-phys_port_cnt; } -sa_dev = kmalloc(sizeof *sa_dev + +sa_dev = kzalloc(sizeof *sa_dev + Do you happen to remember why you needed these kmalloc - kzalloc conversions? -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Version of OFED in RHEL6?
So what is the answer to the OP's original question though? Probably this is not really a question for this list, but surely, somebody here knows... Given the above statement about getting OFED from the upstream kernel rather than an OFA release, what does that mean in terms of which actual version of OFED will be in RHEL6? I think this whole discussion is kind of ridiculous. OFED is a distribution of RDMA software, with the kernel as one of its main upstream sources. So you could ask what version of the kernel a given OFED release is based on. But getting OFED from the upstream kernel is nonsense -- RHEL6 is getting RDMA drivers from the upstream kernel. So asking what version of OFED will be in RHEL6 is a bit like asking what version of SLES will be in RHEL6. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Version of OFED in RHEL6?
I think this whole discussion is kind of ridiculous. I'm the OP. It might be naive but I don't think it's ridiculous, Yes, I wasn't replying to your original question. Just the continuing discussion seemed silly -- once the answer RHEL6 won't contain OFED came, continuing to ask, OK, but what version of OFED will it contain? is ridiculous I think. OFED is a distribution of RDMA software, with the kernel as one of its main upstream sources. So you could ask what version of the kernel a given OFED release is based on. But getting OFED from the upstream kernel is nonsense -- RHEL6 is getting RDMA drivers from the upstream kernel. When I looked at an OFED 1.4 release, I saw a ton of stuff that didn't look like it had much to do with RDMA. For example, consider all the commands whose name begins with ib. Sorry, I was using RDMA as a generic term to cover both iWARP and InfiniBand. It's true that OFED contains a lot of stuff not strictly having to do with remote direct memory access. The reason I asked this in the first place is that the cluster software package I use, Rocks, needs help in how it supports IB. I was trying to do something about this. One approach would be to just use whatever comes with RHEL, but the talk mentioned below doesn't say anything about what version of OFED this will be. Getting up to speed with IB sure is difficult. Yes, this shows all the unfortunate confusion caused by trying to brand OFED as something special. What version of, say, the IP stack does RHEL contain? You don't care, right? In general, RHEL is going to contain version XYZ of the kernel, plus patches A, B, C, along with version UVW of libfoo and version MNO of libbar, etc. The InfiniBand/RDMA world shouldn't be special -- the rest of the world deals with that just fine, without having to tie RHEL's versioning to someone else's distribution. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] retry_cnt, rnr_retry and ibv_get_cq_event/ibv_req_notify_cq
If retry_cnt and rnr_retry are set to 7 (toggle the RETRY_COUNT #define at the top of the example code) as lots of documentation and examples suggest, I never get any events in my CQ. Setting it to another value, e.g. 6 causes the appropriate WCs to be generated. an rnr_retry of 7 means unlimited RNR retries -- ie if no receive is available at the receiver, then the sender will continue trying forever. For full details, you can look at the RNR NAK mechanism in the IB spec. Furthermore, I was wondering if there is a way to use ibv_get_cq_event/ibv_req_notify_cq to do a non-blocking poll for CQ events? I have the option in the attached code, but I've found that I miss some CQ events, for example, when posting 100 sends, I will only get back about 3--5 WCs containing error statuses. If I poll, I get all the WCs. There seems to be some confusion here between CQ events (which are single-shot notifications generated when a completion is enqueued into an armed CQ -- if another completion is added before notification is requested again, then another event is not generated). So for 100 work requests, you may not always get 100 CQ events, but you should always get 100 work completions enqueued on the CQ eventually. In any case, it is certainly fine to do poll(), epoll, SIGIO, whatever on the completion channel file descriptor. You can do fcntl to set it to nonblocking mode too. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
good debugging, applied thanks. I do worry (as Moni mentioned) that this doesn't explain why you would get send failures in this case, but the patch itself is well-explained and looks obviously correct so I think we should apply it. -- Roland Dreier rola...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
Sorry, I was referring to my patch not Eli's. Heh, I never would have said anything about your patch was obvious. I skimmed yours once but I do want to read it more carefully. Did you ever say what test case you are using to provoke the problem you're fixing? -- Roland Dreier rola...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] ib/ehca: fix in_wc handling in process_mad()
thanks, applied. -- Roland Dreier rola...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] [for-2.6.33] rdma/cm: revert associating an RDMA device when binding to loopback
OK, I'm planning on sending this upstream later today. Looks very small and simple, and then we can figure our what if anything we want to do for 2.6.34. Make sense for everyone? - R. -- Roland Dreier rola...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] bug 1918 - openmpi broken due to rdma-cm changes
That should be the patch in question. I'm not sure about reaching consensus. :) If the other changes to the rdma_cm aren't closely tied to that change, we may be able to back that one patch out until we can get whatever other fix may be needed. I'd like to do this approach. Then re-submit once we come to consensus... That makes sense to me. Someone please send me a tested revert. -- Roland Dreier rola...@cisco.com Cisco.com - http://www.cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] bug 1918 - openmpi broken due to rdma-cm changes
Is this only an iwarp issue? IE do all IB devices support hw loopback? And will all future devices support it (IE is it an IBTA requirement)? I do think IBA requires loopback to work. Can't quote chapter verse off the top of my head. -- Roland Dreier rola...@cisco.com Cisco.com - http://www.cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] EWG/OFED meeting minutes for Jan 25 2010
- Add - uMMU notifier kernel module - In general we wish to add it, Tziporet will try to find someone that can do it soon Please do *NOT* add ummunotify to OFED. The upstream kernel has explicitly rejected it (at least for the moment). It adds a huge maintenance and compatibility headache if OFED ships ummunotify and then upstream takes an incompatible version. Effort would be much better spent getting ummunotify or something like it into shape where upstream will merge it. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Does anybody have experience using Mellanox ConnectX cards with Ubuntu?
I haven't installed any drivers on this system as my understanding was that they are already included in the newer kernels, and although I don't have a lot of linux knowledge, it does appear that mlx4_core is running: You also need to load mlx4_ib, either modprobe mlx4_ib or add a line mlx4_ib to your /etc/modules. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Is this the right place for questions on OFED?
I'm trying to install OFED on Ubuntu 9.10 without much success. OFED 1.5 is failing with an error: Why do you want to install OFED? Ubuntu 9.10 already has pretty up-to-date IB/RDMA support included natively (eg aptitude install libibverbs1 etc). What are you missing? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [PATCH] rdmaoe/libibverbs: handle binary compatibility
#define IB_USER_VERBS_MIN_ABI_VERSION 1 -#define IB_USER_VERBS_MAX_ABI_VERSION 6 +#define IB_USER_VERBS_MAX_ABI_VERSION 7 Whats this about? That seems like it needs a much bigger review, changing the kernel ABI version instantly breaks every existing libibverbs, shouldn't be done without alot of discussion!! Yes, I've been meaning to bring this up earlier. There was a time, long ago, when changing this ABI was maybe OK. I think even when I first designed the uverbs ABI, having this version instead of capability bits was a goof. But I don't have a time machine. In any case I think at this point we need to stick with ABI version 6 will full backwards compat unless we really really can't. And I don't think the changes here are nearly drastic enough that we can't figure out a way forward without breaking things. Why do you have IBV_LINK_LAYER_UNSPECIFIED ? That seems pointless. Why should existing kernels return UNSPECIFIED when we know they are always IB. I'd say drop IBV_LINK_LAYER_UNSPECIFIED and set IBV_LINK_LAYER_INFINIBAND to 0. I prefer having the UNSPECIFIED. Not a strong preference but my reasoning is that if you have an old kernel that doesn't answer anything, you're better off letting the app see that. And there's no reason why iWARP has to run over ethernet in principle. Maybe I'm wrong but I don't like setting don't know magically to IB behind the scenes. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] mlx4: Fix bug in mlx4_ib_mcg_attach
This bug doesn't seem to ever have been present in the upstream kernel -- what are you generating this patch against? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] IPoIB: synchronize timer deletion with completion handler
When calling del_timer_sync on priv-poll_timer, it is necessary to prevent farther arming of the timer which can be done by a completion handler. Though it is harmless since the timer will eventually stop being rearmed, it is better practice to avoid calling the timer function after it is deleted. This patch handles this by using a new flag that is checked before arming the timer. have you seen this in practice? If it can happen then it's not harmless, since the module could be unloaded with the timer pending. However I don't see how it could happen, since we only seem to delete the timer after we know that no more completions are coming (except for the case where we decide that the hardware is wedged but it really only takes a *long* time to respond at exactly the wrong time, and we somehow get a completion between the del_timer_sync and the modify QP to reset state -- which is so unlikely it seems not worth adding this extra complexity for -- maybe we could add the del_timer_sync to after we delete the CQ or something if you're really worried) ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] mlx4: Fix bug in mlx4_ib_mcg_attach
This bug doesn't seem to ever have been present in the upstream kernel -- what are you generating this patch against? I think it came from your for-next branch. I don't see anything touching this code there. The patch that introduced this code upstream, 521e575b (IB/mlx4: Add support for blocking multicast loopback packets) doesn't have this bug and I don't see anything else that changed that area of the code. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [ofw] SC'09 BOF - Meeting notes
RFC 4291, Appendix A. Thanks for the pointer. As far as I can tell from reading some IPv6 stuff, it really is broken to try to go from a link-local IPv6 address back to a L2 ethernet address. For example, RFC 2464 (pointed to by RFC 4291) says: Ethernet Address The 48 bit Ethernet IEEE 802 address, in canonical bit order. This is the address the interface currently responds to, and may be different from the built-in address used to derive the Interface Identifier. It really seems to be setting ourselves up for trouble not to use neighbor discovery to map IPv6 addresses to link-layer addresses. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [ofw] SC'09 BOF - Meeting notes
In any case, this is not a correctness issue that prohibits experimentation with rdmaoe multicast on any network today. I agree -- nothing prevents experimentation. I am just leery about making invasive changes to the core stack in the absence of any documented design for IBoE (that I've seen at least). - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [ofw] SC'09 BOF - Meeting notes
How can 1500 lines out of 240k lines be a big change.. do I have these numbers right - is the big change you are referring too? If there are significant changes to the core APIs -- and IBoE has exactly this impact -- then yes it can be a big change even if the line count is small. What is the risk area that you are worried about .. do you think it will break current transports or existing ULPs ? I am worried that no one has thought through all the issues and corner cases around address resolution, multicast, etc, and that when we do get a standardized version of IBoE, we'll have to break core APIs yet again. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] RE: SC'09 BOF - Meeting notes and Final Slides
WinOF has asynchronous interfaces for modify QP, and limited testing has shown that it can improve connection times. QP transitions are probably the second largest component of connection setup after the SA. Since the RDMA CM already provides an asynchronous interface, even existing apps may be able to be able to gain some benefit from this. can't anyone get async modify QP today on any platform by just doing the operation in another thread (or thread pool)? It seems that the operations themselves are heavy enough that thread dispatch, locking etc is going to be significant overhead. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [ofw] SC'09 BOF - Meeting notes
Having lots of testing exposure can help in validating that all the edge cases are handled.. To some extent -- but there also needs to be some thinking involved to make sure that the interface can actually handle future cases. Are there a set of cases that you have in mind ? For example -- how is multicast going to interact with IGMP on ethernet switches? How is address resolution going to be done (current patches seem to assume that stateless IPv6 link-local addresses contain the ethernet address, which is not valid if RFC 3041 is used)? etc - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] mlx4: remove limitation on LSO header size
+ *blh = unlikely(halign 64) ? 1 : 0; This idiom of (boolean condition) ? 1 : 0 looks odd to me... doesn't (halign 64) already evaluate to 1 or 0 anyway? Does the unlikely() actually affect code generation here? True, (halign 64) is the same and is cleaner. As for the unlikely() -- well it's already been there and besides, we're never sure if it will improve anything so the same question could be asked for other places in the code. I was just wondering in this case where you are just assigning the boolean value of the expression to a variable how unlikely affects things. I can understand for conditional jumps how the compiler can choose to make the likely case more efficient, but when there are no jumps then I was just curious how the hint could help. I assume this initialization is to avoid a compiler warning. But the code is actually correct without initializing blh -- so I think that we save a tiny bit of code by doing uninitialized_var() instead? We must initialize blh since it is used for any send request and not just LSO opcodes. So then this patch was buggy because blh was not reinitialized as we loop through multiple work requests? eg an LSO request followed by a non-LSO request? Anyway I'll look over the newer patch. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] mlx4: remove limitation on LSO header size
+*blh = unlikely(halign 64) ? 1 : 0; This idiom of (boolean condition) ? 1 : 0 looks odd to me... doesn't (halign 64) already evaluate to 1 or 0 anyway? Does the unlikely() actually affect code generation here? With that said, see below... +int blh = 0; I assume this initialization is to avoid a compiler warning. But the code is actually correct without initializing blh -- so I think that we save a tiny bit of code by doing uninitialized_var() instead? +(blh ? cpu_to_be32(1 6) : 0); ...given that the only use of blh is as a flag to decide what constant to use here, does it generate better code to make blh be __be32 and set the value directly in build_lso_seg, ie do: *blh = unlikely(halign 64) ? cpu_to_be32(1 6) : 0; and then use blh without ?: in mlx4_ib_post_send... - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] rping is not resolving ipv6 addresses
I have been testing rping with ipv6 address on OFED 1.5. I am seeing a problem with ipv6 address resolution. rping is not always sending neighbor solicitation out the correct interface. Running tcpdumps I discovered that the neighbor solicitation is being sent out the first interface that was configured, in my case eth0 and not the ib0 interface. This is the test I am running: Host A: $ ip address show ib0 | grep inet6 inet6 fe80::202:c903:1:1925/64 scope link $ rping -s -v -a ::0 Host B: $ rping -c -v -a fe80::202:c903:1:1925 cma event RDMA_CM_EVENT_ADDR_ERROR, error -110 waiting for addr/route resolution state 1 Using tcpdump I verified the neighbor solicitation packets went out the eth0 interface not ib0. If I ifdown eth0 I see the neighbor solicitation go out ib0 and the rping works. BTW: Ping6 works ok; however, it requires that I specify the interface to use on the command line: $ ping6 fe80::202:c903:1:1925%ib0 I'm not sure how many places in the stack have to be fixed for this, but for this to work certainly at least rping needs a way to specify which interface to use. There is no routing for IPv6 link-local addresses that says which device to use... a given address with prefix fe80: could conceivably exist on multiple links (and correspond to different destinations on each link), and the networking stack has no way to know which link to use unless you tell it somehow. (This is why ping6 requires the %ib0 specifier) ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] IB/ehca: Fix CQE flags reporting
applied, thanks ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] IB/ehca: Construct MAD redirect replies from request MAD
this seems reasonable to me, applied, thanks. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD
Given that you seem to like the rest of the code and Jason hasn't spoken up yet, I think we can have Roland merge this patch. Roland, what do you think? I don't see any problem with the idea and this does sound like a step forward, so I am planning on merging this (pending review). ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] mlx4_core: map sufficient ICM memory for EQs
Thanks, applied with a few cleanups: ilog2(roundup_pow_of_two()) - order_base_2() xxx * (1 yy) - xxx yy ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ofa-general] Re: [ewg] [Patch mthca backport] Don't use kmalloc 128k
And I don't think the upstream kernel has that limit on kmalloc size either (at least with SLUB, not sure about SLAB). This patch was actually written as an emulation of the upstream SLUB behavior, which is exactly the same thing: on large allocations forward to __g_f_p(). See include/linux/slub_def.h's definition of kmalloc_large and kmalloc. Right. But does upstream SLAB also pass through to the page allocator the same as SLUB? How about SLQB? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ofa-general] Re: [ewg] [Patch mthca backport] Don't use kmalloc 128k
This will fix the 2^20 bits limit on our bitmaps once and for all. Not really... since getting 128KB of contiguous memory is likely to fail anyway. And I don't think the upstream kernel has that limit on kmalloc size either (at least with SLUB, not sure about SLAB). Really the long-term fix is to handle non-contiguous memory in the bitmap allocator. maybe using vmalloc(), although I always hate big allocations with vmalloc too. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] ehca: use port autodetect mode as default
looks good, applied ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] install.pl: Install QIB driver instead of Ipath
This patch installs the qib driver which replaces the ipath driver in OFED-1.5. Maybe I missed some discussion of this. But what is the QIB driver? What are you planning to do to support qlogic HCAs for the mainline kernel? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [PATCH 2.6.31 try 2] ehca: Tolerate dynamic memory operations and huge pages
applied... ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [PATCH 2.6.31 try 2] ehca: Tolerate dynamic memory operations and huge pages
thanks, applied. -#define HCAD_VERSION 0026 +#define HCAD_VERSION 0027 the driver version was already 0027 (since bde2cfaf), so I dropped this chunk. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [PATCH 2/9] ib_core: kernel API for GID -- MAC translations
on top of which trees would like me to rebase the patches on and update the ABI version? I guess you could base it on my xrc branch but as I said I'm not planning on merging this until I'm done with the XRC stuff. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID --MAC translations
Hum, This is a very tricky subject. Co-mingling the IB GID address space and the IPv6 address space like this is not really something that was envisioned from the IBA side. Doesn't the IB spec say that an IB GID *is* an IPv6 address? So in theory it should be OK; however I don't think in practice anyone paid attention to making sure that the IB GID space works as an IPv6 address. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID --MAC translations
It is like an IPv6 address but it was expressly envisioned to be a seperate space. The IBA authors copied many of the conventions from IPv6 for numbering this new space, like link local, and multicast prefixes, but it was not intended to be co-mingled. Well (I've quoted this many times): IBA section 4.1: A GID is a valid 128-bit IPv6 address (per RFC 2373) with additional properties / restrictions defined within IBA... People often try to claim that this sentence doesn't mean what it very explicitly and clearly says, and certainly I believe that existing practice doesn't comply with the IBA spec, but I don't see how anyone can say that a truly compliant IB GID is not a real IPv6 address. So, I didn't look closely enough, but what was the ethertype that is used here in this patch set? Hopefully not IPv6. I don't think it's specified in the code -- presumably in HCA FW. Which is an issue as you say -- do we have an IEEE ethertype yet? And if we don't use the IPv6 ethertype, then is multicast going to work well (if the code is moved away from just broadcasting everything)? I doubt that switch IGMP snooping works well for non-IP ethertypes -- in fact I wonder how well existing switches handle IPv6 multicast ;) - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID --MAC translations
Hmm, murky indeed. Your point about IGMPv6 is well made. The problem is that IB GRHs are not IPv6 headers and have different numerology for the Next Header field. Ie in IPv6 Next Header 0x1B is RFC 908, while in GRH it is a BTH. Labeling GRHs with an IPv6 ethertype is fundamentally wrong. Yes, but the next header is the only issue I know of. Since 0x1b is already assigned as an IPv6 next header protocol, we would have to get a new value assigned. However once a non-conflicting value is chosen, then an IB GRH really is an IPv6 header and in that case I think using the IPv6 ethertype too would make things work much better -- eg IB traffic actually could be forwarded by an IPv6 router with no additional work required. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
Yeah, the notifier code remains untouched as we still do not allow dynamic memory operations _while_ our module is loaded. The patch allows the driver to cope with DMEM operations that happened before the module was loaded, which might result in a non-contiguous memory layout. When the driver registers its global memory region in the system, the memory layout must be considered. We chose the term toleration instead of support to illustrate this. I see. So things just silently broke in some cases when the driver was loaded after operations you didn't tolerate? Anyway, thanks for the explanation. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ofa-general] RE: [ewg] [PATCH 0/9] RDMAoE - RDMA over Ethernet-- some procedural questions
I think that in any case, OFA needs to have a consistent policy, and if we allow something that is not a standard for one member, it should be allowed for all members. Agreed; I think that this is my central point -- thanks for saying it succinctly! Regardless of whether OF asked Oracle to submit RDS or not, it's not associated with any standard (I'm not picking on RDS; I'm picking on the OF rules and selectively applying them). Therefore, either the bylaws are wrong of OF/EWG is wrong. The bylaw in question seems pretty silly given the lack of control or involvement that OFA has with Linux kernel development. Given that RDS is not standardized at the API or wire level and given that RDS is included in the Linux kernel, what options does the OFA have for enforcing its bylaws? Removing RDS from OFED? Once OFED has *fewer* features than the standard kernel it becomes pretty pointless; maybe the logical conclusion is that OFA should get out of the distribution business (my feelings about ending OFED are well-known I think). The same applies to IBoE; if (and that really is if since I don't think a conclusion about merging IBoE support into the kernel has been reached) IBoE goes into the kernel but OFED can't or won't distribute it, then the relevance of OFED becomes marginal. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [ofa-general] [PATCH 0/9] RDMAoE - RDMA over Ethernet
If I might chime in here...I've been working to actively squash the expression 'IBoE' or any variation that includes IB in the name. The reason is because the InfiniBand Architecture is defined as a cohesive solution that includes five layers of the OSI stack. Hence using the expression IBoE creates unnecessary confusion by implying that RDMAoE includes elements of IB from the five OSI layers, which is not an accurate description of the proposal. Seems pretty silly to me -- if I build an IB router that carries IB over the WAN on MPLS, is it not InfiniBand because it uses the wrong layer 1? Similarly the worlds seems to cope with talking about FCoE rather than SCSIoE even though FCoE does not share all the OSI layers of fibre channel. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
OK, one major issue with this patch and a few minor nits. First, the major issue is that I don't see anything in the patch that changes the code in ehca_mem_notifier() in ehca_main.c: case MEM_GOING_ONLINE: case MEM_GOING_OFFLINE: /* only ok if no hca is attached to the lpar */ spin_lock_irqsave(shca_list_lock, flags); if (list_empty(shca_list)) { spin_unlock_irqrestore(shca_list_lock, flags); return NOTIFY_OK; } else { spin_unlock_irqrestore(shca_list_lock, flags); if (printk_timed_ratelimit(ehca_dmem_warn_time, 30 * 1000)) ehca_gen_err(DMEM operations are not allowed in conjunction with eHCA); return NOTIFY_BAD; } But your patch description says: This patch implements toleration of dynamic memory operations But it seems you're still going to hit the same NOTIFY_BAD case above after your patch. So something doesn't compute for me. Could you explain more? Second, a nit: +#define EHCA_REG_MR 0 +#define EHCA_REG_BUSMAP_MR (~0) and you pass these as the reg_busmap parm in: int ehca_reg_mr(struct ehca_shca *shca, struct ehca_mr *e_mr, u64 *iova_start, @@ -991,7 +1031,8 @@ struct ehca_pd *e_pd, struct ehca_mr_pginfo *pginfo, u32 *lkey, /*OUT*/ -u32 *rkey) /*OUT*/ +u32 *rkey, /*OUT*/ +int reg_busmap) and test it as: +if (reg_busmap) +ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo); +else +ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); So the ~0 for true looks a bit odd. One option would be to make reg_busmap a bool, since that's how you're using it, but then you lose the nice self-documenting macro where you call things. So I think it would be cleaner to do something like enum ehca_reg_type { EHCA_REG_MR, EHCA_REG_BUSMAP_MR }; and make the int reg_busmap parameter into enum ehca_reg_type reg_type and have the code become + if (reg_type == EHCA_REG_BUSMAP_MR) + ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo); + else if (reg_type == EHCA_REG_MR) + ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); + else + ret = -EINVAL or something like that. +struct ib_dma_mapping_ops ehca_dma_mapping_ops = { +.mapping_error = ehca_dma_mapping_error, +.map_single = ehca_dma_map_single, +.unmap_single = ehca_dma_unmap_single, +.map_page = ehca_dma_map_page, +.unmap_page = ehca_dma_unmap_page, +.map_sg = ehca_dma_map_sg, +.unmap_sg = ehca_dma_unmap_sg, +.dma_address = ehca_dma_address, +.dma_len = ehca_dma_len, +.sync_single_for_cpu = ehca_dma_sync_single_for_cpu, +.sync_single_for_device = ehca_dma_sync_single_for_device, +.alloc_coherent = ehca_dma_alloc_coherent, +.free_coherent = ehca_dma_free_coherent, +}; I always think structures like this are easier to read if you align the '=' signs. But no big deal. +ret = ehca_create_busmap(); +if (ret) { +ehca_gen_err(Cannot create busmap.); +goto module_init2; +} + ret = ibmebus_register_driver(ehca_driver); if (ret) { ehca_gen_err(Cannot register eHCA device driver); ret = -EINVAL; -goto module_init2; +goto module_init3; } ret = register_memory_notifier(ehca_mem_nb); if (ret) { ehca_gen_err(Failed registering memory add/remove notifier); -goto module_init3; +goto module_init4; Having to renumber unrelated things is when something changes is why I don't like this style of error path labels. But I think it's well and truly too late to fix that in ehca. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] IB/ehca: Remove superfluous bitmasks from QP control block
looks fine, applied for 2.6.31 ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH 2/2] ib_mthca: Use module parameter for number of MTTs per segment
Sigh... unfortunate to add a tunable that people have to mess with rather than just making things work automatically somehow. Anyway, applied these. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc
did you have a chance to take a look at the patchset and will you apply it, or are there any outstanding issues we need to address? I guess it's OK, but definitely 2.6.31 material. I guess I'll stick it linux-next soon. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc
thanks, applied. From: Anton Blanchard antonb at au1.ibm.com Signed-off-by: Stefan Roscher stefan.roscher at de.ibm.com please use '@' signs so these are real email addresses. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH 3/3] IB/ehca: Increment version number
thanks, applied 2 3. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH 0/3] IB/ehca: Perfomance improvment for creation of queue pairs
Because of this fundamental code change we will also increment the version number. So this is 2.6.31-targeted stuff I assume? (Too late in the 2.6.30 cycle for fundamental code change of course) - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc
+queue-queue_pages = kmalloc(nr_of_pages * sizeof(void *), GFP_KERNEL); How big might this buffer be? Any chance of allocation failure due to memory fragmentation? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [ofa-general] [PATCH] ib_mad: Fix RMPP header RRespTime manipulation
thanks, applied. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem
I see... you're right - no need to stick the address into struct ib_umem. Following this email is a new patch for mlx4_ib only. I excluded support for both powerpc and ia64 since I could not find a way to get HPAGE_SIZE (or HPAGE_SHIFT) for them. #include asm/page.h ? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem
It does not help. The problem with powerpc is that HPAGE_SHIFT is an unexported variable and for ia64 it's hpage_shift. I see. hpage_shift is exported on ia64, so that should be OK. And I guess for powerpc it is just a matter of adding an export so we can use it in a module. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem
add address field to struct ib_umem so low level drivers will have this information which may be needed in order to correctly calculate the number of huge pages. seems like a really strange thing to do: + umem-address = addr; this value addr is coming from the low-level driver, so I'm not clear why we need to stick it into umem? What am I missing? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH v2] mlx4_ib: Optimize hugetlab pages support
+n = PAGE_ALIGN(umem-length + (umem-address ~HPAGE_MASK)) HPAGE_SHIFT; This is still wrong I think. What if the user, say, registers 1MB that is aligned exactly at a 2MB huge page start? Then wouldn't this expression set n to 0? I think that PAGE_ALIGN() needs to be a HPAGE_ALIGN(), although of course that doesn't actually exist, so we would need to do ALIGN(..., HPAGE_SIZE) instead... but given that we're just going to do HPAGE_SHIFT afterwards, maybe DIV_ROUND_UP(..., HPAGE_SIZE) is a better way to write it. It maybe is needed to make sure that the virtual address requested is aligned appropriately for using bigger pages (although this should always be the case in the current code I guess, since the userspace virtual address always matches the HCA MR virtual address). - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem
It is has to be saved either at the low level driver's mr object, e.g. struct mlx4_ib_mr, or at a common place like struct ib_umem. Do you prefer that it will be saved in struct mlx4_ib_mr? I don't see why it has to be saved anywhere? The only place you use umem-address is in handle_hugetlb_usermr(), and you could just as easily pass in start directly as a parameter (since mlx4_ib_reg_user_mr() has that value in a parameter anyway). - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH v1] mlx4_ib: Optimize hugetlab pages support
OK, looks better. However the patch had a bunch of whitespace problems (run checkpatch.pl to see them). Also: +static int handle_hugetlb_user_mr(struct ib_pd *pd, struct mlx4_ib_mr *mr, + u64 virt_addr, int access_flags) +{ +#ifdef CONFIG_HUGETLB_PAGE +struct mlx4_ib_dev *dev = to_mdev(pd-device); +struct ib_umem_chunk *chunk; +unsigned dsize; +dma_addr_t daddr; +unsigned uninitialized_var(cur_size); +dma_addr_t uninitialized_var(cur_addr); +int restart; +int n; +struct ib_umem *umem = mr-umem; +u64 *arr; +int err = 0; +int i; +int j = 0; + +n = PAGE_ALIGN(umem-length + umem-offset) HPAGE_SHIFT; seems this might underestimate by 1 if the region doesn't start/end on a huge-page aligned boundary (but we would still want to use big pages to register it). +arr = kmalloc(n * sizeof *arr, GFP_KERNEL); +if (!arr) +return -ENOMEM; + +restart = 1; +list_for_each_entry(chunk, umem-chunk_list, list) +for (i = 0; i chunk-nmap; ++i) { +daddr = sg_dma_address(chunk-page_list[i]); +dsize = sg_dma_len(chunk-page_list[i]); +if (restart) { +cur_addr = daddr; +cur_size = dsize; +restart = 0; +} else if (cur_addr + cur_size != daddr) { +err = -EINVAL; +goto out; +} else +cur_size += dsize; + +if (cur_size HPAGE_SIZE) { +err = -EINVAL; +goto out; +} else if (cur_size == HPAGE_SIZE) { +restart = 1; +arr[j++] = cur_addr; +} +} I think we could avoid the uninitialized_var() stuff and having restart at all by just doing cur_size = 0 at the start of the loop, and then instead of if (restart) just test if cur_size is 0. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] mlx4_ib: Optimize hugetlab pages support
Since Linux does not merge adjacent pages into a single scatter entry through calls to dma_map_sg(), we do this for huge pages which are guaranteed to be comprised of adjacent natural pages of size PAGE_SIZE. This will result in a significantly lower number of MTT segments used for registering hugetlb memory regions. Actually on some platforms, adjacent pages will be put into a single sg entry (eg look at VMERGE under arch/powerpc). So your mlx4_ib_umem_write_huge_mtt() function won't work in such cases. In any case relying on such a fragile implementation detail with no checking is not a good idea for future maintainability. So this needs to be reimplemented to work no matter how the DMA mapping layer presents the hugetlb region (and also fail gracefully if the bus addresses returned from DMA mapping don't work with the huge page size -- since it would be correct for the DMA mapping layer to map every PAGE_SIZE chunk to an arbitrary bus address, eg if the IOMMU address space becomes very fragmented) @@ -142,15 +176,34 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, n = ib_umem_page_count(mr-umem); shift = ilog2(mr-umem-page_size); - -err = mlx4_mr_alloc(dev-dev, to_mpd(pd)-pdn, virt_addr, length, +if (mr-umem-hugetlb) { +nhuge = ALIGN(n shift, HPAGE_SIZE) HPAGE_SHIFT; +shift_huge = HPAGE_SHIFT; +err = mlx4_mr_alloc(dev-dev, to_mpd(pd)-pdn, virt_addr, length, +convert_access(access_flags), nhuge, shift_huge, mr-mmr); +if (err) +goto err_umem; + +err = mlx4_ib_umem_write_huge_mtt(dev, mr-mmr.mtt, mr-umem, nhuge); +if (err) { +if (err != -EAGAIN) +goto err_mr; +else { +mlx4_mr_free(to_mdev(pd-device)-dev, mr-mmr); +goto regular_pages; +} +} +} else { +regular_pages: +err = mlx4_mr_alloc(dev-dev, to_mpd(pd)-pdn, virt_addr, length, convert_access(access_flags), n, shift, mr-mmr); -if (err) -goto err_umem; +if (err) +goto err_umem; -err = mlx4_ib_umem_write_mtt(dev, mr-mmr.mtt, mr-umem); -if (err) -goto err_mr; +err = mlx4_ib_umem_write_mtt(dev, mr-mmr.mtt, mr-umem); +if (err) +goto err_mr; +} err = mlx4_mr_enable(dev-dev, mr-mmr); if (err) Also I think this could be made much clearer by putting the huge page handling code in a separate function like mlx4_ib_reg_hugetlb_user_mr() and then you could just do something like if (!mr-umem-hugetlb || mlx4_ib_reg_hugetlb_user_mr(...)) { // usual code path } rather than having to use a goto into another block. - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: kernel panic while using ipoib with lro enabled
- I've added a print of the skb-data_len and skb-len sizes right before the BUG_ON line. skb-data_len was always 0, so I don't understand why there is a kernel panic. A race of two CPUs doing an skb_pull or something like that? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: kernel panic while using ipoib with lro enabled
By the way, I don't seem to be able to reproduce this easily in my setup. What kind of HCA/system are you using? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: kernel panic while using ipoib with lro enabled
We've discovered that sometimes we get a kernel panic while using ipoib with lro enabled. You can find all the details here: https://bugs.openfabrics.org/show_bug.cgi?id=1473 This bug is reproduced both in kernel 2.6.27 and OFED 1.4. I'll try to reproduce it myself, but in the meantime, can you reproduce this with 2.6.28 or even better latest upstream Linus's git tree (without OFED)? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] IPoIB: fix race when manipulating mcast SKBs queue
ipoib_mcast_free() dequeues SKBs pending on the pkt_queue but needs to do that with netif_tx_lock_bh() acquired. I don't see why this would be required. When ipoib_mcast_free() runs, the mcast structure has been removed from all lists and I don't see how any other context could simultaneously be adding packets to pkt_queue. What is the race that you think this fixes? - R. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] IB/ehca: Change misleading error message
The error message printed when the eHCA driver prevents memory hotplug is misleading -- the user might think that hot-removing the lhca, hotplugging memory, then hot-adding the lhca again will work, but it doesn't. That's too bad... I applied this patch but out of curiousity, why doesn't the hot-remove/hot-add work? I would have thought that re-registering all of memory after the hot-add would do the right thing. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Re: [PATCH] IB/ehca: Fix locking for shca_list_lock
Looks good... I'll add this for 2.6.29, since as far as I can tell this bug has been there approximately forever already. ___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg