Re: [ewg] Mellanox target workaround in SRP

2011-01-10 Thread Roland Dreier
  I think that the patch is specific for srp initiator using Mellanox
  FMR. It tried to avoid indirect desc with Mellanox FMR having
  first-byte-offset != 0.  Since the low level implementation of
  mlx4/mthca_map_phys_fmr() did not create + setup MPT for FMR with
  first_byte_offset != 0. The corruption can happen with any target.

I don't think this could be right -- right now the workaround only
triggers if the target has a Mellanox OUI, so if what you say is true,
presumably everyone who is using the SRP initiator with mlx4 would be
seeing this problem.

Also, the SRP initiator code that uses ib_fmr_pool_map_phys does not
pass in any non-aligned addresses -- it doesn't try to use any first
byte offset, it just uses the virtual address it passes to the target to
handle the offset.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Mellanox target workaround in SRP

2011-01-10 Thread Roland Dreier
Maybe we can use MST's current email to ask him... Michael, do you have
any memory of the issue we worked around here?

  I have question regarding workaround introduced in commit 559ce8f1 of
  the mainline tree:
  
  IB/srp: Work around data corruption bug on Mellanox targets
  
  Data corruption has been seen with Mellanox SRP targets when FMRs
  create a memory region with I/O virtual address != 0.  Add a
  workaround that disables FMR merging for Mellanox targets (OUI 0002c9).
  
  I don't see how this can make a difference to the target -- it sees an
  address and length, and there should be no visible difference to it when
  it gets an FMR versus a direct-mapped region of the same space, right?
  And how is it different than getting a direct or indirect descriptor
  with a similar offset?
  
  I could see there being a bug on the initiator HCA not liking such FMR
  mappings, but then it should be keyed off of the vendor of our HCA and
  not the target.
  
  I'm sure this was tested and shown to fix the problem; I'm just confused
  as to what the problem really was and if this is still relevant. Can
  someone please enlighten me?
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Mellanox target workaround in SRP

2011-01-07 Thread Roland Dreier
  I'm sure this was tested and shown to fix the problem; I'm just confused
  as to what the problem really was and if this is still relevant. Can
  someone please enlighten me?

At this point I'm afraid it's all lost in the mists of time, but the
original patch seems to have come from

http://lists.openfabrics.org/pipermail/general/2006-July/024322.html

looking at the patch, I would guess that the corruption occurred when
the target got an IO request that started at a non-page-aligned address
but that spanned more than one page.

I don't know if the target was ever fixed, or whether that target code
has any relevance today.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] IPoIB to Ethernet routing performance

2010-12-17 Thread Roland Dreier
This may be due to the fact that the IB MTU is 2048. Every 1500 bytes 
  packet
  is padded to 2048 bytes before being sent through the wire, so you're loosing
  roughly 25% bandwidth compared to an IPoIB MTU which is a multiple of 2048.

This isn't true.  IB packets are only padded to a multiple of 4 bytes.

However there's no point in using IPoIB connected mode to pass packets
smaller than the IB MTU -- you might as well use datagram mode.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg



Re: [ewg] [PATCH 00/10] Add fcoe, fcoib drivers for mlx4 device

2010-08-09 Thread Roland Dreier
Unfortunately it will probably be a while before I can look at this in
any detail, but here are some obvious things to fix:

 - The actual FC/SCSI driver will need to be reviewed on
   linux-s...@vger.kernel.org of course; so please CC that list.

 - The ordering/split of patches seems off; in particular I notice that
   the patch that adds the new drivers into the build system comes
   before the patch that actually adds the new drivers, which is
   obviously wrong (since the kernel will be broken if we apply patch
   9/10 but not 10/10).

 - None of the patch subjects contain a subject that says what part of
   the kernel they touch.  For example, instead of

[PATCH 01/10]  pre-reserve MPTs for FC

   please write:

[PATCH 01/10] mlx4_core: Pre-reserve MPTs for FC

   otherwise I waste time having to fix it up by hand.  (And also if you
   want to do things exactly write, please capitalize the subject but do
   not put in a final '.'  But I usually do not complain about that
   level of detail)

 - Every patch seems to have the changelog and diffstat twice.  Please
   fix (and include a --- between the changelog and diffstat).  If you
   prepare a clean git tree, then git format-patch will get everything
   exactly correct.

 - There seem to be a lot of whitespace and other formatting problems.
   Please run patches through checkpatch.pl and fix any real problems
   that it finds (0 warnings are not necessary since some warnings are
   false positives; but I suspect for these patches most won't be).
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs

2010-07-06 Thread Roland Dreier
  We thought about this too, but in some special cases, we do not know the
  correct page size of a memory range. For example when getting a 16M chunk
  from a 16M huge page region which is also aligned to 16M, the first madvise()
  will work fine and the code will assume that the page size is 64K.

I see ... yes, that does break my idea completely.

OK, another half-baked idea: what if we pay attention to when
madvise(DOFORK) fails as well as well madvise(DONTFORK) fails, and use
that as a hit that we better check the page size?

Perhaps this adds too much complexity ... in which case your idea:

  As this issue was not present in version 2 of the code, but there we had
  a big performance penalty, I suggest the following: we could go back to
  version 2 and introduce a new RDMAV_HUGEPAGE_SAFE env variable to let the 
  user
  decide between huge page support and better performance (the same approach we
  use for the COW protection itself).

seems like a reasonable alternative.

Thanks,
  Roland
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs

2010-07-03 Thread Roland Dreier
   When registering two memory regions A and B from within
  the same huge page, we will end up with one node in the tree which covers the
  whole huge page after registering A. When the second MR is registered, a node
  is created with the MR size rounded to the system page size (as there is no
  need to call madvise(), it is not noticed that MR B is part of a huge page).
  
  Now if MR A is deregistered before MR B, I see that the tree containing
  mem_nodes is empty afterwards, which causes problems for the deregistration 
  of
  MR B, leaving the tree in a corrupted state with negative refcounts. This 
  also
  breaks later registrations of other memory regions within this huge page.

Good thing I didn't get around to applying the patch yet ;)

I haven't thought this through fully, but it seems that maybe we could
extend the madvise tracking tree to keep track of the page size used for
each node in the tree.  Then for the registration of MR B above, we
would find the node for MR A covered MR B and we should be able to get
the ref counting right.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH v4] IB Core: RAW ETH support

2010-06-23 Thread Roland Dreier
  There is no qp type IBV_QPT_RAW_ETY in user space (at least not in the 
  definitions
  coming with libibverbs). In fact, libibverbs that comes with OFED defines 
  (in verbs.h)
  a qp type called IBV_QPT_RAW_ETT which equals to 7.
  The patch that is under discussion here adds a new qp type IB_QPT_RAW_ETH 
  and equals it to 7
  to match the definition in user space. This indeed changes the value of 
  IB_QPT_RAW_ETY to 8
  but I don't see who can be affected since
  1. No user space program that uses IB_QPT_RAW_ETY exists
  2. kernel is compiled as one piece of code.

Why renumber the _ETY enum?  Maybe it doesn't break anything serious but
why risk it?
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH v4] IB Core: RAW ETH support

2010-06-15 Thread Roland Dreier
  I tested it before on the Roland tree ( iboe branch ) and it fails,
  because it writen in the way suitable for OFED. If adapt the patch to
  the Roland tree, then appling Mellanox OFED patches will fail, because
  it changes the same functions in the code.
  Here is one example:
  Look at __mlx4_ib_modify_qp at the Roland tree - there is no RAW_ETY
  support. But in the OFED version of the same function this support is
  present.
  RAW_ETH patch modify this function and looking for RAW_ETY word and
  without this RAW_ETH Mellanox patch will fail.

Don't take this too personally -- I picked a semi-random email in this
thread to reply to; this is pretty broadly targeted.

rant

What the hell is the thinking behind introducing IB_QPT_RAW_ETH?  You're
inserting an enum value before IB_QPT_RAW_ETY, so any old userspace
passing in IB_QPT_RAW_ETY will silently get different behavior depending
on the kernel version.  And you're creating two constands that differ in
a single letter (IB_QPT_RAW_ETY vs. IB_QPT_RAW_ETH).  How are you going
to explain that to users?  How is anyone ever going to get it right?
For that matter, what exactly does IB_QPT_RAW_ETH mean?

This all seems to be a symptom of how broken our development process
is.  Yes, unfortunately I can't spend as much time reviewing and
applying patches as I might like, and I apologize for that.  But if we
have all the RDMA developers piling up shit in their little area and
then sending it on to be merged as soon as it kind of works, without
thinking about design or maintainability and without ever doing any
review, then I'm always going to have an expanding review backlog.

And then we have OFED compounding problems -- Oh that's a nice pile of
shit you've built there.  We better ship it to users while it's still
steaming.  How about if OFED developers take a little time to think
things through?

/rant

In other words, can someone explain the plan for this raw QP stuff to me?

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs

2010-06-09 Thread Roland Dreier
Thanks, nice work.  I like this approach.  Alex (Vainman) any comments
on this?

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] Handling busy responses from the SA

2010-06-04 Thread Roland Dreier
  The purpose of this patch is to cause the ib_mad driver to discard
  busy responses from the SA, effectively causing busy responses to
  become time outs.

I don't have a strong opinion on this but it seems a bit odd.  If we're
just going to drop the response anyway, why did the SA send it in the
first place?  On the other hand, if the SA told us it's busy, it does
seem we could do something more sensible than retrying immediately.

Any opinions from anyone who worked on fabric scalability?

  +printk(KERN_NOTICE PFX Response returned with 
  MAD_STATUS_BUSY\n);

Do we want to spam kernel logs with this?  Seems it could generate a lot
of messages.

  +#define IB_MGMT_MAD_STATUS_SUCCESS  
  0x
  +#define IB_MGMT_MAD_STATUS_BUSY 
  0x0001
  +#define IB_MGMT_MAD_STATUS_REDIRECT_REQD0x0002
  +#define IB_MGMT_MAD_STATUS_BAD_VERERSION0x0004  
  +#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD   0x0008  
  +#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD_ATTRIB0x000c
  +#define IB_MGMT_MAD_STATUS_INVALID_ATTRIB_VALUE 0x001c

The indentation of values seems pretty crazy here.  Also I'm not sure
what most of these defines are for?  They seem unused in this patch.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Allowing ib dignostics to be run without being logged in as root.

2010-06-02 Thread Roland Dreier
   $ cat /etc/udev/rules.d/80-ib-umad.rules
   KERNEL==umad*, NAME=infiniband/%k, MODE=0666

  It is not the same. Your propose to expose /dev/infiniband/umad device
  access to all world, which is obviously even more dangerous than SUIDing
  diagnostic programs.

Well, different threats.  Making umad files world-writable means anyone
can inject whatever MADs they want to into the fabric.  On the other
hand, if an arbitrary code execution security hole is found in a
diagnostic program, then having it SUID root means the hole becomes a
local root exploit.  It's hard to assess which is really more dangerous.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs

2010-06-02 Thread Roland Dreier
  without patch:
  1M memory region120usec
  16M memory region  1970usec 
  
  with patch v2:
  1M memory region172usec
  16M memory region  2030usec

So if I read this correctly this patch introduces almost a 50% overhead
in the 1M case... and probably much worse (as a fraction) in say the 64K
or 4K case.  I wonder if that's acceptable?

Alex's original approach was to try the memadvise with normal page size
and then fall back to huge page size if that failed.  But of course that
wastes some time on the failed madvise in the hugepage case.

I think it would be interesting to compare timings for registering, say,
4K, 64K, 1M and 16M regions with and without huge page backing, for
three possibilities:

 - unpatched libibverbs (will obviously fail on hugepage backed regions)
 - patched with this v2
 - alternate patch that tries madvise and only does your /proc/pid/smaps
 - parsing if the first madvise fails.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Question: When should patches be submitted to EWG and when should they be submitted to linux-rdma?

2010-05-26 Thread Roland Dreier
  The subject says it all. If I have a patch that can be applied
  against either the current OFED git repository or against the
  upstream kernel - where do I post it?

What do you want to happen to the patch?  If you want it applied to the
upstream kernel, then send it to me and linux-rdma.  If you want it
applied to an OFED tree, send it to ewg.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 05/11] ib_core: IBoE UD packet packing support

2010-05-17 Thread Roland Dreier
  I thought you were referring to the changes made by this patch
  920d706.  Should I re-send this patch?

Yes, please work out what changes are still required and send a new patch.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 04/11] ib_core: IBoE CMA device binding

2010-05-14 Thread Roland Dreier
   I'm having a hard time working out why the iboe case needs to schedule
   to a work queue here since its already in process context, right?  It
   seems it would be really preferable to avoid all the extra pointer
   munging and reference counting, and just call things directly.

  I assume that the caller might attempt to acquire the same lock when
  calling join and in the callback. Specifically, ucma_join_multicast()
  calls rdma_join_multicast() with file-mut acquired and
  ucma_event_handler() does the same.

I see... we can't call the consumer's callback directly since it might
have locking assumptions.

It would be nice if we didn't have this reference counting sometimes
used and sometimes not used.  I'll have to think about whether this can
be made cleaner.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] mlx4_en: ping does not resume after failover

2010-05-13 Thread Roland Dreier
Also keep in mind that mlx4_en patches typically go through Davem and
netdev, not my tree.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1

2010-05-13 Thread Roland Dreier
   Also I'm wondering why you did the count stuff to ignore IBoE-only
   devices in multicast.c but not sa_query.c?

  It seems to me the right place to put this logic as the mutlicast code
  registers as an IB client and the add_one implemntation is
  multicast.c.

Right, I'm not saying it shouldn't be in multicast.c.  I'm just
wondering why you don't have the same thing for sa_query.c.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1

2010-05-13 Thread Roland Dreier
  One reason is that get_src_path_mask() may access ah_lock.

I don't see that:

static u8 get_src_path_mask(struct ib_device *device, u8 port_num)
{
struct ib_sa_device *sa_dev;
struct ib_sa_port   *port;
unsigned long flags;
u8 src_path_mask;

sa_dev = ib_get_client_data(device, sa_client);
if (!sa_dev)
return 0x7f;

so if we never add the sa_client structure it just returns.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 05/11] ib_core: IBoE UD packet packing support

2010-05-13 Thread Roland Dreier
  Right, it's already applied.

Doesn't seem to be all there... eg there is nothing for ethernet headers.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace

2010-05-13 Thread Roland Dreier
  Basically, what I want to understand is why does this change make sense?
  
  @@ -1139,6 +1139,10 @@ struct ib_device {
 struct ib_grh *in_grh,
 struct ib_mad *in_mad,
 struct ib_mad *out_mad);
  +int(*get_eth_l2_addr)(struct ib_device *device,
  u8 port,
  +  union ib_gid *dgid, int
  sgid_idx,
  +  u8 *mac, u16 *vlan_id, u8
  *tagged);
  +

Yes, that was pretty much my original question.  Why do we have a verb
for userspace to call a device-specific method to do the mapping?  The
layering seems wrong somewhere if we have a generic verb to do this
mapping, but then put the mapping in device-specific code.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [RFC] libibverbs: ibv_fork_init() and libhugetlbfs

2010-05-12 Thread Roland Dreier
   * added get_huge_page_size() to read the huge page size from
 /proc/meminfo. This is done at ibv_fork_init() time.

That doesn't work on systems that have multiple huge page sizes (eg
powerpc).  You can compare the code to get the size in libhugetlbfs.

Also I think the munging through /proc/pid/maps doesn't really work.
First of all, essentially grepping for libhugetlbfs is not robust as I
mentioned -- this will break in the same way for apps using huge pages
directly.  And while it is nice to be able to tell if a range came from
libhugetlbfs, I think there may be some bad performance impact from
reading /proc/pid/maps line-by-line.  (And by the way, as a trivial
optimization, it would make sense to me to check the address of each map
before doing the strstr).

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1

2010-05-12 Thread Roland Dreier
  @@ -1017,9 +1020,12 @@ static void ib_sa_add_one(struct ib_device *device)
   sa_dev-end_port   = e;
   
   for (i = 0; i = e - s; ++i) {
  +spin_lock_init(sa_dev-port[i].ah_lock);
  +if (rdma_port_link_layer(device, i + 1) != 
  IB_LINK_LAYER_INFINIBAND)
  +continue;

Not sure I understand why you move the initialization of the spinlock up
here?  It seems we ignore everything that might have to do with spinlock
if this is an IBoE port.

But the larger issue is what if someone calls one of the ib_sa_XXX_query
functions on an IBoE port?  Seems we just crash on uninitialized
structures.  I guess we're assuming that the kernel is smart enough not
to do that?

Also I'm wondering why you did the count stuff to ignore IBoE-only
devices in multicast.c but not sa_query.c?

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 04/11] ib_core: IBoE CMA device binding

2010-05-12 Thread Roland Dreier
   int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
  - struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr)
  + struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr,
  + int force_grh)

Rather than this change in API, could we just have this function look at
the link layer, and if it's ethernet, then always set the GRH flag?
Seems simpler than requiring the upper layers to do this and then pass
the result in?
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 04/11] ib_core: IBoE CMA device binding

2010-05-12 Thread Roland Dreier
  +static void iboe_mcast_work_handler(struct work_struct *work)
  +{
  +struct iboe_mcast_work *mw = container_of(work, struct iboe_mcast_work, 
  work);
  +struct cma_multicast *mc = mw-mc;
  +struct ib_sa_multicast *m = mc-multicast.ib;
  +
  +mc-multicast.ib-context = mc;
  +cma_ib_mc_handler(0, m);
  +kref_put(mc-mcref, release_mc);
  +kfree(mw);
  +}

I'm having a hard time working out why the iboe case needs to schedule
to a work queue here since its already in process context, right?  It
seems it would be really preferable to avoid all the extra pointer
munging and reference counting, and just call things directly.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 05/11] ib_core: IBoE UD packet packing support

2010-05-12 Thread Roland Dreier
  2. Fix wrong implementation of ib_ud_header_init(). A different patch was 
  sent
 to Roland.

This patch no longer applies, I guess because you already sent me this
fix (which is now upstream since 2.6.34-rc1).
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 04/11] ib_core: IBoE CMA device binding

2010-05-12 Thread Roland Dreier
  Multicast GIDs are always mapped to multicast MACs
  as is done in IPv6. Some helper functions are added to ib_addr.h.  IPv4
  multicast is enabled by translating IPv4 multicast addresses to IPv6 
  multicast
  as described in
  http://www.mail-archive.com/i...@sunroof.eng.sun.com/msg02134.html.

I guess it's a bit unfortunate that the RoCE annex completely ignored
how to map multicast GIDs to ethernet addresses (I suppose as part of
the larger decision to ignore address resolution entirely).  Anyway,
looking at the email message you reference, it seems to be someone
asking what the right way to map IPv4 multicast addresses to IPv6
addresses is.  Is there a more definitive document you can point to?

It seems that unfortunately the way the layering of addresses is done,
there's no way to just use the standard mapping of IPv4 multicast
addresses to Ethernet addresses (since IBoE is does addressing via
the CMA mapping to GIDs followed by an unspecified mapping from GIDs to
Ethernet addresses).

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace

2010-05-12 Thread Roland Dreier
  Add ib_uverbs_get_eth_l2_addr() to allow ibv_create_ah() to resolve sgid,
  dgid to vlan, dmac for any gid type.  Although user-space might bypass 
  this
  call for link-local gids, it is better not to replicate the kernel resolution
  policy.  Port link layer is also returned by ibv_query_port().

A high-level comment/question, followed by some notes about the specific patch.

At the highest level, is having this very low-level command exposed as
part of the kernel uverbs - userspace API the right place to split
things?  Making the Ethernet address resolution part of the low-level
driver implies that it's not really a generic part of the verbs interface.

Maybe it is generic, and we should have a generic function instead of
calling into the low-level driver.  I see the argument that we should
keep the policy in the kernel, although I'm not sure how strong that
argument is -- once we start shipping a kernel with a certain policy
(and I guess OFED has in effect already done that), how could we ever
change that policy?  We'll have interoperability issues anyway, so it
seems having userspace and kernel use different policies doesn't cause
much further problems anyway.

Or maybe it is device-specific, and we could wrap it up into the create
AH uverbs call we already have?

Low-level comments:

  +ssize_t ib_uverbs_get_eth_l2_addr(struct ib_uverbs_file *file, const char 
  __user *buf,
  +  int in_len, int out_len)
  +{
  +struct ib_uverbs_get_eth_l2_addr   cmd;
  +struct ib_uverbs_get_eth_l2_addr_resp  resp;
  +int  ret;
  +struct ib_pd*pd;
  +
  +if (out_len  sizeof resp)
  +return -ENOSPC;
  +
  +if (copy_from_user(cmd, buf, sizeof cmd))
  +return -EFAULT;
  +
  +pd = idr_read_pd(cmd.pd_handle, file-ucontext);
  +if (!pd)
  +return -EINVAL;
  +
  +ret = ib_get_eth_l2_addr(pd-device, cmd.port, (union ib_gid *)cmd.gid,
  + cmd.sgid_idx, resp.mac, resp.vlan_id, 
  resp.tagged);
  +put_pd_read(pd);
  +if (!ret) {
  +if (copy_to_user((void __user *) (unsigned long) cmd.response,
  + resp, sizeof resp))
  +return -EFAULT;

This leaks kernel memory contents to userspace since the stack variable
resp is never cleared.  Also will cause problems if we ever need to use
the reserved fields for anything.

Also I'm not sure I understand why you pass the PD into this method?  It
seems you just use it to get a pointer to the device, but you already
have that from the uverbs_file structure that's passed into all commands.

  +int ib_get_eth_l2_addr(struct ib_device *device, u8 port, union ib_gid *gid,
  +   int sgid_idx, u8 *mac, __u16 *vlan_id, u8 *tagged)
  +{
  +if (!device-get_eth_l2_addr)
  +return -ENOSYS;
  +
  +return device-get_eth_l2_addr(device, port, gid, sgid_idx, mac, 
  vlan_id, tagged);
  +}
  +EXPORT_SYMBOL(ib_get_eth_l2_addr);

I don't think we need this wrapper, since uverbs can just call the
get_eth_l2_addr method directly (we already do that for quite a few
other methods, eg alloc_ucontext is a uverbs-only method that has no
in-kernel wrapper).  Also the -ENOSYS test isn't necessary, since a
device-driver shouldn't allow this method unless it actually implements
it.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 08/11] mlx4: Allow interfaces to correspond to each other

2010-05-12 Thread Roland Dreier
  +void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int 
  port)
  +{
  +return mlx4_find_get_prot_dev(dev, proto, port);
  +}
  +EXPORT_SYMBOL(mlx4_get_prot_dev);

Not sure I understand why you have a wrapper to call another function
with exactly the same parameters?  Can't we get rid of this and just
rename mlx4_find_get_prot_dev() to mlx4_get_prot_dev()?
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 09/11] mlx4: Add support for IBoE - address resolution

2010-05-12 Thread Roland Dreier
  +u8  mac_0_1[2];
  +u8  mac_2_5[4];

This looks a bit odd.  Any reason why you don't just have u8 mac[6];
in this structure?
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 06/11] ipoib: avoid ipoib over IBoE

2010-05-06 Thread Roland Dreier
OK, I applied this with just the first chunk.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [RFC] libibverbs: ibv_fork_init() and libhugetlbfs

2010-05-06 Thread Roland Dreier
  When fork support is enabled in libibverbs, madvise() is called for every
  memory page that is registered as a memory region. Memory ranges that
  are passed to madvise() must be page aligned and the size must be a
  multiple of the page size. libibverbs uses sysconf(_SC_PAGESIZE) to find
  out the system page size and rounds all ranges passed to reg_mr() according
  to this page size. When memory from libhugetlbfs is passed to reg_mr(), this
  does not work as the page size for this memory range might be different
  (e.g. 16Mb). So libibverbs would have to use the huge page size to
  calculate a page aligned range for madvise.

Yes, Alex Vainman reaised this same issue a while ago.

  The patch below demonstrates a possible solution for this. It parses the
  /proc/PID/maps file when registering a memory region and decides if the
  memory that is to be registered is part of a libhugetlbfs range or not. If 
  so,
  a page size of 16Mb is used to align the memory range passed to madvise().
  
  We see two problems with this: it is not a very elegant solution to parse the
  procfs file and the 16Mb are hardcoded currently. The latter point could be
  solved by calling gethugepagesize() from libhugetlbfs, which would add a new
  dependency to libibverbs.

I think that we cannot assume huge pages only come from libhugetlbfs --
we should support an application directly enabling huge pages (possibly
via another library too, so we can't assume that an application knows
the page size for a memory range it is about to register).

And also the 16 MB page size constant is of course not feasible -- with
all due respect, the x86 page size of 2 MB is much more likely in
practice :)  (Although perhaps the much slower PowerPC TLB refill makes
users more likely to try and use hugetlb pages ;)

Alex suggested parsing files in the same way as libhugetlbfs does to get
the page size, and that seems to be the best solution, since I don't
think the libhugetlbfs license is compatible with the BSD license for
libibverbs.

But your trick of using /proc/*/maps looks nice.  Does that only work
for libhugetlbfs or can we recognize direct mmap of hugetlb pages?

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] mlx4_core: request MSIX vectors as much as there CPU cores

2010-05-05 Thread Roland Dreier
  We found it in performance work of our EN (10G) driver

What does this have to do with performance?

Do you have a system where num_possible_cpus() differs significantly
from num_online_cpus()?  What kernel is that with?  As far as I can
tell, for x86 they should only be different if there genuinely are CPUs
that are available for hotplug.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 03/11] IB/umad: Enable support only for IB ports

2010-05-05 Thread Roland Dreier
Why do we not allow umad for IBoE ports?  I understand there's no QP0
but why can't userspace use QP1 just like for IB link layer ports?
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 01/11] ib core: Add link layer property to ports

2010-05-05 Thread Roland Dreier
Hi Eli,

I'm hoping to get this IBoE stuff in for 2.6.35.  I started an iboe
branch in my tree (similar to the xrc branch I've been carrying for a
while), and I added this patch in, except I renamed
rdma_port_link_layer() to rdma_port_get_link_layer().  This seems to
match rdma_node_get_transport() better.

In any case as I add patches to my branch, you can stop worrying about
them, which should make keeping this series updated easier.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 06/11] ipoib: avoid ipoib over IBoE

2010-05-05 Thread Roland Dreier
  @@ -1383,6 +1385,9 @@ static void ipoib_remove_one(struct ib_device *device)
   dev_list = ib_get_client_data(device, ipoib_client);
   
   list_for_each_entry_safe(priv, tmp, dev_list, list) {
  +if (rdma_port_link_layer(device, priv-port) != 
  IB_LINK_LAYER_INFINIBAND)
  +continue;

Why do we need this chunk here?  How could a netdev get on our list if
we never create IPoIB interfaces for IBoE ports?

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1

2010-05-05 Thread Roland Dreier
  @@ -795,11 +799,12 @@ static void mcast_add_one(struct ib_device *device)
   struct mcast_device *dev;
   struct mcast_port *port;
   int i;
  +int count = 0;
   
   if (rdma_node_get_transport(device-node_type) != RDMA_TRANSPORT_IB)
   return;
   
  -dev = kmalloc(sizeof *dev + device-phys_port_cnt * sizeof *port,
  +dev = kzalloc(sizeof *dev + device-phys_port_cnt * sizeof *port,

  @@ -1007,7 +1010,7 @@ static void ib_sa_add_one(struct ib_device *device)
   e = device-phys_port_cnt;
   }
   
  -sa_dev = kmalloc(sizeof *sa_dev +
  +sa_dev = kzalloc(sizeof *sa_dev +

Do you happen to remember why you needed these kmalloc - kzalloc conversions?
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Version of OFED in RHEL6?

2010-04-22 Thread Roland Dreier
  So what is the answer to the OP's original question though?  Probably
  this is not really a question for this list, but surely, somebody here
  knows...  Given the above statement about getting OFED from the upstream
  kernel rather than an OFA release, what does that mean in terms of which
  actual version of OFED will be in RHEL6?

I think this whole discussion is kind of ridiculous.  OFED is a
distribution of RDMA software, with the kernel as one of its main
upstream sources.  So you could ask what version of the kernel a given
OFED release is based on.  But getting OFED from the upstream kernel
is nonsense -- RHEL6 is getting RDMA drivers from the upstream kernel.

So asking what version of OFED will be in RHEL6 is a bit like asking
what version of SLES will be in RHEL6.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Version of OFED in RHEL6?

2010-04-22 Thread Roland Dreier
   I think this whole discussion is kind of ridiculous.

  I'm the OP. It might be naive but I don't think it's ridiculous,

Yes, I wasn't replying to your original question.  Just the continuing
discussion seemed silly -- once the answer RHEL6 won't contain OFED
came, continuing to ask, OK, but what version of OFED will it contain?
is ridiculous I think.

   OFED is a
   distribution of RDMA software, with the kernel as one of its main
   upstream sources.  So you could ask what version of the kernel a given
   OFED release is based on.  But getting OFED from the upstream kernel
   is nonsense -- RHEL6 is getting RDMA drivers from the upstream kernel.

  When I looked at an OFED 1.4 release, I saw a ton
  of stuff that didn't look like it had much to do
  with RDMA. For example, consider all the commands whose name
  begins with ib.

Sorry, I was using RDMA as a generic term to cover both iWARP and
InfiniBand.  It's true that OFED contains a lot of stuff not strictly
having to do with remote direct memory access.

  The reason I asked this in the first place is that
  the cluster software package I use, Rocks, needs help
  in how it supports IB. I was trying to do something
  about this. One approach would be to just use whatever
  comes with RHEL, but the talk mentioned below doesn't
  say anything about what version of OFED this will be.

  Getting up to speed with IB sure is difficult.

Yes, this shows all the unfortunate confusion caused by trying to
brand OFED as something special.  What version of, say, the IP stack
does RHEL contain?  You don't care, right?  In general, RHEL is going to
contain version XYZ of the kernel, plus patches A, B, C, along with
version UVW of libfoo and version MNO of libbar, etc.  The
InfiniBand/RDMA world shouldn't be special -- the rest of the world
deals with that just fine, without having to tie RHEL's versioning to
someone else's distribution.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] retry_cnt, rnr_retry and ibv_get_cq_event/ibv_req_notify_cq

2010-04-19 Thread Roland Dreier
  If retry_cnt and rnr_retry are set to 7 (toggle the RETRY_COUNT
  #define at the top of the example code) as lots of documentation and
  examples suggest, I never get any events in my CQ. Setting it to
  another value, e.g. 6 causes the appropriate WCs to be generated.

an rnr_retry of 7 means unlimited RNR retries -- ie if no receive is
available at the receiver, then the sender will continue trying forever.
For full details, you can look at the RNR NAK mechanism in the IB spec.

  Furthermore, I was wondering if there is a way to use
  ibv_get_cq_event/ibv_req_notify_cq to do a non-blocking poll for CQ
  events? I have the option in the attached code, but I've found that I
  miss some CQ events, for example, when posting 100 sends, I will only
  get back about 3--5 WCs containing error statuses. If I poll, I get
  all the WCs.

There seems to be some confusion here between CQ events (which are
single-shot notifications generated when a completion is enqueued into
an armed CQ -- if another completion is added before notification is
requested again, then another event is not generated).

So for 100 work requests, you may not always get 100 CQ events, but you
should always get 100 work completions enqueued on the CQ eventually.

In any case, it is certainly fine to do poll(), epoll, SIGIO, whatever
on the completion channel file descriptor.  You can do fcntl to set it
to nonblocking mode too.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue

2010-03-11 Thread Roland Dreier
good debugging, applied thanks.

I do worry (as Moni mentioned) that this doesn't explain why you would
get send failures in this case, but the patch itself is well-explained
and looks obviously correct so I think we should apply it.
-- 
Roland Dreier  rola...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue

2010-03-11 Thread Roland Dreier
  Sorry, I was referring to my patch not Eli's.

Heh, I never would have said anything about your patch was obvious.
I skimmed yours once but I do want to read it more carefully.

Did you ever say what test case you are using to provoke the problem you're 
fixing?
-- 
Roland Dreier  rola...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] ib/ehca: fix in_wc handling in process_mad()

2010-02-19 Thread Roland Dreier
thanks, applied.
-- 
Roland Dreier  rola...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] [for-2.6.33] rdma/cm: revert associating an RDMA device when binding to loopback

2010-02-10 Thread Roland Dreier
OK, I'm planning on sending this upstream later today.  Looks very small
and simple, and then we can figure our what if anything we want to do
for 2.6.34.

Make sense for everyone?

 - R.
-- 
Roland Dreier  rola...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] bug 1918 - openmpi broken due to rdma-cm changes

2010-02-05 Thread Roland Dreier
   That should be the patch in question.  I'm not sure about reaching 
   consensus. :)
   If the other changes to the rdma_cm aren't closely tied to that change, we 
   may
   be able to back that one patch out until we can get whatever other fix may 
   be
   needed.

  I'd like to do this approach.  Then re-submit once we come to consensus...

That makes sense to me.  Someone please send me a tested revert.
-- 
Roland Dreier rola...@cisco.com
Cisco.com - http://www.cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] bug 1918 - openmpi broken due to rdma-cm changes

2010-02-04 Thread Roland Dreier
  Is this only an iwarp issue?  IE do all IB devices support hw
  loopback?  And will all future devices support it (IE is it an IBTA
  requirement)?

I do think IBA requires loopback to work.  Can't quote chapter  verse
off the top of my head.
-- 
Roland Dreier rola...@cisco.com
Cisco.com - http://www.cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] EWG/OFED meeting minutes for Jan 25 2010

2010-01-25 Thread Roland Dreier

  - Add - uMMU notifier kernel module - In general we wish to add it,
  Tziporet will try to find someone that can do it soon

Please do *NOT* add ummunotify to OFED.  The upstream kernel has
explicitly rejected it (at least for the moment).  It adds a huge
maintenance and compatibility headache if OFED ships ummunotify and then
upstream takes an incompatible version.

Effort would be much better spent getting ummunotify or something like
it into shape where upstream will merge it.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Does anybody have experience using Mellanox ConnectX cards with Ubuntu?

2010-01-15 Thread Roland Dreier

  I haven't installed any drivers on this system as my understanding was
  that they are already included in the newer kernels, and although I
  don't have a lot of linux knowledge, it does appear that mlx4_core is
  running:

You also need to load mlx4_ib, either modprobe mlx4_ib or add a line
mlx4_ib to your /etc/modules.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Is this the right place for questions on OFED?

2010-01-11 Thread Roland Dreier

  I'm trying to install OFED on Ubuntu 9.10 without much success.  OFED
  1.5 is failing with an error:

Why do you want to install OFED?  Ubuntu 9.10 already has pretty
up-to-date IB/RDMA support included natively (eg aptitude install
libibverbs1 etc).  What are you missing?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [PATCH] rdmaoe/libibverbs: handle binary compatibility

2009-12-10 Thread Roland Dreier

#define IB_USER_VERBS_MIN_ABI_VERSION  1
   -#define IB_USER_VERBS_MAX_ABI_VERSION  6
   +#define IB_USER_VERBS_MAX_ABI_VERSION  7
   
  Whats this about? That seems like it needs a much bigger review,
  changing the kernel ABI version instantly breaks every existing
  libibverbs, shouldn't be done without alot of discussion!!

Yes, I've been meaning to bring this up earlier.

There was a time, long ago, when changing this ABI was maybe OK.  I
think even when I first designed the uverbs ABI, having this version
instead of capability bits was a goof.  But I don't have a time machine.

In any case I think at this point we need to stick with ABI version 6
will full backwards compat unless we really really can't.  And I don't
think the changes here are nearly drastic enough that we can't figure
out a way forward without breaking things.

  Why do you have IBV_LINK_LAYER_UNSPECIFIED ? That seems pointless. Why
  should existing kernels return UNSPECIFIED when we know they are
  always IB. I'd say drop IBV_LINK_LAYER_UNSPECIFIED and set
  IBV_LINK_LAYER_INFINIBAND to 0.

I prefer having the UNSPECIFIED.  Not a strong preference but my
reasoning is that if you have an old kernel that doesn't answer
anything, you're better off letting the app see that.  And there's no
reason why iWARP has to run over ethernet in principle.

Maybe I'm wrong but I don't like setting don't know magically to IB
behind the scenes.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] mlx4: Fix bug in mlx4_ib_mcg_attach

2009-12-09 Thread Roland Dreier
This bug doesn't seem to ever have been present in the upstream
kernel -- what are you generating this patch against?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] IPoIB: synchronize timer deletion with completion handler

2009-12-09 Thread Roland Dreier

  When calling del_timer_sync on priv-poll_timer, it is necessary to prevent
  farther arming of the timer which can be done by a completion handler. Though
  it is harmless since the timer will eventually stop being rearmed, it is 
  better
  practice to avoid calling the timer function after it is deleted. This patch
  handles this by using a new flag that is checked before arming the timer.

have you seen this in practice?  If it can happen then it's not
harmless, since the module could be unloaded with the timer pending.
However I don't see how it could happen, since we only seem to delete
the timer after we know that no more completions are coming (except for
the case where we decide that the hardware is wedged but it really only
takes a *long* time to respond at exactly the wrong time, and we somehow
get a completion between the del_timer_sync and the modify QP to reset
state -- which is so unlikely it seems not worth adding this extra
complexity for -- maybe we could add the del_timer_sync to after we
delete the CQ or something if you're really worried)
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] mlx4: Fix bug in mlx4_ib_mcg_attach

2009-12-09 Thread Roland Dreier

   This bug doesn't seem to ever have been present in the upstream
   kernel -- what are you generating this patch against?

  I think it came from your for-next branch.

I don't see anything touching this code there.  The patch that
introduced this code upstream, 521e575b (IB/mlx4: Add support for
blocking multicast loopback packets) doesn't have this bug and I don't
see anything else that changed that area of the code.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [ofw] SC'09 BOF - Meeting notes

2009-11-29 Thread Roland Dreier

  RFC 4291, Appendix A.

Thanks for the pointer.  As far as I can tell from reading some IPv6
stuff, it really is broken to try to go from a link-local IPv6 address
back to a L2 ethernet address.  For example, RFC 2464 (pointed to by RFC
4291) says:

Ethernet Address
   The 48 bit Ethernet IEEE 802 address, in canonical bit
   order.  This is the address the interface currently
   responds to, and may be different from the built-in
   address used to derive the Interface Identifier.

It really seems to be setting ourselves up for trouble not to use
neighbor discovery to map IPv6 addresses to link-layer addresses.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [ofw] SC'09 BOF - Meeting notes

2009-11-23 Thread Roland Dreier

  In any case, this is not a correctness issue that prohibits
  experimentation with rdmaoe multicast on any network today.

I agree -- nothing prevents experimentation.  I am just leery about
making invasive changes to the core stack in the absence of any
documented design for IBoE (that I've seen at least).

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [ofw] SC'09 BOF - Meeting notes

2009-11-19 Thread Roland Dreier

  How can 1500 lines out of 240k lines be a big change.. do I have these
  numbers right - is the
  big change you are referring too?

If there are significant changes to the core APIs -- and IBoE has
exactly this impact -- then yes it can be a big change even if the line
count is small.

  What is the risk area that you are worried about .. do you think it
  will break current
  transports or existing ULPs ?

I am worried that no one has thought through all the issues and corner
cases around address resolution, multicast, etc, and that when we do get
a standardized version of IBoE, we'll have to break core APIs yet again.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] RE: SC'09 BOF - Meeting notes and Final Slides

2009-11-19 Thread Roland Dreier

  WinOF has asynchronous interfaces for modify QP, and limited testing has 
  shown
  that it can improve connection times.  QP transitions are probably the second
  largest component of connection setup after the SA.  Since the RDMA CM 
  already
  provides an asynchronous interface, even existing apps may be able to be 
  able to
  gain some benefit from this.

can't anyone get async modify QP today on any platform by just doing the
operation in another thread (or thread pool)?  It seems that the
operations themselves are heavy enough that thread dispatch, locking etc
is going to be significant overhead.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [ofw] SC'09 BOF - Meeting notes

2009-11-19 Thread Roland Dreier

  Having lots of testing exposure can help in validating that all the
  edge cases are handled..

To some extent -- but there also needs to be some thinking involved to
make sure that the interface can actually handle future cases.

  Are there a set of cases that you have in mind ?

For example -- how is multicast going to interact with IGMP on ethernet
switches?  How is address resolution going to be done (current patches
seem to assume that stateless IPv6 link-local addresses contain the
ethernet address, which is not valid if RFC 3041 is used)?  etc

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] mlx4: remove limitation on LSO header size

2009-10-12 Thread Roland Dreier

 +   *blh = unlikely(halign  64) ? 1 : 0;

   This idiom of (boolean condition) ? 1 : 0 looks odd to me... doesn't
   (halign  64) already evaluate to 1 or 0 anyway?  Does the unlikely()
   actually affect code generation here?

  True, (halign  64) is the same and is cleaner. As for the unlikely()
  -- well it's already been there and besides, we're never sure if it
  will improve anything so the same question could be asked for other
  places in the code.

I was just wondering in this case where you are just assigning the
boolean value of the expression to a variable how unlikely affects
things.  I can understand for conditional jumps how the compiler can
choose to make the likely case more efficient, but when there are no
jumps then I was just curious how the hint could help.

   I assume this initialization is to avoid a compiler warning.  But the
   code is actually correct without initializing blh -- so I think that we
   save a tiny bit of code by doing uninitialized_var() instead?

  We must initialize blh since it is used for any send request and not
  just LSO opcodes. 

So then this patch was buggy because blh was not reinitialized as we
loop through multiple work requests?  eg an LSO request followed by a
non-LSO request?

Anyway I'll look over the newer patch.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] mlx4: remove limitation on LSO header size

2009-10-07 Thread Roland Dreier

  +*blh = unlikely(halign  64) ? 1 : 0;

This idiom of (boolean condition) ? 1 : 0 looks odd to me... doesn't
(halign  64) already evaluate to 1 or 0 anyway?  Does the unlikely()
actually affect code generation here?

With that said, see below...

  +int blh = 0;

I assume this initialization is to avoid a compiler warning.  But the
code is actually correct without initializing blh -- so I think that we
save a tiny bit of code by doing uninitialized_var() instead?

  +(blh ? cpu_to_be32(1  6) : 0);

...given that the only use of blh is as a flag to decide what constant
to use here, does it generate better code to make blh be __be32 and set
the value directly in build_lso_seg, ie do:

*blh = unlikely(halign  64) ? cpu_to_be32(1  6) : 0;

and then use blh without ?: in mlx4_ib_post_send...

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] rping is not resolving ipv6 addresses

2009-10-07 Thread Roland Dreier

  I have been testing rping with ipv6 address on OFED 1.5. I am seeing a
  problem with ipv6 address resolution. rping is not always sending
  neighbor solicitation out the correct interface. Running tcpdumps I
  discovered that the neighbor solicitation is being sent out the first
  interface that was configured, in my case eth0 and not the ib0
  interface.
  
  This is the test I am running:
  
  Host A:
  $ ip address show ib0 | grep inet6
  inet6 fe80::202:c903:1:1925/64 scope link
  
  $ rping -s -v -a ::0
  
  Host B:
  $ rping -c -v -a fe80::202:c903:1:1925
  
  cma event RDMA_CM_EVENT_ADDR_ERROR, error -110
  waiting for addr/route resolution state 1
  
  
  Using tcpdump I verified the neighbor solicitation packets went out the
  eth0 interface not ib0. If I ifdown eth0 I see the neighbor
  solicitation go out ib0 and the rping works. BTW: Ping6 works ok;
  however, it requires that I specify the interface to use on the command
  line: $ ping6 fe80::202:c903:1:1925%ib0

I'm not sure how many places in the stack have to be fixed for this, but
for this to work certainly at least rping needs a way to specify which
interface to use.  There is no routing for IPv6 link-local addresses
that says which device to use... a given address with prefix fe80: could
conceivably exist on multiple links (and correspond to different
destinations on each link), and the networking stack has no way to know
which link to use unless you tell it somehow.  (This is why ping6
requires the %ib0 specifier)
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH] IB/ehca: Fix CQE flags reporting

2009-09-01 Thread Roland Dreier
applied, thanks
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH] IB/ehca: Construct MAD redirect replies from request MAD

2009-08-31 Thread Roland Dreier
this seems reasonable to me, applied, thanks.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] IB/ehca: Construct MAD redirect replies from request MAD

2009-08-28 Thread Roland Dreier

  Given that you seem to like the rest of the code and Jason hasn't spoken 
  up yet, I think we can have Roland merge this patch. Roland, what do you 
  think?

I don't see any problem with the idea and this does sound like a step
forward, so I am planning on merging this (pending review).
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH] mlx4_core: map sufficient ICM memory for EQs

2009-08-07 Thread Roland Dreier
Thanks, applied with a few cleanups:
  ilog2(roundup_pow_of_two())  -  order_base_2()
  xxx * (1  yy)  -  xxx  yy
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ofa-general] Re: [ewg] [Patch mthca backport] Don't use kmalloc 128k

2009-07-27 Thread Roland Dreier

   And I don't think the upstream kernel has that limit on kmalloc size
   either (at least with SLUB, not sure about SLAB).
  
  This patch was actually written as an emulation of the upstream SLUB
  behavior, which is exactly the same thing: on large allocations
  forward to __g_f_p().  See include/linux/slub_def.h's definition of
  kmalloc_large and kmalloc.

Right.  But does upstream SLAB also pass through to the page allocator
the same as SLUB?  How about SLQB?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ofa-general] Re: [ewg] [Patch mthca backport] Don't use kmalloc 128k

2009-07-23 Thread Roland Dreier

  This will fix the 2^20 bits limit on our bitmaps once and for all.

Not really... since getting  128KB of contiguous memory is likely to
fail anyway.

And I don't think the upstream kernel has that limit on kmalloc size
either (at least with SLUB, not sure about SLAB).

Really the long-term fix is to handle non-contiguous memory in the
bitmap allocator.  maybe using vmalloc(), although I always hate big
allocations with vmalloc too.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH] ehca: use port autodetect mode as default

2009-07-08 Thread Roland Dreier
looks good, applied
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] install.pl: Install QIB driver instead of Ipath

2009-07-08 Thread Roland Dreier

  This patch installs the qib driver which replaces the ipath driver
  in OFED-1.5.

Maybe I missed some discussion of this.

But what is the QIB driver?  What are you planning to do to support
qlogic HCAs for the mainline kernel?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [PATCH 2.6.31 try 2] ehca: Tolerate dynamic memory operations and huge pages

2009-06-23 Thread Roland Dreier
applied...
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [PATCH 2.6.31 try 2] ehca: Tolerate dynamic memory operations and huge pages

2009-06-22 Thread Roland Dreier
thanks, applied.

  -#define HCAD_VERSION 0026
  +#define HCAD_VERSION 0027

the driver version was already 0027 (since bde2cfaf), so I dropped this chunk.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [PATCH 2/9] ib_core: kernel API for GID -- MAC translations

2009-06-18 Thread Roland Dreier

  on top of which trees would like me to rebase the patches on and update
  the ABI version?

I guess you could base it on my xrc branch but as I said I'm not
planning on merging this until I'm done with the XRC stuff.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID --MAC translations

2009-06-17 Thread Roland Dreier

  Hum, This is a very tricky subject. Co-mingling the IB GID address
  space and the IPv6 address space like this is not really something
  that was envisioned from the IBA side.

Doesn't the IB spec say that an IB GID *is* an IPv6 address?  So in
theory it should be OK; however I don't think in practice anyone paid
attention to making sure that the IB GID space works as an IPv6 address.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID --MAC translations

2009-06-17 Thread Roland Dreier

  It is like an IPv6 address but it was expressly envisioned to be a
  seperate space. The IBA authors copied many of the conventions from
  IPv6 for numbering this new space, like link local, and multicast
  prefixes, but it was not intended to be co-mingled.

Well (I've quoted this many times): IBA section 4.1:

   A GID is a valid 128-bit IPv6 address (per RFC 2373) with additional
properties / restrictions defined within IBA...

People often try to claim that this sentence doesn't mean what it very
explicitly and clearly says, and certainly I believe that existing
practice doesn't comply with the IBA spec, but I don't see how anyone
can say that a truly compliant IB GID is not a real IPv6 address.

  So, I didn't look closely enough, but what was the ethertype that is
  used here in this patch set? Hopefully not IPv6.

I don't think it's specified in the code -- presumably in HCA FW.  Which
is an issue as you say -- do we have an IEEE ethertype yet?  And if we
don't use the IPv6 ethertype, then is multicast going to work well (if
the code is moved away from just broadcasting everything)?  I doubt that
switch IGMP snooping works well for non-IP ethertypes -- in fact I
wonder how well existing switches handle IPv6 multicast ;)

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID --MAC translations

2009-06-17 Thread Roland Dreier

  Hmm, murky indeed. Your point about IGMPv6 is well made. The problem
  is that IB GRHs are not IPv6 headers and have different numerology for
  the Next Header field. Ie in IPv6 Next Header 0x1B is RFC 908, while
  in GRH it is a BTH. Labeling GRHs with an IPv6 ethertype is
  fundamentally wrong.

Yes, but the next header is the only issue I know of.  Since 0x1b is
already assigned as an IPv6 next header protocol, we would have to get a
new value assigned.  However once a non-conflicting value is chosen,
then an IB GRH really is an IPv6 header and in that case I think using
the IPv6 ethertype too would make things work much better -- eg IB
traffic actually could be forwarded by an IPv6 router with no additional
work required.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages

2009-06-16 Thread Roland Dreier

  Yeah, the notifier code remains untouched as we still do not allow dynamic
  memory operations _while_ our module is loaded. The patch allows the driver 
  to
  cope with DMEM operations that happened before the module was loaded, which
  might result in a non-contiguous memory layout. When the driver registers
  its global memory region in the system, the memory layout must be considered.
  
  We chose the term toleration instead of support to illustrate this.

I see.  So things just silently broke in some cases when the driver was
loaded after operations you didn't tolerate?

Anyway, thanks for the explanation.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ofa-general] RE: [ewg] [PATCH 0/9] RDMAoE - RDMA over Ethernet-- some procedural questions

2009-06-16 Thread Roland Dreier

   I think that in any case, OFA needs to have a consistent policy, and
   if we allow something that is not a standard for one member, it
   should be allowed for all members.

  Agreed; I think that this is my central point -- thanks for saying it
  succinctly!  Regardless of whether OF asked Oracle to submit RDS or
  not, it's not associated with any standard (I'm not picking on RDS;
  I'm picking on the OF rules and selectively applying them).
  Therefore, either the bylaws are wrong of OF/EWG is wrong.

The bylaw in question seems pretty silly given the lack of control or
involvement that OFA has with Linux kernel development.  Given that RDS
is not standardized at the API or wire level and given that RDS is
included in the Linux kernel, what options does the OFA have for
enforcing its bylaws?  Removing RDS from OFED?  Once OFED has *fewer*
features than the standard kernel it becomes pretty pointless; maybe the
logical conclusion is that OFA should get out of the distribution
business (my feelings about ending OFED are well-known I think).

The same applies to IBoE; if (and that really is if since I don't
think a conclusion about merging IBoE support into the kernel has been
reached) IBoE goes into the kernel but OFED can't or won't distribute
it, then the relevance of OFED becomes marginal.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [ofa-general] [PATCH 0/9] RDMAoE - RDMA over Ethernet

2009-06-16 Thread Roland Dreier

  If I might chime in here...I've been working to actively squash the
  expression 'IBoE' or any variation that includes IB in the name.  The reason
  is because the InfiniBand Architecture is defined as a cohesive solution
  that includes five layers of the OSI stack.  Hence using the expression IBoE
  creates unnecessary confusion by implying that RDMAoE includes elements of
  IB from the five OSI layers, which is not an accurate description of the
  proposal.

Seems pretty silly to me -- if I build an IB router that carries IB over
the WAN on MPLS, is it not InfiniBand because it uses the wrong layer 1?
Similarly the worlds seems to cope with talking about FCoE rather than
SCSIoE even though FCoE does not share all the OSI layers of fibre channel.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages

2009-06-12 Thread Roland Dreier
OK, one major issue with this patch and a few minor nits.

First, the major issue is that I don't see anything in the patch that
changes the code in ehca_mem_notifier() in ehca_main.c:

case MEM_GOING_ONLINE:
case MEM_GOING_OFFLINE:
/* only ok if no hca is attached to the lpar */
spin_lock_irqsave(shca_list_lock, flags);
if (list_empty(shca_list)) {
spin_unlock_irqrestore(shca_list_lock, flags);
return NOTIFY_OK;
} else {
spin_unlock_irqrestore(shca_list_lock, flags);
if (printk_timed_ratelimit(ehca_dmem_warn_time,
   30 * 1000))
ehca_gen_err(DMEM operations are not allowed
 in conjunction with eHCA);
return NOTIFY_BAD;
}

But your patch description says:

  This patch implements toleration of dynamic memory operations

But it seems you're still going to hit the same NOTIFY_BAD case above
after your patch.  So something doesn't compute for me.  Could you
explain more?

Second, a nit:

  +#define EHCA_REG_MR 0
  +#define EHCA_REG_BUSMAP_MR (~0)

and you pass these as the reg_busmap parm in:

   int ehca_reg_mr(struct ehca_shca *shca,
   struct ehca_mr *e_mr,
   u64 *iova_start,
  @@ -991,7 +1031,8 @@
   struct ehca_pd *e_pd,
   struct ehca_mr_pginfo *pginfo,
   u32 *lkey, /*OUT*/
  -u32 *rkey) /*OUT*/
  +u32 *rkey, /*OUT*/
  +int reg_busmap)

and test it as:

  +if (reg_busmap)
  +ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo);
  +else
  +ret = ehca_reg_mr_rpages(shca, e_mr, pginfo);

So the ~0 for true looks a bit odd.  One option would be to make
reg_busmap a bool, since that's how you're using it, but then you lose
the nice self-documenting macro where you call things.

So I think it would be cleaner to do something like

enum ehca_reg_type {
EHCA_REG_MR,
EHCA_REG_BUSMAP_MR
};

and make the int reg_busmap parameter into enum ehca_reg_type reg_type
and have the code become

+   if (reg_type == EHCA_REG_BUSMAP_MR)
+   ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo);
+   else if (reg_type == EHCA_REG_MR)
+   ret = ehca_reg_mr_rpages(shca, e_mr, pginfo);
+   else
+   ret = -EINVAL

or something like that.

  +struct ib_dma_mapping_ops ehca_dma_mapping_ops = {
  +.mapping_error = ehca_dma_mapping_error,
  +.map_single = ehca_dma_map_single,
  +.unmap_single = ehca_dma_unmap_single,
  +.map_page = ehca_dma_map_page,
  +.unmap_page = ehca_dma_unmap_page,
  +.map_sg = ehca_dma_map_sg,
  +.unmap_sg = ehca_dma_unmap_sg,
  +.dma_address = ehca_dma_address,
  +.dma_len = ehca_dma_len,
  +.sync_single_for_cpu = ehca_dma_sync_single_for_cpu,
  +.sync_single_for_device = ehca_dma_sync_single_for_device,
  +.alloc_coherent = ehca_dma_alloc_coherent,
  +.free_coherent = ehca_dma_free_coherent,
  +};

I always think structures like this are easier to read if you align the
'=' signs.  But no big deal.

  +ret = ehca_create_busmap();
  +if (ret) {
  +ehca_gen_err(Cannot create busmap.);
  +goto module_init2;
  +}
  +
   ret = ibmebus_register_driver(ehca_driver);
   if (ret) {
   ehca_gen_err(Cannot register eHCA device driver);
   ret = -EINVAL;
  -goto module_init2;
  +goto module_init3;
   }
   
   ret = register_memory_notifier(ehca_mem_nb);
   if (ret) {
   ehca_gen_err(Failed registering memory add/remove notifier);
  -goto module_init3;
  +goto module_init4;

Having to renumber unrelated things is when something changes is why I
don't like this style of error path labels.  But I think it's well and
truly too late to fix that in ehca.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH] IB/ehca: Remove superfluous bitmasks from QP control block

2009-06-03 Thread Roland Dreier
looks fine, applied for 2.6.31
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH 2/2] ib_mthca: Use module parameter for number of MTTs per segment

2009-05-27 Thread Roland Dreier
Sigh... unfortunate to add a tunable that people have to mess with
rather than just making things work automatically somehow.  Anyway,
applied these.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc

2009-04-28 Thread Roland Dreier
  did you have a chance to take a look at the patchset and will you apply it, 
  or
  are there any outstanding issues we need to address?

I guess it's OK, but definitely 2.6.31 material.  I guess I'll stick it
linux-next soon.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc

2009-04-28 Thread Roland Dreier
thanks, applied.

  From: Anton Blanchard antonb at au1.ibm.com
  Signed-off-by: Stefan Roscher stefan.roscher at de.ibm.com

please use '@' signs so these are real email addresses.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH 3/3] IB/ehca: Increment version number

2009-04-28 Thread Roland Dreier
thanks, applied 2  3.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH 0/3] IB/ehca: Perfomance improvment for creation of queue pairs

2009-04-21 Thread Roland Dreier
  Because of this fundamental code change we will also increment the version 
  number.

So this is 2.6.31-targeted stuff I assume?  (Too late in the 2.6.30
cycle for fundamental code change of course)

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc

2009-04-21 Thread Roland Dreier
  +queue-queue_pages = kmalloc(nr_of_pages * sizeof(void *), GFP_KERNEL);

How big might this buffer be?  Any chance of allocation failure due to
memory fragmentation?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] [PATCH] ib_mad: Fix RMPP header RRespTime manipulation

2009-02-27 Thread Roland Dreier
thanks, applied.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem

2009-01-29 Thread Roland Dreier
  I see... you're right - no need to stick the address into struct
  ib_umem. Following this email is a new patch for mlx4_ib only. I
  excluded support for both powerpc and ia64 since I could not find a
  way to get HPAGE_SIZE (or HPAGE_SHIFT) for them.

#include asm/page.h ?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem

2009-01-29 Thread Roland Dreier
  It does not help. The problem with powerpc is that HPAGE_SHIFT is an
  unexported variable and for ia64 it's hpage_shift.

I see.  hpage_shift is exported on ia64, so that should be OK.  And I
guess for powerpc it is just a matter of adding an export so we can use
it in a module.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem

2009-01-26 Thread Roland Dreier
  add address field to struct ib_umem so low level drivers will have this
  information which may be needed in order to correctly calculate the number of
  huge pages.

seems like a really strange thing to do:

+   umem-address   = addr;

this value addr is coming from the low-level driver, so I'm not clear
why we need to stick it into umem?  What am I missing?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH v2] mlx4_ib: Optimize hugetlab pages support

2009-01-26 Thread Roland Dreier
  +n = PAGE_ALIGN(umem-length + (umem-address  ~HPAGE_MASK))  
  HPAGE_SHIFT;

This is still wrong I think.  What if the user, say, registers 1MB that
is aligned exactly at a 2MB huge page start?  Then wouldn't this
expression set n to 0?  I think that PAGE_ALIGN() needs to be a
HPAGE_ALIGN(), although of course that doesn't actually exist, so we
would need to do ALIGN(..., HPAGE_SIZE) instead... but given that
we're just going to do  HPAGE_SHIFT afterwards, maybe
DIV_ROUND_UP(..., HPAGE_SIZE) is a better way to write it.

It maybe is needed to make sure that the virtual address requested is
aligned appropriately for using bigger pages (although this should
always be the case in the current code I guess, since the userspace
virtual address always matches the HCA MR virtual address).

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem

2009-01-26 Thread Roland Dreier
  It is has to be saved either at the low level driver's mr object,
  e.g. struct mlx4_ib_mr, or at a common place like struct ib_umem. Do
  you prefer that it will be saved in struct mlx4_ib_mr?

I don't see why it has to be saved anywhere?  The only place you use
umem-address is in handle_hugetlb_usermr(), and you could just as
easily pass in start directly as a parameter (since
mlx4_ib_reg_user_mr() has that value in a parameter anyway).

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH v1] mlx4_ib: Optimize hugetlab pages support

2009-01-22 Thread Roland Dreier
OK, looks better.  However the patch had a bunch of whitespace problems
(run checkpatch.pl to see them).  Also:

  +static int handle_hugetlb_user_mr(struct ib_pd *pd, struct mlx4_ib_mr *mr,
  +  u64 virt_addr, int access_flags)
  +{
  +#ifdef CONFIG_HUGETLB_PAGE
  +struct mlx4_ib_dev *dev = to_mdev(pd-device);
  +struct ib_umem_chunk *chunk;
  +unsigned dsize;
  +dma_addr_t daddr;
  +unsigned uninitialized_var(cur_size);
  +dma_addr_t uninitialized_var(cur_addr);
  +int restart;
  +int n;
  +struct ib_umem  *umem = mr-umem;
  +u64 *arr;
  +int err = 0;
  +int i;
  +int j = 0;
  +
  +n = PAGE_ALIGN(umem-length + umem-offset)  HPAGE_SHIFT;

seems this might underestimate by 1 if the region doesn't start/end on a
huge-page aligned boundary (but we would still want to use big pages to
register it).

  +arr = kmalloc(n * sizeof *arr, GFP_KERNEL);
  +if (!arr)
  +return -ENOMEM;
  +
  +restart = 1;
  +list_for_each_entry(chunk, umem-chunk_list, list)
  +for (i = 0; i  chunk-nmap; ++i) {
  +daddr = sg_dma_address(chunk-page_list[i]);
  +dsize = sg_dma_len(chunk-page_list[i]);
  +if (restart) {
  +cur_addr = daddr;
  +cur_size = dsize;
  +restart = 0;
  +} else if (cur_addr + cur_size != daddr) {
  +err = -EINVAL;
  +goto out;
  +} else
  +cur_size += dsize;
  +
  +if (cur_size  HPAGE_SIZE) {
  +err = -EINVAL;
  +goto out;
  +} else if (cur_size == HPAGE_SIZE) {
  +restart = 1;
  +arr[j++] = cur_addr;
  +}
  +}

I think we could avoid the uninitialized_var() stuff and having restart
at all by just doing cur_size = 0 at the start of the loop, and then
instead of if (restart) just test if cur_size is 0.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH] mlx4_ib: Optimize hugetlab pages support

2009-01-21 Thread Roland Dreier
  Since Linux does not merge adjacent pages into a single scatter entry through
  calls to dma_map_sg(), we do this for huge pages which are guaranteed to be
  comprised of adjacent natural pages of size PAGE_SIZE. This will result in a
  significantly lower number of MTT segments used for registering hugetlb 
  memory
  regions.

Actually on some platforms, adjacent pages will be put into a single sg
entry (eg look at VMERGE under arch/powerpc).  So your
mlx4_ib_umem_write_huge_mtt() function won't work in such cases.  In any
case relying on such a fragile implementation detail with no checking is
not a good idea for future maintainability.  So this needs to be
reimplemented to work no matter how the DMA mapping layer presents the
hugetlb region (and also fail gracefully if the bus addresses returned
from DMA mapping don't work with the huge page size -- since it would be
correct for the DMA mapping layer to map every PAGE_SIZE chunk to an
arbitrary bus address, eg if the IOMMU address space becomes very
fragmented)

  @@ -142,15 +176,34 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, 
  u64 start, u64 length,
   
   n = ib_umem_page_count(mr-umem);
   shift = ilog2(mr-umem-page_size);
  -
  -err = mlx4_mr_alloc(dev-dev, to_mpd(pd)-pdn, virt_addr, length,
  +if (mr-umem-hugetlb) {
  +nhuge = ALIGN(n  shift, HPAGE_SIZE)  HPAGE_SHIFT;
  +shift_huge = HPAGE_SHIFT;
  +err = mlx4_mr_alloc(dev-dev, to_mpd(pd)-pdn, virt_addr, 
  length,
  +convert_access(access_flags), nhuge, 
  shift_huge, mr-mmr);
  +if (err)
  +goto err_umem;
  +
  +err = mlx4_ib_umem_write_huge_mtt(dev, mr-mmr.mtt, mr-umem, 
  nhuge);
  +if (err) {
  +if (err != -EAGAIN)
  +goto err_mr;
  +else {
  +mlx4_mr_free(to_mdev(pd-device)-dev, 
  mr-mmr);
  +goto regular_pages;
  +}
  +}
  +} else {
  +regular_pages:
  +err = mlx4_mr_alloc(dev-dev, to_mpd(pd)-pdn, virt_addr, 
  length,
   convert_access(access_flags), n, shift, mr-mmr);
  -if (err)
  -goto err_umem;
  +if (err)
  +goto err_umem;
   
  -err = mlx4_ib_umem_write_mtt(dev, mr-mmr.mtt, mr-umem);
  -if (err)
  -goto err_mr;
  +err = mlx4_ib_umem_write_mtt(dev, mr-mmr.mtt, mr-umem);
  +if (err)
  +goto err_mr;
  +}
   
   err = mlx4_mr_enable(dev-dev, mr-mmr);
   if (err)

Also I think this could be made much clearer by putting the huge page
handling code in a separate function like mlx4_ib_reg_hugetlb_user_mr()
and then you could just do something like

if (!mr-umem-hugetlb || mlx4_ib_reg_hugetlb_user_mr(...)) {
// usual code path
}

rather than having to use a goto into another block.

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: kernel panic while using ipoib with lro enabled

2008-12-29 Thread Roland Dreier
  - I've added a print of the skb-data_len and skb-len sizes right
  before the BUG_ON line.
skb-data_len was always 0, so I don't understand why there is a kernel 
  panic.

A race of two CPUs doing an skb_pull or something like that?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: kernel panic while using ipoib with lro enabled

2008-12-29 Thread Roland Dreier
By the way, I don't seem to be able to reproduce this easily in my
setup.  What kind of HCA/system are you using?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: kernel panic while using ipoib with lro enabled

2008-12-28 Thread Roland Dreier
  We've discovered that sometimes we get a kernel panic while using
  ipoib with lro enabled.
  You can find all the details here:
  
  https://bugs.openfabrics.org/show_bug.cgi?id=1473
  
  This bug is reproduced both in kernel 2.6.27 and OFED 1.4.

I'll try to reproduce it myself, but in the meantime, can you reproduce
this with 2.6.28 or even better latest upstream Linus's git tree
(without OFED)?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH] IPoIB: fix race when manipulating mcast SKBs queue

2008-12-17 Thread Roland Dreier
  ipoib_mcast_free() dequeues SKBs pending on the pkt_queue but needs to do 
  that
  with netif_tx_lock_bh() acquired.

I don't see why this would be required.  When ipoib_mcast_free() runs,
the mcast structure has been removed from all lists and I don't see how
any other context could simultaneously be adding packets to pkt_queue.
What is the race that you think this fixes?

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH] IB/ehca: Change misleading error message

2008-11-25 Thread Roland Dreier
  The error message printed when the eHCA driver prevents memory hotplug is
  misleading -- the user might think that hot-removing the lhca, hotplugging
  memory, then hot-adding the lhca again will work, but it doesn't.

That's too bad... I applied this patch but out of curiousity, why
doesn't the hot-remove/hot-add work?  I would have thought that
re-registering all of memory after the hot-add would do the right thing.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [PATCH] IB/ehca: Fix locking for shca_list_lock

2008-11-21 Thread Roland Dreier
Looks good... I'll add this for 2.6.29, since as far as I can tell this
bug has been there approximately forever already.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


  1   2   3   >