[ofa-general] Re: [RFC PATCH] rds: enable rdma on iWARP

Steve Wise Mon, 28 Jul 2008 10:55:14 -0700

Olaf Kirch wrote:

On Monday 28 July 2008 17:29:20 Jon Mason wrote:

The bulk of this patch is removing the pre-existing posting of the invalidate
logic and adding it prior to the fastreg send posting.  The previous logic
assumed that posting an invalidate to a dummy qp would successfully invalidate
the entry.  Unfortunately, the invalidate must be posted on the same qp as the
fastreg and the pre-existing logic does not have a way to get the qp the fastreg
is posted on.

This isn't quite correct. The invalidate must be posted on a connected qp in the same pd. But it doesn't have to be the same qp as the fastreg. However, if you use different qps, then you must coordinate that you're done using the mr before invalidating it.

Then I don't see how this is going to work, ever. When the Oracle IPC
creates an MR, we do not know yet with which peer it wants to use it.
And in fact it may want to use the same MR with several peers... I'm
not sure about that detail but I think that's the case.

First off, there's a semantic snag here which I wanted to avoid by
*not* pairing the inval with the remap. Essentially, if the application
calls FREE_MR with the invalidate flag set, it actually expects *all*
previously freed MRs to be invalidated. In the FMR world, this
amounts to unmapping all FMRs on the dirty list, and batch destroying
them - this means we clean up the host side data structures, then issue
a SYNC_TPT, and we're done. Not fast, but if you get good batching it
doesn't slow you down too much.
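
The FMR-side batching described above can be sketched as a small model. This is only an illustration under assumptions: the names model_fmr_pool, model_free_mr and model_flush are hypothetical and are not RDS or verbs functions, and the counter stands in for the one SYNC_TPT issued per batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_DIRTY_MAX 64

/* Hypothetical model of an FMR pool with a dirty list (not the RDS code). */
struct model_fmr_pool {
    int      dirty[MODEL_DIRTY_MAX]; /* ids of freed, not yet unmapped MRs */
    size_t   ndirty;                 /* number of entries on the dirty list */
    unsigned syncs;                  /* how many batch syncs were issued */
};

/* Unmap everything on the dirty list with one sync; returns the batch size. */
static size_t model_flush(struct model_fmr_pool *pool)
{
    size_t n;

    assert(pool != NULL);
    n = pool->ndirty;
    if (n == 0)
        return 0;
    pool->ndirty = 0;   /* host-side cleanup of all dirty entries */
    pool->syncs++;      /* stands in for the single SYNC_TPT per batch */
    return n;
}

/* FREE_MR: queue the MR; with the invalidate flag, flush all freed MRs. */
static void model_free_mr(struct model_fmr_pool *pool, int mr_id, bool invalidate)
{
    assert(pool != NULL);
    if (pool->ndirty == MODEL_DIRTY_MAX)
        model_flush(pool);          /* make room when the list is full */
    pool->dirty[pool->ndirty++] = mr_id;
    if (invalidate)
        model_flush(pool);          /* invalidates *all* previously freed MRs */
}
```

Freeing two MRs without the flag and a third with it yields a single sync covering all three, which is the batching that keeps the cost tolerable.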

Now when you pair remaps and invalidates, you will get MRs that are
in the process of being remapped, but the LOCAL_INV hasn't completed
yet - so you need to add a lot of tracking for these. I tried,
and it became very ugly very fast. That's why I used separate code
paths for remap and invalidate - and as far as I understand there's
no problem with that. The r_key's remap counter gets incremented
every time you map something, so you essentially get a different
r_key each time. Am I correct that with this approach, you can have
a RKEY(b, v) made up of a base stag b and a version counter v.
You can map
        MAP RKEY(b, 0) -> some memory
        MAP RKEY(b, 1) -> some other memory
        MAP RKEY(b, 2) -> yet some other memory
        INVAL RKEY(b, 0)
        INVAL RKEY(b, 1)


No, you must invalidate the MR between fastreg calls.  Like this:

FASTREG RKEY(b, 0) -> some memory
INVALIDATE RKEY(b, 0)
FASTREG RKEY(b, 1) -> some other memory
INVALIDATE RKEY(b, 1)
FASTREG RKEY(b, 2) -> yet some other memory

If you post them all on the same QP then you can use fencing to keep the pipeline full. If you want to use different qps for the invalidates, then you must manage that you invalidate only when you're done using them.
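
The required alternation of FASTREG and INVALIDATE can be illustrated with a small state model. This is a sketch under assumptions, not driver code: model_mr, model_rkey, model_fastreg and model_invalidate are hypothetical names, and the split of the key into a 24-bit base and an 8-bit version in the low byte is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one fast-register MR: one entry, one mapping. */
struct model_mr {
    uint32_t    rkey;   /* assumed layout: base in upper 24 bits, version in low 8 */
    bool        valid;  /* true while a mapping is registered */
    const void *pbl;    /* the single page list currently attached, or NULL */
};

/* Compose RKEY(b, v) from a base stag and a version counter. */
static uint32_t model_rkey(uint32_t base, uint8_t version)
{
    return (base & 0xffffff00u) | version;
}

/* FASTREG: only allowed while the MR is invalid. Returns 0, or -1 if mapped. */
static int model_fastreg(struct model_mr *mr, uint8_t version, const void *pbl)
{
    assert(mr != NULL);
    if (mr->valid)
        return -1;      /* a second mapping without an invalidate is refused */
    mr->rkey = model_rkey(mr->rkey, version);
    mr->pbl = pbl;
    mr->valid = true;
    return 0;
}

/* INVALIDATE: must name the key of the current mapping. Returns 0 or -1. */
static int model_invalidate(struct model_mr *mr, uint32_t rkey)
{
    assert(mr != NULL);
    if (!mr->valid || mr->rkey != rkey)
        return -1;
    mr->valid = false;
    mr->pbl = NULL;
    return 0;
}
```

In this model, a FASTREG of RKEY(b, 1) issued while RKEY(b, 0) is still mapped returns -1, while the interleaved sequence shown above succeeds at every step.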

and so on? Or does the HCA driver keep pointers to the caller's
data structures around somewhere so that repeated MAP requests
without intervening INVAL would lead to corruption?

It doesn't have to do with the caller's data structs. The simple fact is, you cannot fast register a mr to more than one pbl at a time. If you think about adapter resources, there is a single MR entry for the fast reg MR and one PBL entry for whatever the pbl is for that current mapping.

If that is the case, I would leave the approach of a separate map
and inval in place, because free+invalidate becomes rather simple
with this: you just post all the inval requests and wait for them
to complete.

Note you can just dereg the MR to invalidate the last mapping. I.e., you don't need to post an invalidate if you are going to call ib_dereg_mr() to destroy the fast reg mr.

My original approach was rather simplistic, in that I wanted to
post the INVAL request to just a single dummy QP. If that doesn't
work (which I think is a deficiency of the interface) then we
need to record the original rds_conn somewhere with the mapping,
so that we know which QP to post it to.

First off, any send WR posted to a QP not in RTS does nothing. For iWARP QPs, you only enter into RTS when the QP is connected. So a dummy QP just won't work. We can argue about the interface deficiencies if you want, but the semantics are part of the IBTA and iWARP specs, so we probably shouldn't change it much.
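
The two conditions stated in this thread for an invalidate (the QP must be connected, i.e. in RTS, and it must be in the same PD as the MR) can be sketched as a simple check. All names here (model_pd, model_qp, model_pd_mr, model_post_local_inv) are hypothetical and only model the semantics described above; returning -1 is the model's way of saying the request has no effect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical QP states, loosely following the usual verbs state names. */
enum model_qp_state {
    MODEL_QPS_RESET,
    MODEL_QPS_INIT,
    MODEL_QPS_RTR,
    MODEL_QPS_RTS,
    MODEL_QPS_ERR
};

struct model_pd { int id; };

struct model_qp {
    const struct model_pd *pd;
    enum model_qp_state    state;
};

struct model_pd_mr {
    const struct model_pd *pd;
    bool                   valid;
};

/*
 * Model of posting a local invalidate: it only takes effect on a QP that
 * is in RTS and shares the MR's PD. It need not be the QP that posted the
 * fastreg. Returns 0 if the MR was invalidated, -1 otherwise.
 */
static int model_post_local_inv(const struct model_qp *qp, struct model_pd_mr *mr)
{
    assert(qp != NULL && mr != NULL);
    if (qp->state != MODEL_QPS_RTS)
        return -1;      /* unconnected "dummy" QP: the WR does nothing */
    if (qp->pd != mr->pd)
        return -1;      /* different protection domain */
    mr->valid = false;
    return 0;
}
```

In the model, a dummy QP left in INIT fails the check, while any connected QP in the same PD succeeds, even if the fastreg was posted elsewhere.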

However, I still have doubts all of this will work very well.
If you have to pipeline R_Key invalidations to a variety of QPs,
you may face QPs that are heavily contended - actually so much
that you may not even be able to get a single inval request onto
the queue because the application keeps hogging the pipe with SENDs
or other transactions. IOW a single SEND intensive application can
starve another app calling FREE+invalidate almost indefinitely.

I would think a single SEND intensive app could starve other apps trying to use the same QP anyway, so you must have some sort of fairness logic, eh?

Second, what do you do if a QP errors out? Does that render all
R_Keys issued previously on that QP invalid?

No. The R_Keys aren't tied to the QP, except that if you have pending fastreg or invalidate WRs, then the R_Key is tied to that QP until the WRs complete.

That sounds like a
real knockout problem to me. Imagine 1000 processes doing RDMA,
all of them busily obtaining r_keys for some mapping, and asking
the remote to rdma to/from that memory. Now thanks to an application
bug, *one* transfer refers to an r_key that's bogus. Now your
connection goes down with remote access error. Big deal, RDS
will just nix all outstanding RDMA transfers, reconnect and create
a new QP.

Now, if it is actually as you say and memory registrations are
bound to the QP they were created on, this means all previously
created mappings have been invalidated at once. What happens next?
Some applications will obtain a fresh mapping and retry their
RDMA. Others will have been lucky, and didn't have an RDMA in flight
that was dropped on the floor - they will initiate an RDMA with
an r_key that is suddenly no longer valid! Guess what happens - the
connection goes down again!

This looks a lot like network Chernobyl to me.
Or, if you will, a design flaw. A mapping obtained on a
given QP should be usable with other QPs bound to the same
device, and you should be able to invalidate it on any QP
bound to the same device.


It is.  You can.

Olaf


