Gleb Natapov wrote:
On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
SSQ is needed for scalability, no need to explain this (by the way, RD is needed for the same reason too. What is Mellanox's plan to support it?
RD is not supported in hardware today. Implementing RD is extremely
complicated. To solve the scalability issues of MPI-like applications,
we believe that SRC and SSQ are the right solutions. They are much simpler
to implement in both software and hardware. By MPI-like I mean
applications that have some level of trust between two processes of the
same application. RD also has some performance issues, as it only supports
one message in the air. Those performance issues are solved by design in
SRC/SSQ.

I didn't know about the RD limitation. Is this a shortcoming of the IB spec or
a general limitation of reliable datagram? RD looks much nicer to me than SRC/SSQ.

The RD limitation is part of the IB spec.

It is part of the spec after all, so why invent new shiny stuff when it is still possible to achieve better scalability without it).
It's truly about complexity. And as I mentioned at the OFA meeting in
Sonoma, Mellanox is willing to contribute SRC/SSQ to the IB spec as well.

We are discussing your implementation proposal, and in my opinion it doesn't fit application needs. I may be wrong here, so if there is somebody who thinks that sending random completions to random processes is the best idea ever, and that the absence of this "feature" is the only thing that stops him from adopting IB, he may chime in here and voice his opinion.
Your input about how to demultiplex send completions on SSQ is valuable. Unfortunately it is not supported in the current generation.
What I can suggest here is not new on this thread, but:
1) all pollers see the same CQ; only the poller that a completion
      belongs to takes it out of the CQ
The progress of one process depends on all other processes on the same node.
Not good at all.
In MPI, it often happens that all processes depend on each other to make forward progress, one way or another. I am not saying that this is the ideal solution, but there is some price involved in sharing resources. You can always upgrade resources for a process that utilizes them, e.g. if the communication pattern is that each process talks with 4 neighbors, then let it have dedicated unshared QPs.
2) only one process polls the CQ; if a completion doesn't belong to the
poller, the poller puts it in a SW queue for the right process. The other
processes just poll on the SW queue
Not good, for the same reason.

As a variant, each process can poll both the HW CQ and its SW CQ; if a
completion from the HW CQ belongs to another process, it puts it on the
appropriate SW CQ. I don't think that a reasonable API would require such
effort from applications (and I am not even talking about all the locking
overhead and cache bouncing that would result from such an implementation,
but latency will be bad, that's for sure).
I don't think that polling on SQ completions is in the latency path. You usually need it in order to free networking buffers. In any case I understand your point.
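
To make the variant above concrete, here is a rough sketch in Python. It is only a toy model under stated assumptions: plain in-memory deques stand in for the shared HW CQ and the per-process SW queues, the class and method names (`SharedCompletionDemux`, `hw_complete`, `poll`) are made up for illustration, and the locking and shared-memory handling that a real multi-process implementation would need is left out entirely.

```python
from collections import deque


class SharedCompletionDemux:
    """Toy model: one shared HW CQ plus one SW queue per process.

    Hypothetical names throughout; no real verbs calls are used.
    """

    def __init__(self, process_ids):
        self.hw_cq = deque()  # stands in for the shared hardware CQ
        self.sw_queues = {pid: deque() for pid in process_ids}

    def hw_complete(self, owner, wr_id):
        """Simulate the HCA reporting a send completion owned by `owner`."""
        self.hw_cq.append((owner, wr_id))

    def poll(self, pid):
        """Return the next completion for `pid`, or None if nothing is ready."""
        # Completions forwarded earlier by another poller are older, so
        # they are returned first to keep per-process ordering.
        if self.sw_queues[pid]:
            return self.sw_queues[pid].popleft()
        # Otherwise drain the shared CQ, forwarding entries owned by others.
        while self.hw_cq:
            owner, wr_id = self.hw_cq.popleft()
            if owner == pid:
                return wr_id
            self.sw_queues[owner].append(wr_id)
        return None
```

Even in this toy form, the concern raised above is visible: a process that polls ends up doing work on behalf of its neighbors, and in a real implementation every access to `hw_cq` and `sw_queues` would have to be synchronized across processes.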
3) the SQ will have a "completed WQE index" reported. Everybody can
     look at it and determine how many WQEs have completed. This one has
     some cons because the CQ is not shared here... need to bake this one more.
And where will the application get the WC? Or should it maintain its own
queue of WQEs?
In this method, each app should have its own queue.
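
A rough Python sketch of what method 3 could look like from the application side. This is a guess at the bookkeeping, not a description of any hardware interface: it assumes that WQEs on the shared SQ complete in posting order, that the index is a monotonically increasing counter with no wraparound, and that each process remembers the index it was given when it posted. All names (`SharedSendQueue`, `ProcessSendTracker`, `completed_index`) are hypothetical.

```python
from collections import deque


class SharedSendQueue:
    """Toy shared SQ exposing only a 'completed WQE index' (hypothetical)."""

    def __init__(self):
        self.next_index = 0       # index handed to the next posted WQE
        self.completed_index = 0  # number of WQEs completed so far

    def post(self):
        """Reserve a slot and return its index to the posting process."""
        idx = self.next_index
        self.next_index += 1
        return idx

    def hw_complete(self, count=1):
        """Simulate the hardware completing `count` more WQEs, in order."""
        self.completed_index = min(self.completed_index + count,
                                   self.next_index)


class ProcessSendTracker:
    """Per-process queue of posted WQEs, as each app would keep on its own."""

    def __init__(self, sq):
        self.sq = sq
        self.pending = deque()  # (wqe_index, wr_id) in posting order

    def post_send(self, wr_id):
        self.pending.append((self.sq.post(), wr_id))

    def reap(self):
        """Return wr_ids of this process's WQEs the index has passed."""
        done = []
        while self.pending and self.pending[0][0] < self.sq.completed_index:
            done.append(self.pending.popleft()[1])
        return done
```

With two trackers sharing one queue, each `reap()` call only ever touches that process's own `pending` queue and reads the shared index, so no completion has to be handed from one process to another.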
If we wrap one of these into the right API, once there is HW available
that can do the SSQ CQ demultiplexing, it can work without any API change.
That is something I don't see in the proposed API.

Looking at slide 6 of Dror's slides, "Scalable Reliable Connection", I see that the wire protocol is extended to send the DST SRQ as part of a header. The receiver side then puts the completion on the appropriate CQ according to this field. Does your proposal address this? How?
SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
unfortunately.
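
For readers following along, the receive-side demultiplexing described for SRC can be pictured with a few lines of Python. This is only an illustration of the idea: the header layout is not given in this thread, so the `dst_srq` argument, the `SrcReceiver` class and the use of a deque per CQ are all assumptions made for the sketch.

```python
from collections import deque


class SrcReceiver:
    """Toy receiver that routes completions by destination SRQ number."""

    def __init__(self):
        self.cq_by_srq = {}  # SRQ number -> CQ of the process owning it

    def register_srq(self, srq_num):
        """Create a CQ for `srq_num` and return it to the owning process."""
        cq = deque()
        self.cq_by_srq[srq_num] = cq
        return cq

    def on_packet(self, dst_srq, payload):
        """Place the completion on the CQ selected by the header field.

        Returns False when no SRQ with that number is registered.
        """
        cq = self.cq_by_srq.get(dst_srq)
        if cq is None:
            return False
        cq.append(payload)
        return True
```

The point of the sketch is that the routing key travels with the message, so each process polls only its own CQ. This is the property SSQ lacks for send completions in the current generation.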
Is it possible to add this with only a FW upgrade?
Unfortunately no.
But I think that with the right API we can abstract this, and later on
have better performance for it.

Who will put this additional data on the wire (HW, libibverbs, or maybe the app)? Also, I don't see this in Dror's slides, but the completion of a local operation should be demultiplexed to the appropriate CQ too. The WQE may contain an additional field, for instance, that tells where to put the completion. Once again, who will do the demux in your proposal (HW, libibverbs, or the app)? The right answer is almost certainly HW in both cases, so will Hermon support this? Or maybe you want to demultiplex everything inside libibverbs? In that case I want to see the design (preferably with a performance analysis).
One thing to mention. The way I see it follows the order of the
slides. First get SRC going to improve scalability. Then SSQ can be
added to improve scalability further. In other words, I am suggesting
that maybe we can worry about the SSQ deficiencies a bit later :)

That is my point! Let's do it once, let's do it right, and let's do it when
the HW is ready :)
SRC is ready in HW, it can be implemented in SW now and will significantly help scalability.
We can resume SSQ discussion or other alternatives later on...
--
                        Gleb.
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general