On 2014/11/19 1:49, Eric Windisch wrote:
I think for this cycle we really do need to focus on consolidating and
testing the existing driver design and fixing up the biggest
deficiency (1) before we consider moving forward with lots of new
+1
1) Outbound messaging connection re-use - right now every outbound
message creates and consumes a TCP connection - this approach scales
badly when neutron does large fanout casts.
I'm glad you are looking at this and by doing so, will understand the
system better. I hope the following will give some insight into, at
least, why I made the decisions I made:
This was an intentional design trade-off. I saw three choices here:
build a fully decentralized solution, build a fully-connected network,
or use centralized brokerage. I wrote off centralized brokerage
immediately. The problem with a fully connected system is that active
TCP connections are required between all of the nodes. I didn't think
that would scale and would be brittle against floods (intentional or
otherwise).
IMHO, I always felt the right solution for large fanout casts was to
use multicast. When the driver was written, Neutron didn't exist and
there was no use-case for large fanout casts, so I didn't implement
multicast, but knew it as an option if it became necessary. It isn't
the right solution for everyone, of course.
Using multicast adds complexity to the switch forwarding plane, which
must enable and maintain multicast group communication. For large
deployment scenarios, I prefer to keep forwarding simple and
easy to maintain. IMO, running a set of fanout-router processes in the
cluster can also achieve the goal.
The data path is:
  openstack-daemon --(sends the message with fanout=true)--> fanout-router
  fanout-router --(reads the matchmaker)--> sends to the destinations
Actually it just uses unicast to simulate multicast.
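
A minimal sketch of that data path, assuming the matchmaker can be
modelled as a plain mapping from topic to host names. The names
MATCHMAKER, fanout and the topic/host strings are hypothetical and are
not part of oslo.messaging; the send callable stands in for whatever
unicast transport the driver uses.

```python
from typing import Callable, Dict, List

# Hypothetical matchmaker contents: topic -> hosts listening on that topic.
MATCHMAKER: Dict[str, List[str]] = {
    "l3_agent": ["net-node-1", "net-node-2", "net-node-3"],
}


def fanout(topic: str,
           message: bytes,
           send: Callable[[str, bytes], None],
           matchmaker: Dict[str, List[str]] = MATCHMAKER) -> List[str]:
    """Simulate multicast: look up the topic, then unicast to each host."""
    delivered = []
    for host in matchmaker.get(topic, []):
        send(host, message)  # one unicast send per destination
        delivered.append(host)
    return delivered


# Record the sends instead of opening real sockets.
sent = []
hosts = fanout("l3_agent", b"sync_routers",
               lambda host, msg: sent.append((host, msg)))
```

The daemon only ever talks to the fanout-router, so the per-destination
connections are concentrated in the router processes.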
For connection reuse, you could manage a pool of connections and keep
those connections around for a configurable amount of time, after
which they'd expire and be re-opened. This would keep the most
actively used connections alive. One problem is that it would make the
service more brittle by making it far more susceptible to running out
of file descriptors by keeping connections around significantly
longer. However, this wouldn't be as brittle as fully-connecting the
nodes nor as poorly scalable.
+1. Setting a large fd limit is not a problem. Because we use a socket
pool, we can control the number of fds and keep it fixed.
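
A rough sketch of such a pool, assuming connections expire a fixed time
after they are opened and that the pool evicts the least recently used
connection once it reaches its cap. ExpiringPool, FakeConn, max_size and
ttl are illustrative names, not existing driver code; a real driver
would hand in a factory that opens a socket.

```python
import time
from collections import OrderedDict


class ExpiringPool:
    """Keep at most max_size connections; reopen any older than ttl seconds."""

    def __init__(self, connect, max_size=2, ttl=30.0, clock=time.monotonic):
        self._connect = connect
        self._max_size = max_size
        self._ttl = ttl
        self._clock = clock
        self._conns = OrderedDict()  # host -> (conn, opened_at), oldest use first

    def get(self, host):
        now = self._clock()
        entry = self._conns.pop(host, None)
        if entry is not None:
            conn, opened_at = entry
            if now - opened_at < self._ttl:
                self._conns[host] = entry  # mark as most recently used
                return conn
            conn.close()  # expired: fall through and reopen
        while len(self._conns) >= self._max_size:
            _, (oldest, _) = self._conns.popitem(last=False)
            oldest.close()  # evict to keep the fd count bounded
        conn = self._connect(host)
        self._conns[host] = (conn, now)
        return conn

    def __len__(self):
        return len(self._conns)


class FakeConn:
    """Stand-in for a socket so the sketch runs without a network."""

    def __init__(self, host):
        self.host = host
        self.closed = False

    def close(self):
        self.closed = True


now = [0.0]  # manual clock so expiry can be demonstrated
pool = ExpiringPool(FakeConn, max_size=2, ttl=30.0, clock=lambda: now[0])
c1 = pool.get("compute-1")
c1_again = pool.get("compute-1")  # reused, no new connection
c2 = pool.get("compute-2")
c3 = pool.get("compute-3")        # pool is full, so compute-1 is evicted
```

The cap bounds the number of fds the pool holds regardless of how many
peers exist, at the cost of reconnecting to peers that fall out of it.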
If OpenStack and oslo.messaging were designed specifically around this
message pattern, I might suggest that the library and its applications
be aware of high-traffic topics and persist the connections for those
topics, while keeping others ephemeral. A good example for Nova:
api->scheduler traffic would be persistent, whereas
scheduler->compute_node would be ephemeral. Perhaps this is something
that could still be added to the library.
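
One way this could look, sketched under the assumption that the
application declares its high-traffic topics up front.
TopicAwareSender, RecordingConn and the topic strings are hypothetical
and only illustrate the persistent/ephemeral split; nothing like this
exists in the library today.

```python
class TopicAwareSender:
    """Persist connections for declared topics; open and close for the rest."""

    def __init__(self, connect, persistent_topics):
        self._connect = connect
        self._persistent_topics = frozenset(persistent_topics)
        self._open = {}  # topic -> long-lived connection

    def send(self, topic, message):
        if topic in self._persistent_topics:
            conn = self._open.get(topic)
            if conn is None:
                conn = self._open[topic] = self._connect(topic)
            conn.send(message)  # connection stays open for the next send
        else:
            conn = self._connect(topic)
            try:
                conn.send(message)
            finally:
                conn.close()  # ephemeral: release the fd immediately


class RecordingConn:
    """Stand-in connection that records every open."""

    opened = []

    def __init__(self, topic):
        self.topic = topic
        self.closed = False
        RecordingConn.opened.append(topic)

    def send(self, message):
        pass

    def close(self):
        self.closed = True


sender = TopicAwareSender(RecordingConn, persistent_topics={"scheduler"})
for _ in range(3):
    sender.send("scheduler", b"select_destinations")  # api -> scheduler
    sender.send("compute.node-7", b"build_instance")  # scheduler -> compute
```

In this run the scheduler topic is connected once and reused, while each
send to the compute node opens and closes its own connection.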
2) PUSH/PULL tcp sockets - Pieter suggested we look at ROUTER/DEALER
as an option once 1) is resolved - this socket type pairing has some
interesting features which would help with resilience and availability
including heartbeating.
Using PUSH/PULL does not eliminate the possibility of being fully
connected, nor is it incompatible with persistent connections. If
you're not going to be fully-connected, there isn't much advantage to
long-lived persistent connections and without those persistent
connections, you're not benefitting from features such as heartbeating.
How about REQ/REP? I think it is appropriate for long-lived persistent
connections and also provides reliability, since every request gets a reply.
I'm not saying ROUTER/DEALER cannot be used, but use them with care.
They're designed for long-lived channels between hosts and not for the
ephemeral-type connections used in a peer-to-peer system. Dealing with
how to manage timeouts on the client and the server and the swelling
number of active file descriptors that you'll get by using
ROUTER/DEALER is not trivial, assuming you can get past the management
of all of those synchronous sockets (hidden away by tons of eventlet
greenthreads)...
Extra anecdote: During a conversation at the OpenStack summit, someone
told me about their experiences using ZeroMQ and the pain of using
REQ/REP sockets and how they felt it was a mistake they used them. We
discussed a bit about some other problems such as the fact it's
impossible to avoid TCP fragmentation unless you force all frames to
552 bytes or have a well-managed network where you know the MTUs of
all the devices you'll pass through. Suggestions were made to make
ZeroMQ better, until we realized we had just described
TCP-over-ZeroMQ-over-TCP, finished our beers, and quickly changed topics.
Well, it seems I need to take my last question back. In our deployment, I
always take advantage of jumbo frames to increase throughput. You said
that REQ/REP would introduce TCP fragmentation unless ZeroMQ frames are
exactly 552 bytes? Could you please elaborate?
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev