On 2/18/20 7:12 AM, Anton Ivanov wrote:
>
> On 17/02/2020 14:48, Dumitru Ceara wrote:
>> On 2/14/20 6:54 PM, [email protected] wrote:
>>> From: Anton Ivanov <[email protected]>
>>>
>>> 1. Adds "persistent" behaviour where feasible (streams and signals).
>>> These are waited upon in the same thread where they are created. This
>>> allows them to be registered persistently with the OS (if possible)
>>> and lets the OS provide hints - is the FD ready, is it closed, etc.
>>>
>>> 2. Removes unnecessary attempts to read (and hit EAGAIN) on an fd
>>> which is not ready, if that fd has been registered as "private" to
>>> the thread which waits upon it.
>>>
>>> 3. No longer breaks other parts of OVS which create the fd in one
>>> thread and wait upon it in others.
>>>
>>> 4. Adds support for EPOLL on Linux; this can be expanded to cover
>>> similar poll++ frameworks in other OSes.
>>>
>>> 5. Sets up the necessary infrastructure to make IO/SSL multi-threaded
>>> using a "central (e)poll dispatcher + IO threads" pattern.
>>>
>>> Signed-off-by: Anton Ivanov <[email protected]>
>>
>> Hi Anton,
>>
>> A couple of issues inline. Apart from those:
>>
>> 1. The "STP - flush the fdb and mdb when topology changed" OVS test
>> fails with your patches applied:
>>
>> make check TESTSUITEFLAGS='-k "flush the fdb"'
>
> I will have a look.
>
>> 2. The Travis CI build fails:
>>
>> lib/fatal-signal.c:244:5: error: ignoring return value of ‘read’,
>> declared with attribute warn_unused_result [-Werror=unused-result]
>>
>> read(signal_fds[0], sigbuffer, sizeof(sigbuffer))
>>
>> 3. The Travis CI OSX build fails:
>>
>> lib/poll-loop.c:46:1: error: unused function 'poll_create_node_add'
>> [-Werror,-Wunused-function]
>>
>> COVERAGE_DEFINE(poll_create_node);
>
> I will fix all of these in the next version. The CI was out with the
> daisies when I submitted the patch last week, so I did not see the
> logs until yesterday.
>
>> 4. While OVS might benefit from these changes, I'm wondering about
>> OVN and ovsdb-server specifically. ovsdb-server is single-threaded,
>> and on large-scale deployments we don't usually see "poll" as the
>> bottleneck, nor the fact that the code tries to read/write FDs that
>> are not ready for read/write.
>>
>> For example, here are the results of running a scale-test scenario
>> which repeats the following iteration 300 times:
>> - Bring up a node (ovn-fake-multinode container) and connect it to
>>   the OVN Southbound DB.
>> - Configure an OVN logical switch to be bound to the new node.
>> - Configure an OVN logical switch port on the new logical switch.
>> - Configure an OVS internal interface on the new node and bind it to
>>   the OVN logical switch port.
>> - Wait until the new internal interface can ping its default gateway
>>   through OVN (i.e., until ovn-controller on the node has received
>>   all updates from the SB DB and installed all OVS flows); this
>>   duration is highlighted in the output.
>>
>> The tests use rally-ovs (ovn-scale-test) on a 9-server setup (1
>> machine running the OVN ovsdb-servers and ovn-northd, and 8 machines
>> simulating chassis using ovn-fake-multinode), in particular this
>> modified scenario:
>> https://github.com/dceara/ovn-scale-test/blob/ovn-switch-per-node/samples/tasks/scenarios/ovn-network/osh_workload_incremental.json
>>
>> With OVS master and OVN master:
>> http://pastebin.test.redhat.com/836568
>>
>> With OVS master + your patches and OVN master:
>> http://pastebin.test.redhat.com/836571
>>
>> Here are some of the logs we get on the OVN Southbound DB ovsdb-server
>> showing that ovsdb-server spends up to 2 seconds in a single loop
>> iteration sending/receiving updates to/from ovn-controllers:
>>
>> 2020-02-17T10:43:41.175Z|01991|poll_loop|INFO|wakeup due to
>> [OVS_POLLIN] on fd 140 (192.16.0.1:6642<->192.16.0.120:52018) at
>> lib/stream-fd.c:79 (84% CPU usage)
>> 2020-02-17T10:43:43.338Z|01992|timeval|WARN|Unreasonably long 2163ms
>> poll interval (2144ms user, 9ms system)
>> 2020-02-17T10:43:43.339Z|01993|timeval|WARN|faults: 590 minor, 0 major
>> 2020-02-17T10:43:43.339Z|01994|timeval|WARN|disk: 0 reads, 8 writes
>> 2020-02-17T10:43:43.339Z|01995|timeval|WARN|context switches: 0
>> voluntary, 4 involuntary
>> 2020-02-17T10:43:43.339Z|01996|poll_loop|INFO|Dropped 63 log messages
>> in last 2 seconds (most recently, 2 seconds ago) due to excessive rate
>> 2020-02-17T10:43:43.339Z|01997|poll_loop|INFO|wakeup due to
>> [OVS_POLLIN] on fd 76 (192.16.0.1:6642<->192.16.0.56:33538) at
>> lib/stream-fd.c:79 (84% CPU usage)
>> 2020-02-17T10:43:45.495Z|01998|timeval|WARN|Unreasonably long 2156ms
>> poll interval (2129ms user, 17ms system)
>> 2020-02-17T10:43:45.495Z|01999|timeval|WARN|faults: 738 minor, 0 major
>> 2020-02-17T10:43:45.495Z|02000|timeval|WARN|context switches: 0
>> voluntary, 7 involuntary
>> 2020-02-17T10:43:47.651Z|02001|timeval|WARN|Unreasonably long 2157ms
>> poll interval (2136ms user, 10ms system)
>>
>> In this case, and I think in most OVN use cases, ovsdb-server is busy
>> because it actually has to send updates to the large number of
>> ovn-controllers connected to it. Unless I'm missing something, the
>> epoll change seems to improve performance only in cases where the
>> Southbound DB doesn't do much sending/receiving.
>
> 1. Correct - it improves the handling and the cost of idle
> connections. At present each connection is a fixed cost on every
> iteration, regardless of whether it needs servicing or not.
>
> 2. It should also improve the cost of handling many peers on the send
> side, when there is enough outstanding data on the sockets to cause
> EAGAIN on send.
>
> 3. It does not fix the fundamental problem that the logic in ovsdb is
> single-threaded. It does, however, allow json+io+ssl to become
> multi-threaded, which should leave just the logic in the main
> ovsdb-server thread. You cannot do that effectively without
> information on the state of each socket and whether it needs
> servicing.
>
> 4. I have seen the rally tests - they are flat scale-up. I agree that
> the help from fixing the IO loop will be minimal there, because there
> is IO to be serviced on most fds at all times, so I would not expect
> these patches to help a lot in that scenario.
>
> The difference from fixing the IO (leaving aside that it is a
> prerequisite for moving SSL to worker threads) is not in scaling up,
> but in running at scale.
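
To make sure I understand the direction: below is a rough sketch of the
"central epoll dispatcher + IO worker threads" pattern as I read it.
The names are hypothetical, it is deliberately simplified (no
EPOLLONESHOT/re-arm handling, no bounded-queue checks), and it is not
the actual patch code. The point is that only the dispatcher waits on
the sockets, and workers only ever touch sockets the kernel reported as
ready.

    /* Sketch only -- hypothetical names, not OVS code. */
    #include <pthread.h>
    #include <sys/epoll.h>

    #define MAX_EVENTS 64
    #define QUEUE_LEN  1024        /* power of two; overflow unchecked */

    static int ready_fds[QUEUE_LEN];
    static unsigned int q_head, q_tail;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;

    /* Hand a ready socket from the dispatcher to the workers. */
    static void enqueue(int fd)
    {
        pthread_mutex_lock(&q_lock);
        ready_fds[q_tail++ % QUEUE_LEN] = fd;
        pthread_cond_signal(&q_cond);
        pthread_mutex_unlock(&q_lock);
    }

    static int dequeue(void)
    {
        pthread_mutex_lock(&q_lock);
        while (q_head == q_tail) {
            pthread_cond_wait(&q_cond, &q_lock);
        }
        int fd = ready_fds[q_head++ % QUEUE_LEN];
        pthread_mutex_unlock(&q_lock);
        return fd;
    }

    /* Expensive per-socket work (SSL records, JSON parsing). */
    void do_ssl_and_json_io(int fd);

    /* Register a socket once; from then on the kernel tracks it. */
    void register_socket(int epfd, int fd)
    {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    /* Dispatcher: the only thread that waits; it does no heavy IO. */
    void *dispatcher(void *arg)
    {
        int epfd = *(int *) arg;
        struct epoll_event events[MAX_EVENTS];

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                enqueue(events[i].data.fd);
            }
        }
    }

    /* IO worker: services only sockets reported as ready. */
    void *io_worker(void *arg)
    {
        (void) arg;
        for (;;) {
            do_ssl_and_json_io(dequeue());
        }
    }

A real implementation would presumably also need EPOLLONESHOT (or some
other re-arm scheme) so the same socket is not handed to two workers at
once, plus a way to feed the parsed input back to the single-threaded
ovsdb logic.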
The current "ram the EGAIN wall until it gives up" > design gives you a mandatory penalty per transaction deploying config to > one node while running (because of attempted failed reads on EAGAIN). > The correct test for that is the CPU cost and latency to deploy a > logical flow when running at steady state in f.e. 300 nodes.
Currently, when a logical flow is added to the SB DB for a logical
datapath (logical switch or logical router), all ovn-controllers that
consider the logical datapath "local" will receive the update from the
ovsdb SB. The local datapaths of an ovn-controller are:
- all logical switch datapaths that have ports bound to the local
  chassis.
- all logical datapaths corresponding to logical routers connected
  through OVN patch ports to local logical switches.

While it's true that in some deployments this will limit the number of
ovn-controllers receiving the update, in OVN-k8s, for example, all node
logical switches are connected to a single cluster logical router. Once
a logical flow is added to any of the node logical switches, all
ovn-controllers on all nodes will receive the update. This means that
there won't be too many idle connections during a SB DB logical flow
update.

I agree that the right way to test is to measure how long it takes for
a logical flow to be installed, which is what the rally-ovs tests do:
they measure how long it takes until the SB DB logical flow is
translated to OpenFlow by ovn-controller when a new node is brought up.
At iteration X+1 we are at a steady state (X nodes up and configured),
node X+1 is brought up, and its corresponding logical flows are added.

> The current design as written has that penalty as a given - it is
> unavoidable, and it also grows linearly (or worse) with size from
> other factors in addition to EAGAIN - e.g. you also start copying a
> large pollfd array back and forth to the kernel on every iteration.
>
> I have not seen any tests trying to quantify this on an actual
> cluster. All tests I have seen so far are scale-up and/or run a single
> synthetic application on the whole cluster.
>
> My own tests mostly take the code in question and run it outside of
> OVS/OVN on a harness.
>
>> How do you test performance/scalability improvements?
>>
>> Regards,
>> Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
