Ilya Maximets <i.maxim...@ovn.org> writes:

> On 6/7/22 11:39, Robin Jarry wrote:
>> Paolo Valerio, Jun 05, 2022 at 19:37:
>>> Just a note that may be useful.
>>> After some tests, I noticed that when establishing e.g. two TCP
>>> connections and leaving the first one idle after the 3-way handshake,
>>> once the second connection expires (after moving to TIME_WAIT as a
>>> result of termination), it doesn't get evicted until some event gets
>>> scheduled for the first.
>>>
>>> ovs-appctl dpctl/dump-conntrack -s
>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=84576,protoinfo=(state=ESTABLISHED)
>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>>
>>> This may be somewhat related to your results, as during the test the
>>> number of connections may reach the limit, apparently reducing the
>>> performance.
>> 
>> Indeed, there was an issue in my test procedure. Due to the way T-Rex
>> generates connections, it is easy to fill the conntrack table after
>> a few iterations, making the test results inconsistent.
>> 
>> Also, the flows which I had configured were not correct. There was an
>> extraneous action=NORMAL flow at the end. When the conntrack table is
>> full and a new packet cannot be tracked, it is marked as +trk+inv and
>> not dropped. This behaviour is specific to the userspace datapath. The
>> linux kernel datapath seems to drop the packet when it cannot be added
>> to connection tracking.
>> 
>> Gaëtan's series (v4) seems less resilient to the conntrack table being
>> full, especially when there is more than one PMD core.
>> 
>> I have changed the t-rex script to allow running arbitrary commands in
>> between traffic iterations. This is leveraged to flush the conntrack
>> table and run each iteration in the same conditions.
>> 
>> https://github.com/cisco-system-traffic-generator/trex-core/blob/v2.98/scripts/cps_ndr.py
>> 
>> To avoid filling the conntrack table, the max size was increased to 50M.
>> The DUT configuration can be summarized as follows:
>> 
>> ovs-vsctl set open_vswitch . other_config:dpdk-init=true
>> ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x15554"
>> ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
>> ovs-vsctl add-port br0 pf0 -- set Interface pf0 type=dpdk \
>>     options:dpdk-devargs=0000:3b:00.0 options:n_rxq=4 options:n_rxq_desc=4096
>> ovs-vsctl add-port br0 pf1 -- set Interface pf1 type=dpdk \
>>     options:dpdk-devargs=0000:3b:00.1 options:n_rxq=4 options:n_rxq_desc=4096
>> ovs-appctl dpctl/ct-set-maxconns 50000000
>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=-trk,actions=ct(table=0)"
>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+new,actions=ct(commit),NORMAL"
>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+est,actions=NORMAL"
>> ovs-ofctl add-flow br0 "table=0,priority=0,actions=drop"
>> 
>> Short Lived Connections
>> -----------------------
>> 
>> ./cps_ndr.py --sample-time 10 --max-iterations 8 --error-threshold 0.01 \
>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>     --udp-percent 1 --num-messages 1 --message-size 20 --server-wait 0 \
>>     --min-cps 10k --max-cps 600k
>> 
>> ============== =============== ============== ============== ===========
>> Series         Num. Flows      CPS            PPS            BPS
>> ============== =============== ============== ============== ===========
>> Baseline       40.1K           79.3K/s        556Kp/s        347Mb/s
>> Gaetan v1      60.5K           121K/s         837Kp/s        522Mb/s
>> Gaetan v4      61.4K           122K/s         850Kp/s        530Mb/s
>> Paolo          377K            756K/s         5.3Mp/s        3.3Gb/s
>> ============== =============== ============== ============== ===========
>> 
>> Even after fixing the test procedure, Paolo's series still performs
>> a lot better with short lived connections.
>> 
>> Long Lived Connections
>> ----------------------
>> 
>> ./cps_ndr.py --sample-time 30 --max-iterations 8 --error-threshold 0.01 \
>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>     --udp-percent 1 --num-messages 500 --message-size 20 --server-wait 50 \
>>     --min-cps 100 --max-cps 10k
>> 
>> ============== =============== ============== ============== ===========
>> Series         Num. Flows      CPS            PPS            BPS
>> ============== =============== ============== ============== ===========
>> Baseline       17.4K           504/s          633Kp/s        422Mb/s
>> Gaetan v1      80.4K           3.1K/s         4.6Mp/s        3.0Gb/s
>> Gaetan v4      139K            5.4K/s         8.2Mp/s        5.4Gb/s
>> Paolo          132K            5.2K/s         7.7Mp/s        5.2Gb/s
>> ============== =============== ============== ============== ===========
>> 
>> Thanks to Paolo for his help on this second round of tests.
>
> Hi, Robin and Paolo.
>

Hello Ilya,

thanks for the feedback about this.

> Could you try the following change on top of Gaetan's v1 and v4:
>
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index aa4792f69..2c2203a4b 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -633,14 +633,17 @@ conn_key_lookup(struct conntrack *ct, const struct conn_key *key,
>      bool found = false;
>  
>      CMAP_FOR_EACH_WITH_HASH (conn, cm_node, hash, &ct->conns) {
> -        if (!conn_key_cmp(&conn->key, key) && !conn_expired(conn, now)) {
> +        if (conn_expired(conn, now)) {
> +            continue;
> +        }
> +        if (!conn_key_cmp(&conn->key, key)) {
>              found = true;
>              if (reply) {
>                  *reply = false;
>              }
>              break;
>          }
> -        if (!conn_key_cmp(&conn->rev_key, key) && !conn_expired(conn, now)) {
> +        if (!conn_key_cmp(&conn->rev_key, key)) {
>              found = true;
>              if (reply) {
>                  *reply = true;
> ---
>
> ?
>
> Let me explain.  As far as I can tell without actually running performance
> tests myself, the main issue with both variants of Gaetan's series is a slow
> cleanup of expired connections.
>
> v4, IMO, contains a serious bug, because the expiration list is treated
> at the sweep phase as if it was sorted by the expiration time.  But that
> is not really true, because connections can not be re-ordered.  They
> seem to stay in the exact same order as they were created, but
> expiration times are getting updated when packets are received.
> However, the sweep process stops at the moment the first non-expired
> connection is seen and we're not cleaning up later connections as
> reported by Paolo.  Taking into account the fact that default timeout
> policy for established TCP connections appears to be 24 hours (or did I
> misread this?), closed connections can stay not cleaned up for that
> time.  That seems like way too much.  The point is, the whole queue
> should be inspected on every sweep iteration in order to re-sort it and
> actually move connections between different expiration lists to avoid
> silent established connections blocking closed ones.  This will
> probably be slow (?).  MPSC queue doesn't seem to be very suitable for
> such usecase.

I agree about the cause of the issue, and about the MPSC queue's
suitability as well, unless it is used in a different way.

Speaking of connections blocking other connections, there's a potential
problem in the current code related to zones and timeout policies.

tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),timeout=180,protoinfo=(state=TIME_WAIT)

In this case I set the tcp_end timeout to 200 seconds for zone 0 and to
20 seconds for zone 1.
The problem is less severe compared to the ct-scale series, but it could
(more or less significantly, depending on how the timeouts are
configured) increase the eviction latency of connections in the same
list.
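
Just to illustrate the point to myself (toy code, not OVS code; all
names and numbers below are made up), this is why a per-state list kept
in insertion order stops being ordered by expiration once different
timeouts apply within the same state, and why a sweep that stops at the
first non-expired entry then leaves already expired entries behind:

/* Toy illustration, not OVS code: a per-state list kept in insertion
 * order is not necessarily ordered by expiration once per-zone timeout
 * policies (or refreshed expirations) are involved, so a sweep that
 * stops at the first non-expired entry leaves later, already expired
 * entries behind.  All names and numbers are made up. */
#include <stdio.h>

struct toy_conn {
    long expiration;            /* Absolute expiration time, in seconds. */
};

/* Sweep that assumes the list is sorted by expiration time: it stops at
 * the first entry that has not expired yet. */
static int
sweep_stop_at_first_live(const struct toy_conn *list, int n, long now)
{
    int evicted = 0;

    for (int i = 0; i < n; i++) {
        if (list[i].expiration > now) {
            break;              /* Assumes everything after is live too. */
        }
        evicted++;
    }
    return evicted;
}

/* Sweep that inspects every entry, regardless of order. */
static int
sweep_full_scan(const struct toy_conn *list, int n, long now)
{
    int evicted = 0;

    for (int i = 0; i < n; i++) {
        evicted += list[i].expiration <= now;
    }
    return evicted;
}

int
main(void)
{
    /* Insertion order in the TIME_WAIT list: the zone 0 connection was
     * created first (tcp_end 200s), the zone 1 one second (tcp_end 20s). */
    struct toy_conn time_wait_list[] = { { 200 }, { 20 } };
    long now = 40;

    printf("stop-at-first-live evicts %d, full scan evicts %d\n",
           sweep_stop_at_first_live(time_wait_list, 2, now),
           sweep_full_scan(time_wait_list, 2, now));
    /* Prints 0 vs 1: the zone 1 connection stays around even though it
     * expired 20 seconds ago. */
    return 0;
}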

>
> v1.  Here the situation is better since threads are kind of allowed to
> reschedule on their own, but via RCU thread.  It looks like re-scheduled
> connections can be messed up between each other still though, because
> expiration can be updated multiple times, while re-insert is happening
> once.  And also these remove+push_back

which also makes me wonder if there's a chance of losing a timeout
transition.

> operations for different threads will be additionally re-ordered by the
> RCU thread.  So, the order is not really preserved here as well.  In
> order to fix that we need to replace push_back operation with the
> search+insert, i.e. find the place where the connection should be
> inserted.  And even that will not fully protect from the concurrent
> update of the expiration.  OTOH, not fully sorted list may be not a huge
> problem, because connections are still moved between different
> expiration lists, so ESTABLISHED connection can not block clean up of
> the closed one.
> The RCU thread though can be a serious bottleneck in this concept.  PMD
> threads are entering quiescent state only once per 1024 iterations or
> even more, so we can have 64K or more updates per thread per grace
> period.  So, expired or closed connections can stay for a prolonged
> period of time.

That's a good point.
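
Just to put rough numbers on that for myself (toy code, not OVS code;
the per-iteration figures are invented, e.g. two rxqs with 32-packet
bursts and one expiration update per packet):

/* Toy, single-threaded illustration of why deferring the list updates
 * to the RCU callback path can back up: a PMD-like loop defers one
 * update per processed packet but only reaches a quiescent point every
 * 1024 iterations, so the deferred work piles up for the whole
 * interval. */
#include <stdio.h>

int
main(void)
{
    const int iters_per_quiesce = 1024;   /* How often the thread quiesces. */
    const int pkts_per_iter = 64;         /* E.g. two rxqs x 32-packet bursts. */
    long deferred = 0;

    for (int iter = 1; iter <= 4 * iters_per_quiesce; iter++) {
        deferred += pkts_per_iter;        /* One deferred update per packet. */

        if (iter % iters_per_quiesce == 0) {
            /* Grace period ends (in this toy, immediately): only now is
             * the backlog processed. */
            printf("iteration %d: draining %ld deferred updates\n",
                   iter, deferred);
            deferred = 0;
        }
    }
    return 0;
}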

>
> So, in both v1 and v4 of Gaetan's series there are, probably, a lot of
> elements in a hash map and lookup is likely slow because of this.  Also,
> conn_key_cmp() is an extremely heavy operation, and we're performing it
> twice for every connection with the same hash.  Paolo's series cheats
> here by quickly skipping or even cleaning up expired connection during
> the lookup.  Gaetan's series' are not able to clean them, but can still
> quickly skip all expired ones without executing two costly comparisons.
> Hence the code snippet above.  Previously, conn_expired check required
> the global lock, but it doesn't with any of the series applied.
>
>
> In conclusion, I don't think that we can efficiently use MPSC queue
> directly.  At least, we need an ability to efficiently move connections
> between lists to avoid CLOSED connections being blocked by ESTABLISHED
> ones.  One option is to always look through the whole list and re-insert
> every connection.  It might be not a huge problem though.  It was a
> problem when we had to hold the global mutex for that.  If we don't need
> to hold the global mutex, we may allow that.

Some thoughts/questions about it:

Does this require going all the way through the list at once?
It's probably not a problem, but with many connections being created
and destroyed, isn't there a chance we keep walking the list for a long
time?
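
One way I could imagine bounding that cost is a budgeted sweep that
remembers where it left off between calls.  A toy sketch (made-up names,
not OVS code):

/* Toy sketch: inspect at most a fixed budget of entries per call and
 * remember where to resume, so a huge or churning table never keeps a
 * single sweep call busy for too long. */
#include <stdbool.h>
#include <stdio.h>

struct toy_conn {
    long expiration;
    bool evicted;
};

struct toy_sweeper {
    struct toy_conn *conns;
    int n;
    int next;                   /* Where the next call resumes. */
};

/* Inspect at most 'budget' connections, evicting the expired ones.
 * Returns the number of connections actually inspected. */
static int
toy_sweep_step(struct toy_sweeper *sw, long now, int budget)
{
    int inspected;

    for (inspected = 0; inspected < budget && sw->n > 0; inspected++) {
        struct toy_conn *conn = &sw->conns[sw->next];

        if (!conn->evicted && conn->expiration <= now) {
            conn->evicted = true;
        }
        sw->next = (sw->next + 1) % sw->n;
    }
    return inspected;
}

int
main(void)
{
    static struct toy_conn conns[10000];

    for (int i = 0; i < 10000; i++) {
        conns[i].expiration = i % 2 ? 10 : 1000;    /* Half already expired. */
    }

    struct toy_sweeper sw = { conns, 10000, 0 };

    /* Each call touches at most 2048 connections; five calls are enough
     * to cover the whole table once at time now=100. */
    for (int call = 0; call < 5; call++) {
        printf("call %d inspected %d connections\n",
               call, toy_sweep_step(&sw, 100, 2048));
    }
    return 0;
}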

The ESTABLISHED list is surely the most relevant one, but I wonder
whether this is only applicable to it.
I'm thinking of e.g. UDP_FIRST and UDP_MULTIPLE with custom timeouts.

Otherwise,

> If the list itself is too long, we may just have normal lists and use MPSC
> queue to enqueue "update" events for the ct_sweep thread the same way as what
> is done for RCU thread in v1.  Can be a possible solution?
>

I guess this involves some reordering (any update could) inside the
same list.

In that case we would have to handle both the events (and the
reordering, if we care about it) and the evictions at two separate
points in time.
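
Something like the following toy, single-threaded sketch is how I
picture it (made-up names, not OVS code; a real version would have the
PMD threads as producers on the MPSC queue): the sweep thread first
drains the queued updates to move/re-order connections between the
per-state lists, and only then evicts from those lists:

/* Toy, single-threaded sketch: the packet-processing side would only
 * enqueue small "update" events, and the sweep thread, as the sole
 * consumer, first drains them and only then evicts. */
#include <stdio.h>

enum toy_state { TOY_ESTABLISHED, TOY_TIME_WAIT, TOY_N_STATES };

struct toy_conn {
    int id;
    enum toy_state state;
    long expiration;
    struct toy_conn *next;          /* Link in the per-state list. */
};

struct toy_event {                  /* What a producer would enqueue. */
    struct toy_conn *conn;
    enum toy_state new_state;
    long new_expiration;
};

static struct toy_conn *lists[TOY_N_STATES];

static void
list_remove(struct toy_conn *conn)
{
    struct toy_conn **p = &lists[conn->state];

    while (*p && *p != conn) {
        p = &(*p)->next;
    }
    if (*p) {
        *p = conn->next;
    }
}

static void
list_push_back(struct toy_conn *conn)
{
    struct toy_conn **p = &lists[conn->state];

    while (*p) {
        p = &(*p)->next;
    }
    conn->next = NULL;
    *p = conn;
}

/* Phase 1: apply the queued updates, i.e. move/re-order connections. */
static void
apply_events(struct toy_event *events, int n)
{
    for (int i = 0; i < n; i++) {
        struct toy_conn *conn = events[i].conn;

        list_remove(conn);                  /* Still on the old list. */
        conn->state = events[i].new_state;
        conn->expiration = events[i].new_expiration;
        list_push_back(conn);
    }
}

/* Phase 2: evict expired connections from every list. */
static void
evict_expired(long now)
{
    for (int s = 0; s < TOY_N_STATES; s++) {
        struct toy_conn **p = &lists[s];

        while (*p) {
            if ((*p)->expiration <= now) {
                printf("evicting conn %d\n", (*p)->id);
                *p = (*p)->next;
            } else {
                p = &(*p)->next;
            }
        }
    }
}

int
main(void)
{
    struct toy_conn a = { 1, TOY_ESTABLISHED, 1000, NULL };
    struct toy_conn b = { 2, TOY_ESTABLISHED, 1000, NULL };

    list_push_back(&a);
    list_push_back(&b);

    /* A producer saw conn 2 terminate: it moved to TIME_WAIT with a
     * short timeout and must not stay stuck behind conn 1. */
    struct toy_event ev = { &b, TOY_TIME_WAIT, 20 };

    apply_events(&ev, 1);
    evict_expired(40);                      /* Evicts conn 2 only. */
    return 0;
}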

> rculist has a chance to be a viable option, would be great to look at
> performance numbers with the code snippet above.
>
> Paolo's series seems to not have any obvious issues, at least I didn't
> spot any on a quick glance.  At least, until buckets are balanced.  In
> some cases we may end up locking the same bucket from different threads
> frequently though.  Do you have the original and reply traffic being
> handled by the same or different PMD threads in your performance tests?
> Also, a lot of different connections is only one side of the
> performance.  Having a single traffic-heavy connection where original
> and reply directions are handled by 2 different PMD threads would be an
> important test as well.  I think, it is even closer to the real world
> cases.
>

I will get back to you on this and on v1/v4 plus your snippet (which
looks like a potentially measurable improvement) once some numbers are
available.

> Thoughts?
>
> Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to