On 6/7/22 11:39, Robin Jarry wrote:
> Paolo Valerio, Jun 05, 2022 at 19:37:
>> Just a note that may be useful.
>> After some tests, I noticed that when establishing e.g. two TCP
>> connections and leaving the first one idle after the 3-way handshake,
>> once the second connection expires (after moving to TIME_WAIT as
>> a result of termination), it doesn't get evicted until some event gets
>> scheduled for the first one.
>>
>> ovs-appctl dpctl/dump-conntrack -s
>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=84576,protoinfo=(state=ESTABLISHED)
>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>
>> This may be somewhat related to your results, as during the
>> test the number of connections may reach the limit, apparently
>> reducing performance.
> 
> Indeed, there was an issue in my test procedure. Due to the way T-Rex
> generates connections, it is easy to fill the conntrack table after
> a few iterations, making the test results inconsistent.
> 
> Also, the flows which I had configured were not correct. There was an
> extraneous action=NORMAL flow at the end. When the conntrack table is
> full and a new packet cannot be tracked, it is marked as +trk+inv and
> not dropped. This behaviour is specific to the userspace datapath. The
> Linux kernel datapath seems to drop the packet when it cannot be added
> to connection tracking.
> 
> Gaëtan's series (v4) seems less resilient to the conntrack table being
> full, especially when there is more than one PMD core.
> 
> I have changed the T-Rex script to allow running arbitrary commands in
> between traffic iterations. This is leveraged to flush the conntrack
> table and run each iteration in the same conditions.
> 
> https://github.com/cisco-system-traffic-generator/trex-core/blob/v2.98/scripts/cps_ndr.py
> 
> To avoid filling the conntrack table, the max size was increased to 50M.
> The DUT configuration can be summarized as follows:
> 
> ovs-vsctl set open_vswitch . other_config:dpdk-init=true
> ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x15554"
> ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
> ovs-vsctl add-port br0 pf0 -- set Interface pf0 type=dpdk \
>     options:dpdk-devargs=0000:3b:00.0 options:n_rxq=4 options:n_rxq_desc=4096
> ovs-vsctl add-port br0 pf1 -- set Interface pf1 type=dpdk \
>     options:dpdk-devargs=0000:3b:00.1 options:n_rxq=4 options:n_rxq_desc=4096
> ovs-appctl dpctl/ct-set-maxconns 50000000
> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=-trk,actions=ct(table=0)"
> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+new,actions=ct(commit),NORMAL"
> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+est,actions=NORMAL"
> ovs-ofctl add-flow br0 "table=0,priority=0,actions=drop"
> 
> Short Lived Connections
> -----------------------
> 
> ./cps_ndr.py --sample-time 10 --max-iterations 8 --error-threshold 0.01 \
>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>     --udp-percent 1 --num-messages 1 --message-size 20 --server-wait 0 \
>     --min-cps 10k --max-cps 600k
> 
> ============== =============== ============== ============== ===========
> Series         Num. Flows      CPS            PPS            BPS
> ============== =============== ============== ============== ===========
> Baseline       40.1K           79.3K/s        556Kp/s        347Mb/s
> Gaetan v1      60.5K           121K/s         837Kp/s        522Mb/s
> Gaetan v4      61.4K           122K/s         850Kp/s        530Mb/s
> Paolo          377K            756K/s         5.3Mp/s        3.3Gb/s
> ============== =============== ============== ============== ===========
> 
> Even after fixing the test procedure, Paolo's series still performs
> a lot better with short lived connections.
> 
> Long Lived Connections
> ----------------------
> 
> ./cps_ndr.py --sample-time 30 --max-iterations 8 --error-threshold 0.01 \
>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>     --udp-percent 1 --num-messages 500 --message-size 20 --server-wait 50 \
>     --min-cps 100 --max-cps 10k
> 
> ============== =============== ============== ============== ===========
> Series         Num. Flows      CPS            PPS            BPS
> ============== =============== ============== ============== ===========
> Baseline       17.4K           504/s          633Kp/s        422Mb/s
> Gaetan v1      80.4K           3.1K/s         4.6Mp/s        3.0Gb/s
> Gaetan v4      139K            5.4K/s         8.2Mp/s        5.4Gb/s
> Paolo          132K            5.2K/s         7.7Mp/s        5.2Gb/s
> ============== =============== ============== ============== ===========
> 
> Thanks to Paolo for his help on this second round of tests.

Hi, Robin and Paolo.

Could you try the following change on top of Gaetan's v1 and v4:

diff --git a/lib/conntrack.c b/lib/conntrack.c
index aa4792f69..2c2203a4b 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -633,14 +633,17 @@ conn_key_lookup(struct conntrack *ct, const struct conn_key *key,
     bool found = false;
 
     CMAP_FOR_EACH_WITH_HASH (conn, cm_node, hash, &ct->conns) {
-        if (!conn_key_cmp(&conn->key, key) && !conn_expired(conn, now)) {
+        if (conn_expired(conn, now)) {
+            continue;
+        }
+        if (!conn_key_cmp(&conn->key, key)) {
             found = true;
             if (reply) {
                 *reply = false;
             }
             break;
         }
-        if (!conn_key_cmp(&conn->rev_key, key) && !conn_expired(conn, now)) {
+        if (!conn_key_cmp(&conn->rev_key, key)) {
             found = true;
             if (reply) {
                 *reply = true;
---

?

Let me explain.  As far as I can tell without actually running performance
tests myself, the main issue with both variants of Gaetan's series is the
slow cleanup of expired connections.

v4, IMO, contains a serious bug: the expiration list is treated at the
sweep phase as if it were sorted by expiration time.  But that is not
really true, because connections cannot be re-ordered.  They stay in the
exact order in which they were created, while their expiration times get
updated as packets are received.  The sweep process, however, stops at
the first non-expired connection it sees, so later expired connections
are never cleaned up, as reported by Paolo.  Taking into account that the
default timeout policy for established TCP connections appears to be
24 hours (or did I misread this?), closed connections can remain
uncleaned for that long.  That seems like way too much.  The point is,
the whole queue should be inspected on every sweep iteration in order to
re-sort it and actually move connections between different expiration
lists, so that silent established connections do not block closed ones.
This will probably be slow (?).  An MPSC queue doesn't seem very suitable
for such a use case.
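
To make the failure mode concrete, here is a purely schematic C sketch
(invented names, not the actual code from the series) of a sweep that
stops at the first non-expired entry:

struct conn_node {
    struct conn_node *next;
    long long expiration;   /* Refreshed whenever a packet is seen. */
};

/* The list is ordered by creation time, not by expiration:
 *   A: created first, idle ESTABLISHED, expiration = now + 24h
 *   B: created later, closed,           expiration <= now
 * A sweep that stops at the first non-expired entry never reaches B. */
static void
sweep_broken(struct conn_node *head, long long now)
{
    for (struct conn_node *conn = head; conn; conn = conn->next) {
        if (conn->expiration > now) {
            break;   /* BUG: expired B is stuck behind refreshed A. */
        }
        /* evict(conn); */
    }
}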

v1.  Here the situation is better, since threads are sort of allowed to
reschedule connections on their own, albeit via the RCU thread.  It looks
like re-scheduled connections can still end up mis-ordered relative to
each other though, because the expiration can be updated multiple times
while the re-insert happens only once.  Also, these remove+push_back
operations from different threads will be additionally re-ordered by the
RCU thread.  So, the order is not really preserved here either.  In order
to fix that, we would need to replace the push_back operation with
a search+insert, i.e. find the place where the connection should be
inserted.  And even that would not fully protect against a concurrent
update of the expiration.  OTOH, a not fully sorted list may not be
a huge problem, because connections are still moved between different
expiration lists, so an ESTABLISHED connection cannot block cleanup of
a closed one.

The RCU thread, though, can be a serious bottleneck in this design.  PMD
threads enter a quiescent state only once per 1024 iterations, or even
less often, so we can have 64K or more updates per thread per grace
period.  So, expired or closed connections can stay around for
a prolonged period of time.
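
For reference, the search+insert mentioned above could look roughly like
this (schematic, with invented structures, and note it still doesn't
protect against a concurrent update of the expiration):

struct exp_node {
    struct exp_node *prev, *next;
    long long expiration;
};

struct exp_list {
    struct exp_node head;   /* Sentinel; head.next expires soonest. */
};

/* Walk back from the tail to find the last entry that expires no later
 * than 'node', and insert 'node' right after it, keeping the list
 * sorted by expiration. */
static void
exp_list_insert_sorted(struct exp_list *list, struct exp_node *node)
{
    struct exp_node *pos = list->head.prev;

    while (pos != &list->head && pos->expiration > node->expiration) {
        pos = pos->prev;
    }
    node->prev = pos;
    node->next = pos->next;
    pos->next->prev = node;
    pos->next = node;
}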

So, in both v1 and v4 of Gaetan's series there are probably a lot of
stale elements in the hash map, and lookup is likely slow because of
this.  Also, conn_key_cmp() is an extremely heavy operation, and we're
performing it twice for every connection with the same hash.  Paolo's
series cheats here by quickly skipping, or even cleaning up, expired
connections during the lookup.  Gaetan's series are not able to clean
them up, but they can still quickly skip all the expired ones without
executing the two costly comparisons.  Hence the code snippet above.
Previously, the conn_expired() check required the global lock, but it
doesn't with either of the series applied.
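
Such a check is only cheap and safe to do lock-free if reading the
expiration itself is atomic.  A minimal standalone sketch of the idea
(not the exact OVS types):

#include <stdatomic.h>
#include <stdbool.h>

struct conn_lite {
    _Atomic long long expiration;   /* Updated by packet processing. */
};

static inline bool
conn_expired_lite(struct conn_lite *conn, long long now)
{
    /* A relaxed load is enough here: a slightly stale value only
     * delays the skip/eviction by one lookup. */
    return atomic_load_explicit(&conn->expiration,
                                memory_order_relaxed) <= now;
}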


In conclusion, I don't think that we can efficiently use the MPSC queue
directly.  At the very least, we need the ability to efficiently move
connections between lists, to avoid CLOSED connections being blocked by
ESTABLISHED ones.  One option is to always look through the whole list
and re-insert every connection.  That might not be a huge problem,
though: it was a problem when we had to hold the global mutex for it, but
if we don't need to hold the global mutex, we may allow that.  If the
list itself is too long, we may just keep normal lists and use the MPSC
queue to enqueue "update" events for the ct_sweep thread, the same way it
is done for the RCU thread in v1; a rough sketch follows.  Could that be
a possible solution?
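
Roughly, with invented names (loosely modeled on the mpsc-queue and util
helpers from the tree, so treat it as a sketch, not a patch):

struct ct_update_event {
    struct mpsc_queue_node node;    /* Intrusive queue linkage. */
    struct conn *conn;
    long long new_expiration;
};

/* Producer side: any PMD thread, after refreshing a connection. */
static void
ct_sweep_notify(struct mpsc_queue *queue, struct conn *conn,
                long long new_expiration)
{
    struct ct_update_event *ev = xmalloc(sizeof *ev);

    ev->conn = conn;
    ev->new_expiration = new_expiration;
    mpsc_queue_insert(queue, &ev->node);
}

/* Consumer side: ct_sweep would then be the only thread touching the
 * plain, unsynchronized expiration lists. */
static void
ct_sweep_drain(struct mpsc_queue *queue)
{
    struct mpsc_queue_node *node;

    while ((node = mpsc_queue_pop(queue))) {
        struct ct_update_event *ev
            = CONTAINER_OF(node, struct ct_update_event, node);

        /* Move ev->conn to its proper place in the per-state
         * expiration list, e.g.:
         * exp_list_reschedule(ev->conn, ev->new_expiration); */
        free(ev);
    }
}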

rculist has a chance of being a viable option; it would be great to look
at performance numbers with the code snippet above applied.

Paolo's series seems to not have any obvious issues; at least I didn't
spot any at a quick glance, as long as the buckets stay balanced.  In
some cases we may end up frequently locking the same bucket from
different threads, though.  Do you have the original and reply traffic
handled by the same or by different PMD threads in your performance
tests?  Also, a lot of different connections is only one side of the
performance story.  Having a single traffic-heavy connection whose
original and reply directions are handled by 2 different PMD threads
would be an important test as well.  I think it is even closer to
real-world cases.

Thoughts?

Best regards, Ilya Maximets.