On 6/20/22 23:57, Paolo Valerio wrote:
> Ilya Maximets <[email protected]> writes:
> 
>> On 6/7/22 11:39, Robin Jarry wrote:
>>> Paolo Valerio, Jun 05, 2022 at 19:37:
>>>> Just a note that may be useful.
>>>> After some tests, I noticed that when establishing e.g. two TCP connections
>>>> and leaving the first one idle after the 3-way handshake, once the second
>>>> connection expires (after moving to TIME_WAIT as a result of termination),
>>>> it doesn't get evicted until some event gets scheduled for the first one.
>>>>
>>>> ovs-appctl dpctl/dump-conntrack -s
>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=84576,protoinfo=(state=ESTABLISHED)
>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>>>
>>>> This may be somewhat related to your results: during the test, the number
>>>> of connections may reach the limit, apparently reducing performance.
>>>
>>> Indeed, there was an issue in my test procedure. Due to the way T-Rex
>>> generates connections, it is easy to fill the conntrack table after
>>> a few iterations, making the test results inconsistent.
>>>
>>> Also, the flows which I had configured were not correct. There was an
>>> extraneous action=NORMAL flow at the end. When the conntrack table is
>>> full and a new packet cannot be tracked, it is marked as +trk+inv and
>>> not dropped. This behaviour is specific to the userspace datapath. The
>>> Linux kernel datapath seems to drop the packet when it cannot be added
>>> to connection tracking.
>>>
>>> Gaëtan's series (v4) seems less resilient to the conntrack table being
>>> full, especially when there is more than one PMD core.
>>>
>>> I have changed the T-Rex script to allow running arbitrary commands in
>>> between traffic iterations. This is leveraged to flush the conntrack
>>> table and run each iteration in the same conditions.
>>>
>>> https://github.com/cisco-system-traffic-generator/trex-core/blob/v2.98/scripts/cps_ndr.py
>>>
>>> To avoid filling the conntrack table, the max size was increased to 50M.
>>> The DUT configuration can be summarized as follows:
>>>
>>> ovs-vsctl set open_vswitch . other_config:dpdk-init=true
>>> ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x15554"
>>> ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
>>> ovs-vsctl add-port br0 pf0 -- set Interface pf0 type=dpdk \
>>>     options:dpdk-devargs=0000:3b:00.0 options:n_rxq=4 \
>>>     options:n_rxq_desc=4096
>>> ovs-vsctl add-port br0 pf1 -- set Interface pf1 type=dpdk \
>>>     options:dpdk-devargs=0000:3b:00.1 options:n_rxq=4 \
>>>     options:n_rxq_desc=4096
>>> ovs-appctl dpctl/ct-set-maxconns 50000000
>>> ovs-ofctl add-flow br0 \
>>>     "table=0,priority=10,ip,ct_state=-trk,actions=ct(table=0)"
>>> ovs-ofctl add-flow br0 \
>>>     "table=0,priority=10,ip,ct_state=+trk+new,actions=ct(commit),NORMAL"
>>> ovs-ofctl add-flow br0 \
>>>     "table=0,priority=10,ip,ct_state=+trk+est,actions=NORMAL"
>>> ovs-ofctl add-flow br0 "table=0,priority=0,actions=drop"
>>>
>>> Short Lived Connections
>>> -----------------------
>>>
>>> ./cps_ndr.py --sample-time 10 --max-iterations 8 --error-threshold 0.01 \
>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>     --udp-percent 1 --num-messages 1 --message-size 20 --server-wait 0 \
>>>     --min-cps 10k --max-cps 600k
>>>
>>> ============== =============== ============== ============== ===========
>>> Series         Num. Flows      CPS            PPS            BPS
>>> ============== =============== ============== ============== ===========
>>> Baseline       40.1K           79.3K/s        556Kp/s        347Mb/s
>>> Gaetan v1      60.5K           121K/s         837Kp/s        522Mb/s
>>> Gaetan v4      61.4K           122K/s         850Kp/s        530Mb/s
>>> Paolo          377K            756K/s         5.3Mp/s        3.3Gb/s
>>> ============== =============== ============== ============== ===========
>>>
>>> Even after fixing the test procedure, Paolo's series still performs
>>> a lot better with short lived connections.
>>>
>>> Long Lived Connections
>>> ----------------------
>>>
>>> ./cps_ndr.py --sample-time 30 --max-iterations 8 --error-threshold 0.01 \
>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>     --udp-percent 1 --num-messages 500 --message-size 20 --server-wait 50 \
>>>     --min-cps 100 --max-cps 10k
>>>
>>> ============== =============== ============== ============== ===========
>>> Series         Num. Flows      CPS            PPS            BPS
>>> ============== =============== ============== ============== ===========
>>> Baseline       17.4K           504/s          633Kp/s        422Mb/s
>>> Gaetan v1      80.4K           3.1K/s         4.6Mp/s        3.0Gb/s
>>> Gaetan v4      139K            5.4K/s         8.2Mp/s        5.4Gb/s
>>> Paolo          132K            5.2K/s         7.7Mp/s        5.2Gb/s
>>> ============== =============== ============== ============== ===========
>>>
>>> Thanks to Paolo for his help on this second round of tests.
>>
>> Hi, Robin and Paolo.
>>
> 
> Hello Ilya,
> 
> thanks for the feedback about this.
> 
>> Could you try the following change on top of Gaetan's v1 and v4:
>>
>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>> index aa4792f69..2c2203a4b 100644
>> --- a/lib/conntrack.c
>> +++ b/lib/conntrack.c
>> @@ -633,14 +633,17 @@ conn_key_lookup(struct conntrack *ct, const struct conn_key *key,
>>      bool found = false;
>>  
>>      CMAP_FOR_EACH_WITH_HASH (conn, cm_node, hash, &ct->conns) {
>> -        if (!conn_key_cmp(&conn->key, key) && !conn_expired(conn, now)) {
>> +        if (conn_expired(conn, now)) {
>> +            continue;
>> +        }
>> +        if (!conn_key_cmp(&conn->key, key)) {
>>              found = true;
>>              if (reply) {
>>                  *reply = false;
>>              }
>>              break;
>>          }
>> -        if (!conn_key_cmp(&conn->rev_key, key) && !conn_expired(conn, now)) {
>> +        if (!conn_key_cmp(&conn->rev_key, key)) {
>>              found = true;
>>              if (reply) {
>>                  *reply = true;
>> ---
>>
>> ?
>>
>> Let me explain.  As far as I can tell without actually running performance
>> tests myself, the main issue with both variants of Gaetan's series is a slow
>> cleanup of expired connections.
>>
>> v4, IMO, contains a serious bug, because the expiration list is treated at
>> the sweep phase as if it were sorted by expiration time.  That is not really
>> true, because connections cannot be re-ordered.  They seem to stay in the
>> exact same order in which they were created, but their expiration times get
>> updated when packets are received.  However, the sweep process stops the
>> moment the first non-expired connection is seen, so later connections are
>> not cleaned up, as reported by Paolo.  Taking into account that the default
>> timeout policy for established TCP connections appears to be 24 hours (or
>> did I misread this?), closed connections can stay uncleaned for that long.
>> That seems like way too much.  The point is, the whole queue should be
>> inspected on every sweep iteration in order to re-sort it and actually move
>> connections between different expiration lists, so that silent established
>> connections do not block closed ones.  This will probably be slow (?).  An
>> MPSC queue doesn't seem to be very suitable for such a use case.
> 
> I agree with the cause of the issue, and about the MPSC suitability as
> well, unless the queues are used in a different way.
> 
> Speaking of connections blocking other connections, there's a potential
> problem in the current code and it is related to zones and timeout
> policy.
> 
> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),timeout=180,protoinfo=(state=TIME_WAIT)
> 
> In this case I set the tcp_end timeout to 200 seconds for zone 0, and to 20
> seconds for zone 1.
> The problem is less severe compared to the ct-scale series, but it could
> (more or less significantly, depending on how the timeouts are configured)
> increase the eviction latency of connections in the same list.

Yep.  I noticed that issue too.  It is a bug in the current implementation.
Expiration lists should not be global, but per-zone, AFAICT.
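Very roughly, something along these lines, i.e. keying the expiration lists
by zone instead of keeping one global set.  This is only a sketch: the
structure, the 'exp_lists_by_zone' member and the helper name are made up
for illustration, not actual conntrack code.

/* Illustrative only: one set of expiration lists per conntrack zone.
 * Uses OVS' hmap/ovs_list helpers; 'ct->exp_lists_by_zone' would be a new
 * (hypothetical) struct hmap member of struct conntrack.  Locking omitted. */
struct zone_exp_lists {
    struct hmap_node node;           /* In 'ct->exp_lists_by_zone'. */
    uint16_t zone;
    struct ovs_list lists[N_CT_TM];  /* One FIFO per timeout class. */
};

static struct zone_exp_lists *
exp_lists_for_zone(struct conntrack *ct, uint16_t zone)
{
    uint32_t hash = hash_int(zone, 0);
    struct zone_exp_lists *z;

    HMAP_FOR_EACH_WITH_HASH (z, node, hash, &ct->exp_lists_by_zone) {
        if (z->zone == zone) {
            return z;
        }
    }
    z = xzalloc(sizeof *z);
    z->zone = zone;
    for (int i = 0; i < N_CT_TM; i++) {
        ovs_list_init(&z->lists[i]);
    }
    hmap_insert(&ct->exp_lists_by_zone, &z->node, hash);
    return z;
}

That way a zone with short custom timeouts can't be starved by a zone with
long ones.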

> 
>>
>> v1.  Here the situation is better, since threads are kind of allowed to
>> reschedule on their own, but via the RCU thread.  It looks like re-scheduled
>> connections can still be messed up relative to each other, because the
>> expiration can be updated multiple times while the re-insert happens only
>> once.  And also these remove+push_back
> 
> which also makes me wonder if there's a chance of losing a timeout
> transition.
> 
>> operations for different threads will be additionally re-ordered by the RCU
>> thread.  So, the order is not really preserved here either.  In order to fix
>> that, we would need to replace the push_back operation with search+insert,
>> i.e. find the place where the connection should be inserted.  And even that
>> will not fully protect against concurrent updates of the expiration.  OTOH,
>> a not-fully-sorted list may not be a huge problem, because connections are
>> still moved between different expiration lists, so an ESTABLISHED connection
>> cannot block the cleanup of a closed one.
>> The RCU thread, though, can be a serious bottleneck in this concept.  PMD
>> threads enter the quiescent state only once per 1024 iterations or even less
>> often, so we can have 64K or more updates per thread per grace period.  So,
>> expired or closed connections can stay around for a prolonged period of time.
> 
> That's a good point.
> 
>>
>> So, in both v1 and v4 of Gaetan's series there are, probably, a lot of
>> elements in the hash map, and the lookup is likely slow because of this.
>> Also, conn_key_cmp() is an extremely heavy operation, and we're performing
>> it twice for every connection with the same hash.  Paolo's series cheats
>> here by quickly skipping or even cleaning up expired connections during the
>> lookup.  Gaetan's series are not able to clean them, but can still quickly
>> skip all expired ones without executing two costly comparisons.  Hence the
>> code snippet above.  Previously, the conn_expired check required the global
>> lock, but it doesn't with any of the series applied.
>>
>>
>> In conclusion, I don't think that we can efficiently use an MPSC queue
>> directly.  At the very least, we need the ability to efficiently move
>> connections between lists to avoid CLOSED connections being blocked by
>> ESTABLISHED ones.  One option is to always look through the whole list and
>> re-insert every connection.  That might not be a huge problem, though.  It
>> was a problem when we had to hold the global mutex for that; if we don't
>> need to hold the global mutex, we may allow it.
> 
> Some thoughts/questions about it:
> 
> does this require going all the way through the list at once?
> It's probably not a problem, but with many connections being created and
> destroyed, isn't there a chance we keep going through the list for a long
> time?

I'm not sure if that is a problem or not, but I don't really like that
implementation either way.  Such extensive and awkward re-queueing will
not look great.

> 
> For sure it is the most relevant case, but I wonder whether this is only
> applicable to the ESTABLISHED list.
> I'm thinking of e.g. UDP_FIRST and UDP_MULTIPLE with custom timeouts.

I don't think it's specific to the ESTABLISHED list.  It's just more
pronounced on this one because of the dramatic difference in timeout
values.  In general, entries in a list should all have the same
timeout, otherwise we may fall into some nasty corner cases.
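Just to illustrate that invariant (a sketch only; 'exp_node', 'expiration'
and the helper name are illustrative, not the actual conn fields): if every
entry in a given list uses the same timeout and is pushed to the tail
whenever its expiration is refreshed, insertion order equals expiration
order, so a sweep can safely stop at the first non-expired entry.

/* Illustrative only.  With a single 'timeout_ms' per list, the tail always
 * holds the latest expiration.  The caller is expected to have unlinked
 * 'conn' from its previous list; locking/RCU details are omitted. */
static void
conn_schedule_expiration(struct ovs_list *list, struct conn *conn,
                         long long now, long long timeout_ms)
{
    conn->expiration = now + timeout_ms;
    ovs_list_push_back(list, &conn->exp_node);
}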

> 
> Otherwise,
> 
>> If the list itself is too long, we may just have normal lists and use an
>> MPSC queue to enqueue "update" events for the ct_sweep thread, the same way
>> as is done for the RCU thread in v1.  Could that be a possible solution?
>>
> 
> I guess this involves some reordering (any update could) inside the same
> list.
> 
> In that case we have to handle both the events (and the reordering, if we
> care), and the evictions, at two separate points in time.

I'm not sure the strict order is required.  A relatively ordered
list, where all re-scheduled entries are always after non-rescheduled
ones, might be sufficient if we're not looking for extreme accuracy.
This should work fine, I guess, as long as we re-schedule everything
that was updated within, let's say, the last second, and as long as all
connections in the same list have exactly the same timeout value.
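As a sketch of what I mean (names are illustrative and locking is omitted;
'exp_updated' stands for whatever mechanism marks an entry whose expiration
was bumped since the last sweep):

/* Walk one expiration list.  Assumes all entries share the same timeout, so
 * once we hit an entry that is neither expired nor recently updated,
 * everything after it expires even later and we can stop. */
static void
sweep_list(struct conntrack *ct, struct ovs_list *list, long long now)
{
    struct conn *conn, *next;

    LIST_FOR_EACH_SAFE (conn, next, exp_node, list) {
        if (now >= conn->expiration) {
            conn_clean(ct, conn);                      /* Evict expired entry. */
        } else if (conn->exp_updated) {
            conn->exp_updated = false;
            ovs_list_remove(&conn->exp_node);          /* Re-schedule: move it */
            ovs_list_push_back(list, &conn->exp_node); /* behind the others.   */
        } else {
            break;
        }
    }
}

Recently updated entries end up behind the non-updated ones, so the list
stays "relatively" ordered without ever fully re-sorting it.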

> 
>> rculist has a chance of being a viable option; it would be great to look at
>> performance numbers with the code snippet above.
>>
>> Paolo's series seems not to have any obvious issues, at least I didn't spot
>> any at a quick glance.  At least, as long as the buckets stay balanced.  In
>> some cases we may end up locking the same bucket from different threads
>> frequently, though.
>> Do you have the original and reply traffic handled by the same or different
>> PMD threads in your performance tests?  Also, a lot of different connections
>> is only one side of the performance.  Having a single traffic-heavy
>> connection where the original and reply directions are handled by 2
>> different PMD threads would be an important test as well.  I think it is
>> even closer to real-world cases.
>>
> 
> I will get back on this, and on v1/v4 plus your snippet (which seems
> potentially to be a measurable improvement), once some numbers are available.
> 
>> Thoughts?
>>
>> Best regards, Ilya Maximets.
> 
