Ilya Maximets <[email protected]> writes:

> On 6/20/22 23:57, Paolo Valerio wrote:
>> Ilya Maximets <[email protected]> writes:
>> 
>>> On 6/7/22 11:39, Robin Jarry wrote:
>>>> Paolo Valerio, Jun 05, 2022 at 19:37:
>>>>> Just a note that may be useful.
>>>>> After some tests, I noticed that establishing e.g. two TCP connections,
>>>>> and leaving the first one idle after 3whs, once the second connection
>>>>> expires (after moving to TIME_WAIT as a result of termination), the
>>>>> second doesn't get evicted until any event gets scheduled for the first.
>>>>>
>>>>> ovs-appctl dpctl/dump-conntrack -s
>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=84576,protoinfo=(state=ESTABLISHED)
>>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>>>>
>>>>> This may be somewhat related to your results as during the
>>>>> test, the number of connections may reach the limit so apparently reducing
>>>>> the performances.
>>>>
>>>> Indeed, there was an issue in my test procedure. Due to the way T-Rex
>>>> generates connections, it is easy to fill the conntrack table after
>>>> a few iterations, making the test results inconsistent.
>>>>
>>>> Also, the flows which I had configured were not correct. There was an
>>>> extraneous action=NORMAL flow at the end. When the conntrack table is
>>>> full and a new packet cannot be tracked, it is marked as +trk+inv and
>>>> not dropped. This behaviour is specific to the userspace datapath. The
>>>> linux kernel datapath seems to drop the packet when it cannot be added
>>>> to connection tracking.
>>>>
>>>> Gaëtan's series (v4) seems less resilient to the conntrack table being
>>>> full, especially when there is more than one PMD core.
>>>>
>>>> I have changed the t-rex script to allow running arbitrary commands in
>>>> between traffic iterations. This is leveraged to flush the conntrack
>>>> table and run each iteration in the same conditions.
>>>>
>>>> https://github.com/cisco-system-traffic-generator/trex-core/blob/v2.98/scripts/cps_ndr.py
>>>>
>>>> To avoid filling the conntrack table, the max size was increased to 50M.
>>>> The DUT configuration can be summarized as the following:
>>>>
>>>> ovs-vsctl set open_vswitch . other_config:dpdk-init=true
>>>> ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x15554"
>>>> ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
>>>> ovs-vsctl add-port br0 pf0 -- set Interface pf0 type=dpdk \
>>>>     options:dpdk-devargs=0000:3b:00.0 options:n_rxq=4 \
>>>>     options:n_rxq_desc=4096
>>>> ovs-vsctl add-port br0 pf1 -- set Interface pf1 type=dpdk \
>>>>     options:dpdk-devargs=0000:3b:00.1 options:n_rxq=4 \
>>>>     options:n_rxq_desc=4096
>>>> ovs-appctl dpctl/ct-set-maxconns 50000000
>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=-trk,actions=ct(table=0)"
>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+new,actions=ct(commit),NORMAL"
>>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+est,actions=NORMAL"
>>>> ovs-ofctl add-flow br0 "table=0,priority=0,actions=drop"
>>>>
>>>> Short Lived Connections
>>>> -----------------------
>>>>
>>>> ./cps_ndr.py --sample-time 10 --max-iterations 8 --error-threshold 0.01 \
>>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>>     --udp-percent 1 --num-messages 1 --message-size 20 --server-wait 0 \
>>>>     --min-cps 10k --max-cps 600k
>>>>
>>>> ============== =============== ============== ============== ===========
>>>> Series         Num. Flows      CPS            PPS            BPS
>>>> ============== =============== ============== ============== ===========
>>>> Baseline       40.1K           79.3K/s        556Kp/s        347Mb/s
>>>> Gaetan v1      60.5K           121K/s         837Kp/s        522Mb/s
>>>> Gaetan v4      61.4K           122K/s         850Kp/s        530Mb/s
>>>> Paolo          377K            756K/s         5.3Mp/s        3.3Gb/s
>>>> ============== =============== ============== ============== ===========
>>>>
>>>> Even after fixing the test procedure, Paolo's series still performs
>>>> a lot better with short lived connections.
>>>>
>>>> Long Lived Connections
>>>> ----------------------
>>>>
>>>> ./cps_ndr.py --sample-time 30 --max-iterations 8 --error-threshold 0.01 \
>>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>>     --udp-percent 1 --num-messages 500 --message-size 20 --server-wait 50 \
>>>>     --min-cps 100 --max-cps 10k
>>>>
>>>> ============== =============== ============== ============== ===========
>>>> Series         Num. Flows      CPS            PPS            BPS
>>>> ============== =============== ============== ============== ===========
>>>> Baseline       17.4K           504/s          633Kp/s        422Mb/s
>>>> Gaetan v1      80.4K           3.1K/s         4.6Mp/s        3.0Gb/s
>>>> Gaetan v4      139K            5.4K/s         8.2Mp/s        5.4Gb/s
>>>> Paolo          132K            5.2K/s         7.7Mp/s        5.2Gb/s
>>>> ============== =============== ============== ============== ===========
>>>>
>>>> Thanks to Paolo for his help on this second round of tests.
>>>
>>> Hi, Robin and Paolo.
>>>
>> 
>> Hello Ilya,
>> 
>> thanks for the feedback about this.
>> 
>>> Could you try the following change on top of Gaetan's v1 and v4:
>>>
>>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>>> index aa4792f69..2c2203a4b 100644
>>> --- a/lib/conntrack.c
>>> +++ b/lib/conntrack.c
>>> @@ -633,14 +633,17 @@ conn_key_lookup(struct conntrack *ct, const struct conn_key *key,
>>>      bool found = false;
>>>  
>>>      CMAP_FOR_EACH_WITH_HASH (conn, cm_node, hash, &ct->conns) {
>>> -        if (!conn_key_cmp(&conn->key, key) && !conn_expired(conn, now)) {
>>> +        if (conn_expired(conn, now)) {
>>> +            continue;
>>> +        }
>>> +        if (!conn_key_cmp(&conn->key, key)) {
>>>              found = true;
>>>              if (reply) {
>>>                  *reply = false;
>>>              }
>>>              break;
>>>          }
>>> -        if (!conn_key_cmp(&conn->rev_key, key) && !conn_expired(conn, now)) {
>>> +        if (!conn_key_cmp(&conn->rev_key, key)) {
>>>              found = true;
>>>              if (reply) {
>>>                  *reply = true;
>>> ---
>>>
>>> ?
>>>
>>> Let me explain.  As far as I can tell without actually running performance
>>> tests myself, the main issue with both variants of Gaetan's series is a slow
>>> cleanup of expired connections.
>>>
>>> v4, IMO, contains a serious bug, because the expiration list is treated at the
>>> sweep phase as if it was sorted by the expiration time.  But that is not really
>>> true, because connections can not be re-ordered.  They seem to stay in the exact
>>> same order as they were created, but expiration times are getting updated when
>>> packets are received.  However, the sweep process stops at the moment the first
>>> non-expired connection is seen and we're not cleaning up later connections
>>> as reported by Paolo.  Taking into account the fact that default timeout policy
>>> for established TCP connections appears to be 24 hours (or did I misread this?),
>>> closed connections can stay not cleaned up for that time.  That seems like
>>> way too much.  The point is, the whole queue should be inspected on every sweep
>>> iteration in order to re-sort it and actually move connections between different
>>> expiration lists to avoid silent established connections blocking closed ones.
>>> This will probably be slow (?).  MPSC queue doesn't seem to be very suitable for
>>> such usecase.
>> 
>> I agree with the cause of the issue, and about the MPSC suitability as
>> well unless they are used in different ways.
>> 
>> Speaking of connections blocking other connections, there's a potential
>> problem in the current code and it is related to zones and timeout
>> policy.
>> 
>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),timeout=180,protoinfo=(state=TIME_WAIT)
>> 
>> In this case I set the timeout for tcp_end to 200 for zone 0, and 20
>> seconds to zone 1.
>> The problem is less severe compared to the ct-scale series, but could
>> (more or less significantly, depends on the way timeouts are configured)
>> increase the latency of a connection in the same list.
>
> Yep.  I noticed that issue too.  It is a bug in the current implementation.
> Expiration lists should not be global, but per-zone, AFAICT.
>
>> 
>>>
>>> v1.  Here the situation is better since threads are kind of allowed to reschedule
>>> on their own, but via RCU thread.  It looks like re-scheduled connections can be
>>> messed up between each other still though, because expiration can be updated
>>> multiple times, while re-insert is happening once.  And also these remove+push_back
>> 
>> which also makes me wonder if there's a chance of losing a timeout
>> transition.
>> 
>>> operations for different threads will be additionally re-ordered by the RCU
>>> thread.  So, the order is not really preserved here as well.  In order to fix
>>> that we need to replace push_back operation with the search+insert, i.e. find the
>>> place where the connection should be inserted.  And even that will not fully
>>> protect from the concurrent update of the expiration.  OTOH, not fully sorted
>>> list may be not a huge problem, because connections are still moved between
>>> different expiration lists, so ESTABLISHED connection can not block clean up of
>>> the closed one.
>>> The RCU thread though can be a serious bottleneck in this concept.  PMD threads
>>> are entering quiescent state only once per 1024 iterations or even more, so
>>> we can have 64K or more updates per thread per grace period.  So, expired or
>>> closed connections can stay for a prolonged period of time.
>> 
>> That's a good point.
>> 
>>>
>>> So, in both v1 and v4 of Gaetan's series there are, probably, a lot of elements
>>> in a hash map and lookup is likely slow because of this.  Also, conn_key_cmp()
>>> is an extremely heavy operation, and we're performing it twice for every connection
>>> with the same hash.  Paolo's series cheats here by quickly skipping or even
>>> cleaning up expired connection during the lookup.  Gaetan's series' are not able
>>> to clean them, but can still quickly skip all expired ones without executing
>>> two costly comparisons.  Hence the code snippet above.  Previously, conn_expired
>>> check required the global lock, but it doesn't with any of the series applied.
>>>
>>>
>>> In conclusion, I don't think that we can efficiently use MPSC queue directly.
>>> At least, we need an ability to efficiently move connections between lists to
>>> avoid CLOSED connections being blocked by ESTABLISHED ones.  One option is
>>> to always look through the whole list and re-insert every connection.  It might
>>> be not a huge problem though.  It was a problem when we had to hold the global
>>> mutex for that.  If we don't need to hold the global mutex, we may allow that.
>> 
>> Some thoughts/questions about it:
>> 
>> does this require to go all the way through the list at once?
>> Probably it's not a problem, but in case of many connections
>> destruction/creation, isn't there a chance we keep going through the
>> list for a long time?
>
> I'm not sure if that is a problem or not, but I don't really like that
> implementation either way.  Such extensive and awkward re-queueing will
> not look great.
>
>> 
>> For sure it is the most relevant, but I wonder if this is only
>> applicable to the ESTABLISHED list.
>> I'm thinking to e.g. UDP_FIRST and UDP_MULTIPLE with custom timeouts.
>
> I don't think it's specific to ESTABLISHED list.  It's just more
> pronounced on this one, because of dramatic difference in timeout
> values.  In general, entries in a list should all have the same
> timeout, otherwise we may fall into some nasty corner cases.
>

I know it may not be ideal, but using multiple rcu_lists, not associated
with any timeout, could solve the problems (even the one related to the
zones): at creation time each connection gets assigned to one of the lists
in round-robin so that they stay balanced, and at every reschedule the
sweeper scans the lists all the way through, without reordering, checking
only the expiration (which is atomically updated by the PMDs). This
approach adds an extra cost to the sweeper, because it has to scan all the
connections every time, but push_{front,back} and event handling also come
with a non-negligible cost.

Multiple rcu_lists may be needed because they would allow the sweeper to
yield the CPU (once the budget is exceeded) and restart from the next list
in line on the next round, if needed. I guess we could also reduce the
creation/destruction problem (which leads the sweeper to a long walk
through the lists) by pushing every new connection to the front.
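
To make the idea a bit more concrete, here is a minimal, self-contained
sketch of what I have in mind (it does not use the actual OVS rculist API;
the list count, the budget and all names are made up, and RCU protection of
the lists is omitted): connections are assigned round-robin at creation
time and a budget-bound sweeper walks the lists without ever reordering
them.

/* Illustrative sketch only: N unordered lists, round-robin assignment at
 * connection creation, and a sweeper that inspects a bounded number of
 * connections per round, resuming from the next list once the budget is
 * exceeded.  Synchronization (RCU, per-list protection) is left out. */
#include <stdatomic.h>
#include <stdlib.h>

#define N_SWEEP_LISTS 8
#define SWEEP_BUDGET  1024          /* max connections inspected per round */

struct sweep_conn {
    struct sweep_conn *next;
    atomic_llong expiration;        /* atomically updated by the PMDs */
};

struct sweep_list {
    struct sweep_conn *head;        /* creation order only, never re-sorted */
};

static struct sweep_list sweep_lists[N_SWEEP_LISTS];
static atomic_uint next_list;       /* round-robin cursor for new conns */
static unsigned sweep_cursor;       /* list the next sweep round starts from */

/* Creation path: pick the next list in round-robin and push front. */
static void
sweep_conn_add(struct sweep_conn *conn, long long now, long long timeout)
{
    unsigned idx = atomic_fetch_add(&next_list, 1) % N_SWEEP_LISTS;

    atomic_store(&conn->expiration, now + timeout);
    conn->next = sweep_lists[idx].head;
    sweep_lists[idx].head = conn;
}

/* One sweeper round: evict expired connections, stop once the budget is
 * reached and remember which list to start from on the next round. */
static void
sweep_round(long long now)
{
    unsigned inspected = 0;

    for (unsigned i = 0; i < N_SWEEP_LISTS; i++) {
        unsigned idx = (sweep_cursor + i) % N_SWEEP_LISTS;
        struct sweep_conn **prev = &sweep_lists[idx].head;

        while (*prev) {
            struct sweep_conn *conn = *prev;

            if (atomic_load(&conn->expiration) <= now) {
                *prev = conn->next;    /* unlink; real code would defer the free */
                free(conn);
            } else {
                prev = &conn->next;    /* keep it; the list is never reordered */
            }
            if (++inspected >= SWEEP_BUDGET) {
                sweep_cursor = (idx + 1) % N_SWEEP_LISTS;   /* yield here */
                return;
            }
        }
    }
    sweep_cursor = 0;
}

The per-connection cost on the PMD side stays a single atomic store of the
new expiration; all the walking is on the sweeper side.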

>> 
>> Otherwise,
>> 
>>> If the list itself is too long, we may just have normal lists and use MPSC
>>> queue to enqueue "update" events for the ct_sweep thread the same way as what
>>> is done for RCU thread in v1.  Can be a possible solution?
>>>
>> 
>> I guess this involve some reorder (any update could) inside the same
>> list.
>> 
>> In such case we have to handle both the events (and reorder, if we
>> care), and the evictions in two separate moments.
>
> I'm not sure if the strict order is required.  Relatively ordered
> list where all re-scheduled entries are always after non-rescheduled
> ones might be sufficient, if we're not looking for extreme accuracy.
> This should work fine, I guess, as long as we re-schedule everything
> that was updated within, let's say, a second, and as long as all
> connections in the same list has exactly the same timeout value.
>
>> 
>>> rculist has a chance to be a viable option, would be great to look at performance
>>> numbers with the code snippet above.
>>>
>>> Paolo's series seems to not have any obvious issues, at least I didn't spot
>>> any on a quick glance.  At least, until buckets are balanced.  In some cases
>>> we may end up locking the same bucket from different threads frequently though.

That's right. I suppose this would affect the ability to sustain a high
rate of connection creation/destruction, since that is when the locks are
acquired. The short-lived connection results would drop in that case, but
they should still be comparable to the other series, or slightly worse
(slightly worse is what I see when recompiling with one bucket only). A
significant imbalance (i.e. a large number of connections concentrated in
very few buckets) might be detrimental for the sweeper, though, which
would then run for a long time.

Long-lived connections or a bulk transfer over a single connection should
not be affected.
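
For reference, a minimal sketch of the per-bucket locking pattern this
reasoning assumes (not the actual ct-buckets code; the bucket count and all
names are invented): each connection hashes to one bucket, and
creation/destruction only takes that bucket's mutex, so a skewed hash
concentrates both the PMDs' contention and the sweeper's work on a few
buckets.

/* Illustrative sketch of per-bucket locking, not the actual ct-buckets code. */
#include <pthread.h>
#include <stdint.h>

#define N_CT_BUCKETS 256            /* made-up bucket count */

struct ct_bucket {
    pthread_mutex_t lock;           /* taken on connection create/destroy/sweep */
    /* ... per-bucket connection storage would live here ... */
};

static struct ct_bucket ct_buckets[N_CT_BUCKETS];

static void
ct_buckets_init(void)
{
    for (int i = 0; i < N_CT_BUCKETS; i++) {
        pthread_mutex_init(&ct_buckets[i].lock, NULL);
    }
}

/* Creation path: one hash, one per-bucket mutex.  With a balanced hash,
 * different PMDs usually hit different buckets; with a skewed hash, they
 * all serialize on the same few mutexes, and the sweeper spends most of
 * its time inside the oversized buckets. */
static void
ct_insert_sketch(uint32_t conn_hash)
{
    struct ct_bucket *bucket = &ct_buckets[conn_hash % N_CT_BUCKETS];

    pthread_mutex_lock(&bucket->lock);
    /* ... insert the connection into this bucket ... */
    pthread_mutex_unlock(&bucket->lock);
}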

>>> Do you have the original and reply traffic being handled by the same or
>>> different PMD threads in your performance tests?  Also, a lot of different
>>> connections is only one side of the performance.  Having a single traffic-heavy
>>> connection where original and reply directions are handled by 2 different PMD
>>> threads would be an important test as well.  I think, it is even closer to the
>>> real world cases.
>>>
>> 
>> Will be back on this and the v1/v4 + your snippet (which seems to be
>> potentially a measurable improvement) once some numbers are available.
>>

Back on this and the single flow.

We tested it and the numbers are not comparable with the previous ones, as
they were obtained on different hardware (and a slightly different setup),
but the improvement we measured is around 5% on v4 and around 4% on v1 for
short-lived connections.

No measurable improvement for long-lived connections, which makes sense
since that test deals with a much lower cps.

Regarding the single connection, for the time being the test was performed
with a TCP connection whose two directions are handled by different PMDs
(the same goes for the previous tests), and the performance looks
comparable. Gaetan's v4 performs slightly better:


| series      | MSS 1400       | MSS 100        |
|-------------+----------------+----------------|
| ct-scale-v4 | 12.7 Gb/sec    | 1.60 Gb/sec    |
| ct-buckets  | 12.6 Gb/sec    | 1.54 Gb/sec    |
| baseline    | 11.9 Gb/sec    | 1.14 Gb/sec    |

These numbers could be further confirmed using a higher-performance
generator (if needed), as the one used here was not very fast, but the
expectation is that the results will hold.

>>> Thoughts?
>>>
>>> Best regards, Ilya Maximets.
>> 
