On 6/20/22 23:57, Paolo Valerio wrote:
> Ilya Maximets <[email protected]> writes:
>
>> On 6/7/22 11:39, Robin Jarry wrote:
>>> Paolo Valerio, Jun 05, 2022 at 19:37:
>>>> Just a note that may be useful.
>>>> After some tests, I noticed that after establishing e.g. two TCP
>>>> connections and leaving the first one idle after the 3-way handshake,
>>>> once the second connection expires (after moving to TIME_WAIT as a
>>>> result of termination), it doesn't get evicted until some event gets
>>>> scheduled for the first one.
>>>>
>>>> ovs-appctl dpctl/dump-conntrack -s
>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=84576,protoinfo=(state=ESTABLISHED)
>>>> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
>>>>
>>>> This may be somewhat related to your results: during the test, the
>>>> number of connections may reach the limit, apparently reducing
>>>> performance.
>>>
>>> Indeed, there was an issue in my test procedure. Due to the way T-Rex
>>> generates connections, it is easy to fill the conntrack table after
>>> a few iterations, making the test results inconsistent.
>>>
>>> Also, the flows which I had configured were not correct. There was an
>>> extraneous action=NORMAL flow at the end. When the conntrack table is
>>> full and a new packet cannot be tracked, it is marked as +trk+inv and
>>> not dropped. This behaviour is specific to the userspace datapath. The
>>> Linux kernel datapath seems to drop the packet when it cannot be added
>>> to connection tracking.
>>>
>>> Gaëtan's series (v4) seems less resilient to the conntrack table being
>>> full, especially when there is more than one PMD core.
>>>
>>> I have changed the T-Rex script to allow running arbitrary commands in
>>> between traffic iterations. This is leveraged to flush the conntrack
>>> table and run each iteration under the same conditions.
>>>
>>> https://github.com/cisco-system-traffic-generator/trex-core/blob/v2.98/scripts/cps_ndr.py
>>>
>>> To avoid filling the conntrack table, the max size was increased to 50M.
>>> The DUT configuration can be summarized as follows:
>>>
>>> ovs-vsctl set open_vswitch . other_config:dpdk-init=true
>>> ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x15554"
>>> ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
>>> ovs-vsctl add-port br0 pf0 -- set Interface pf0 type=dpdk \
>>>     options:dpdk-devargs=0000:3b:00.0 options:n_rxq=4 options:n_rxq_desc=4096
>>> ovs-vsctl add-port br0 pf1 -- set Interface pf1 type=dpdk \
>>>     options:dpdk-devargs=0000:3b:00.1 options:n_rxq=4 options:n_rxq_desc=4096
>>> ovs-appctl dpctl/ct-set-maxconns 50000000
>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=-trk,actions=ct(table=0)"
>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+new,actions=ct(commit),NORMAL"
>>> ovs-ofctl add-flow br0 "table=0,priority=10,ip,ct_state=+trk+est,actions=NORMAL"
>>> ovs-ofctl add-flow br0 "table=0,priority=0,actions=drop"
>>>
>>> Short Lived Connections
>>> -----------------------
>>>
>>> ./cps_ndr.py --sample-time 10 --max-iterations 8 --error-threshold 0.01 \
>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>     --udp-percent 1 --num-messages 1 --message-size 20 --server-wait 0 \
>>>     --min-cps 10k --max-cps 600k
>>>
>>> ============== =============== ============== ============== ===========
>>> Series         Num. Flows      CPS            PPS            BPS
>>> ============== =============== ============== ============== ===========
>>> Baseline       40.1K           79.3K/s        556Kp/s        347Mb/s
>>> Gaetan v1      60.5K           121K/s         837Kp/s        522Mb/s
>>> Gaetan v4      61.4K           122K/s         850Kp/s        530Mb/s
>>> Paolo          377K            756K/s         5.3Mp/s        3.3Gb/s
>>> ============== =============== ============== ============== ===========
>>>
>>> Even after fixing the test procedure, Paolo's series still performs
>>> a lot better with short lived connections.
>>>
>>> Long Lived Connections
>>> ----------------------
>>>
>>> ./cps_ndr.py --sample-time 30 --max-iterations 8 --error-threshold 0.01 \
>>>     --reset-command "ssh $dut ovs-appctl dpctl/flush-conntrack" \
>>>     --udp-percent 1 --num-messages 500 --message-size 20 --server-wait 50 \
>>>     --min-cps 100 --max-cps 10k
>>>
>>> ============== =============== ============== ============== ===========
>>> Series         Num. Flows      CPS            PPS            BPS
>>> ============== =============== ============== ============== ===========
>>> Baseline       17.4K           504/s          633Kp/s        422Mb/s
>>> Gaetan v1      80.4K           3.1K/s         4.6Mp/s        3.0Gb/s
>>> Gaetan v4      139K            5.4K/s         8.2Mp/s        5.4Gb/s
>>> Paolo          132K            5.2K/s         7.7Mp/s        5.2Gb/s
>>> ============== =============== ============== ============== ===========
>>>
>>> Thanks to Paolo for his help on this second round of tests.
>>
>> Hi, Robin and Paolo.
>>
>
> Hello Ilya,
>
> thanks for the feedback about this.
>
>> Could you try the following change on top of Gaetan's v1 and v4:
>>
>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>> index aa4792f69..2c2203a4b 100644
>> --- a/lib/conntrack.c
>> +++ b/lib/conntrack.c
>> @@ -633,14 +633,17 @@ conn_key_lookup(struct conntrack *ct, const struct conn_key *key,
>>      bool found = false;
>>
>>      CMAP_FOR_EACH_WITH_HASH (conn, cm_node, hash, &ct->conns) {
>> -        if (!conn_key_cmp(&conn->key, key) && !conn_expired(conn, now)) {
>> +        if (conn_expired(conn, now)) {
>> +            continue;
>> +        }
>> +        if (!conn_key_cmp(&conn->key, key)) {
>>              found = true;
>>              if (reply) {
>>                  *reply = false;
>>              }
>>              break;
>>          }
>> -        if (!conn_key_cmp(&conn->rev_key, key) && !conn_expired(conn, now)) {
>> +        if (!conn_key_cmp(&conn->rev_key, key)) {
>>              found = true;
>>              if (reply) {
>>                  *reply = true;
>> ---
>>
>> ?
>>
>> Let me explain. As far as I can tell without actually running performance
>> tests myself, the main issue with both variants of Gaetan's series is the
>> slow cleanup of expired connections.
>>
>> v4, IMO, contains a serious bug, because the expiration list is treated at
>> the sweep phase as if it was sorted by expiration time. But that is not
>> really true, because connections cannot be re-ordered. They seem to stay
>> in the exact same order as they were created, but expiration times are
>> getting updated when packets are received. However, the sweep process
>> stops at the moment the first non-expired connection is seen, and we're
>> not cleaning up later connections, as reported by Paolo. Taking into
>> account the fact that the default timeout policy for established TCP
>> connections appears to be 24 hours (or did I misread this?), closed
>> connections can stay uncleaned for that long. That seems like way too
>> much. The point is, the whole queue should be inspected on every sweep
>> iteration in order to re-sort it and actually move connections between
>> different expiration lists, to avoid silent established connections
>> blocking closed ones. This will probably be slow (?). An MPSC queue
>> doesn't seem to be very suitable for such a use case.
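
To make that failure mode concrete, here is a toy illustration (this is not
code from OVS or from either series; every name in it is made up) of a sweep
that assumes expiration order and stops at the first non-expired entry of a
creation-ordered list:

/*
 * Toy illustration only -- not code from OVS or from either series.
 * It shows why sweeping a creation-ordered list while assuming expiration
 * order leaves expired entries stranded behind a long-lived one.
 */
#include <stdio.h>

struct toy_conn {
    const char *name;
    long long expiration;      /* Absolute expiration time, in ms. */
    struct toy_conn *next;     /* Creation (insertion) order. */
};

/* Sweep that assumes the list is sorted by expiration time. */
static int
sweep_with_early_break(struct toy_conn *head, long long now)
{
    int evicted = 0;

    for (struct toy_conn *c = head; c; c = c->next) {
        if (c->expiration > now) {
            break;   /* Assumes everything after this expires even later. */
        }
        printf("evicting %s\n", c->name);
        evicted++;
    }
    return evicted;
}

int
main(void)
{
    long long now = 1000;

    /* Created in this order.  The first one was refreshed by traffic and
     * now expires far in the future; the two behind it are already
     * expired (think TIME_WAIT with timeout=0). */
    struct toy_conn closed2 = { "closed-2", 900, NULL };
    struct toy_conn closed1 = { "closed-1", 800, &closed2 };
    struct toy_conn established = { "established",
                                    now + 24 * 3600 * 1000LL, &closed1 };

    printf("evicted %d of 2 expired connections\n",
           sweep_with_early_break(&established, now));   /* Prints 0. */
    return 0;
}

The two already-expired entries are never reached because the refreshed
entry at the head shields them, which matches the stuck TIME_WAIT entry
reported above.
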
>
> I agree with the cause of the issue, and about the MPSC suitability as
> well, unless they are used in different ways.
>
> Speaking of connections blocking other connections, there's a potential
> problem in the current code as well, related to zones and timeout
> policies.
>
> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9090,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9090),zone=1,timeout=0,protoinfo=(state=TIME_WAIT)
> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=9091,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=9091),timeout=180,protoinfo=(state=TIME_WAIT)
>
> In this case I set the timeout for tcp_end to 200 seconds for zone 0, and
> to 20 seconds for zone 1.
> The problem is less severe compared to the ct-scale series, but, depending
> on how the timeouts are configured, it could more or less significantly
> increase the eviction latency of a connection in the same list.

Yep. I noticed that issue too. It is a bug in the current implementation.
Expiration lists should not be global, but per-zone, AFAICT.
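
Roughly, I'd picture something in this direction (only a sketch to show the
shape; none of these names exist in the code, and the real structures would
use the OVS hmap/ovs_list types instead of toy pointers):

/* Sketch only -- all names are made up.  The idea: one set of expiration
 * lists per conntrack zone, so a zone whose timeout policy is long cannot
 * delay eviction in a zone whose policy is short. */
#include <stdint.h>
#include <stdio.h>

#define N_TIMEOUT_CLASSES 4            /* e.g. TCP_EST, TCP_END, UDP, ... */

struct toy_conn {
    long long expiration;              /* Absolute expiration time. */
    struct toy_conn *next;
};

struct zone_exp_lists {
    uint16_t zone;                                /* Conntrack zone id. */
    struct toy_conn *heads[N_TIMEOUT_CLASSES];    /* Oldest first. */
    struct zone_exp_lists *next;                  /* Next zone in use. */
};

/* Each zone is swept independently, so the zone with tcp_end=20s is not
 * stuck behind entries of the zone with tcp_end=200s. */
static int
sweep_zone(const struct zone_exp_lists *z, long long now)
{
    int evicted = 0;

    for (int tm = 0; tm < N_TIMEOUT_CLASSES; tm++) {
        for (struct toy_conn *c = z->heads[tm];
             c && c->expiration <= now; c = c->next) {
            evicted++;                 /* Real code would unlink + destroy. */
        }
    }
    return evicted;
}

int
main(void)
{
    struct toy_conn expired = { 100, NULL };
    struct zone_exp_lists zone1 = { 1, { &expired }, NULL };

    printf("zone %d: evicted %d\n", zone1.zone, sweep_zone(&zone1, 1000));
    return 0;
}
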
>
>> v1. Here the situation is better since threads are kind of allowed to
>> reschedule on their own, but via the RCU thread. It looks like
>> re-scheduled connections can still be mixed up with each other, because
>> the expiration can be updated multiple times while the re-insert happens
>> only once. And also these remove+push_back
>
> which also makes me wonder if there's a chance of losing a timeout
> transition.
>
>> operations for different threads will additionally be re-ordered by the
>> RCU thread. So, the order is not really preserved here either. In order
>> to fix that, we need to replace the push_back operation with a
>> search+insert, i.e. find the place where the connection should be
>> inserted. And even that will not fully protect from a concurrent update
>> of the expiration. OTOH, a not fully sorted list may not be a huge
>> problem, because connections are still moved between different
>> expiration lists, so an ESTABLISHED connection cannot block cleanup of a
>> closed one.
>> The RCU thread, though, can be a serious bottleneck in this concept. PMD
>> threads enter the quiescent state only once per 1024 iterations or even
>> less often, so we can have 64K or more updates per thread per grace
>> period. So, expired or closed connections can stay around for a
>> prolonged period of time.
>
> That's a good point.
>
>> So, in both v1 and v4 of Gaetan's series there are, probably, a lot of
>> elements in the hash map, and lookup is likely slow because of this.
>> Also, conn_key_cmp() is an extremely heavy operation, and we're
>> performing it twice for every connection with the same hash. Paolo's
>> series cheats here by quickly skipping or even cleaning up expired
>> connections during the lookup. Gaetan's series is not able to clean
>> them, but can still quickly skip all expired ones without executing two
>> costly comparisons. Hence the code snippet above. Previously, the
>> conn_expired check required the global lock, but it doesn't with any of
>> the series applied.
>>
>> In conclusion, I don't think that we can efficiently use an MPSC queue
>> directly. At least, we need the ability to efficiently move connections
>> between lists to avoid CLOSED connections being blocked by ESTABLISHED
>> ones. One option is to always look through the whole list and re-insert
>> every connection. It might not be a huge problem though. It was a
>> problem when we had to hold the global mutex for that. If we don't need
>> to hold the global mutex, we may allow that.
>
> Some thoughts/questions about it:
>
> does this require going all the way through the list at once?
> Probably it's not a problem, but in case of many connection
> destructions/creations, isn't there a chance we keep going through the
> list for a long time?

I'm not sure whether that is a problem or not, but I don't really like that
implementation either way. Such extensive and awkward re-queueing will not
look great.

> For sure it is the most relevant one, but I wonder if this is only
> applicable to the ESTABLISHED list.
> I'm thinking of e.g. UDP_FIRST and UDP_MULTIPLE with custom timeouts.

I don't think it's specific to the ESTABLISHED list. It's just more
pronounced on this one, because of the dramatic difference in timeout
values. In general, entries in a list should all have the same timeout,
otherwise we may fall into some nasty corner cases.
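
To spell out the property I'm relying on (again only a toy sketch with
made-up names, not code from OVS or from any of the series): if every entry
of a list gets the same timeout and a refreshed entry is moved to the tail,
the list stays sorted by expiration time, so stopping the sweep at the first
non-expired entry is then correct:

/* Toy sketch only. */
#include <assert.h>
#include <stddef.h>

struct toy_conn {
    long long expiration;
    struct toy_conn *prev, *next;
};

struct toy_list {
    struct toy_conn *head, *tail;   /* Oldest expiration at the head. */
    long long timeout;              /* Same timeout for every entry. */
};

static void
list_push_back(struct toy_list *list, struct toy_conn *conn)
{
    conn->next = NULL;
    conn->prev = list->tail;
    if (list->tail) {
        list->tail->next = conn;
    } else {
        list->head = conn;
    }
    list->tail = conn;
}

static void
list_remove(struct toy_list *list, struct toy_conn *conn)
{
    if (conn->prev) {
        conn->prev->next = conn->next;
    } else {
        list->head = conn->next;
    }
    if (conn->next) {
        conn->next->prev = conn->prev;
    } else {
        list->tail = conn->prev;
    }
}

/* Called when a packet refreshes the connection.  Because 'timeout' is the
 * same for every entry in 'list', a move to the tail is enough to keep the
 * whole list sorted by expiration time. */
static void
conn_refresh(struct toy_list *list, struct toy_conn *conn, long long now)
{
    conn->expiration = now + list->timeout;
    list_remove(list, conn);
    list_push_back(list, conn);
}

int
main(void)
{
    struct toy_list list = { NULL, NULL, 30 };
    struct toy_conn a = { 30 }, b = { 31 }, c = { 32 };  /* Created at 0,1,2. */

    list_push_back(&list, &a);
    list_push_back(&list, &b);
    list_push_back(&list, &c);

    conn_refresh(&list, &a, 10);    /* 'a' now expires at 40, moved last. */

    for (struct toy_conn *p = list.head; p && p->next; p = p->next) {
        assert(p->expiration <= p->next->expiration);   /* Still sorted. */
    }
    return 0;
}

Combined with per-zone lists as sketched earlier, that should avoid the
blocking issues above without any explicit re-sorting.
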
>
> Otherwise,
>
>> If the list itself is too long, we may just have normal lists and use an
>> MPSC queue to enqueue "update" events for the ct_sweep thread, the same
>> way as is done for the RCU thread in v1. Could that be a possible
>> solution?
>>
>
> I guess this involves some reordering (any update could) inside the same
> list.
>
> In that case we have to handle both the events (and the reordering, if we
> care about it) and the evictions at two separate moments.

I'm not sure the strict order is required. A relatively ordered list, where
all re-scheduled entries always come after non-rescheduled ones, might be
sufficient if we're not looking for extreme accuracy. This should work
fine, I guess, as long as we re-schedule everything that was updated
within, let's say, a second, and as long as all connections in the same
list have exactly the same timeout value.

>
>> rculist has a chance of being a viable option; it would be great to look
>> at performance numbers with the code snippet above.
>>
>> Paolo's series seems not to have any obvious issues, at least I didn't
>> spot any on a quick glance. At least, as long as the buckets stay
>> balanced. In some cases we may end up locking the same bucket from
>> different threads frequently though.
>> Do you have the original and reply traffic being handled by the same or
>> different PMD threads in your performance tests? Also, a lot of
>> different connections is only one side of the performance. Having a
>> single traffic-heavy connection where the original and reply directions
>> are handled by 2 different PMD threads would be an important test as
>> well. I think it is even closer to real-world cases.
>>
>
> Will be back on this and on v1/v4 + your snippet (which seems to be
> potentially a measurable improvement) once some numbers are available.
>
>> Thoughts?
>>
>> Best regards, Ilya Maximets.

> _______________________________________________
> dev mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
