On 4/6/22 16:38, Kumar Amber wrote:
> Hashing Optimization are also included which can further
> improve performance by approximately 10%.
Hi, Amber and others. That's an interesting patch set.
I got my hands on an AVX512-capable system for a few hours,
so I was able to play with it. A few observations:
1. The patch set does improve performance in my VM-OVS-VM
   setup by ~3.5%. I suppose your 10% was achieved in
   a PHY-OVS-PHY setup, right?
2. The hash function itself seems very generic, so it's not
necessary for it to be strictly part of the avx512 code.
3. Patch #3 doesn't seem to give any performance boost,
   at least in my testing. It also creates an assumption that
   the optimized miniflow_extract always calculates the hash,
   which is not great from a generalization point of view
   and may harm future attempts to decouple parsing and
   datapath implementations.
4. Patches #2 and #4 are just for testing purposes.
5. It appears that any call to a function that is not part
   of the avx512 code is extremely expensive, mostly because
   the compiler injects VZEROUPPER instructions just before
   every such call. These instructions are very heavy if
   executed per-packet. AFAIU, that is the cost of switching
   between AVX and non-AVX instructions; see the sketch
   after this list. Did you notice the same?
6. Reading entries from the miniflow is heavier than reading
from the packet itself for some reason. I didn't dig
very deep here.
7. All switch branches in mfex_avx512_process() have exactly
   the same code. I suppose that's a leftover from the
   IPv6 series.
8. In-packet offsets are magic numbers, which is not great.
   I think they can easily be defined as sums of structure
   offsets, without any runtime penalty; a second sketch
   after this list shows the idea.
9. 'hash_len' is not really a hash length, it's the length
   of the miniflow. Calculating 'key->len' vs setting it
   directly doesn't seem to make any difference
   performance-wise.
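
To illustrate point 5, here is a minimal sketch, not from the
patch set ('non_avx512_helper' and 'process_one' are made-up
names), of the pattern that makes GCC and Clang emit the
VZEROUPPER:

#include <immintrin.h>

void non_avx512_helper(void *aux);   /* Compiled without AVX-512. */

/* Built with AVX-512 enabled, e.g. -mavx512f.  Because the callee
 * may run legacy SSE code, the compiler emits VZEROUPPER before
 * the call to avoid SSE/AVX transition penalties. */
void
process_one(const void *pkt_data, void *aux)
{
    __m512i v = _mm512_loadu_si512(pkt_data);

    /* ... AVX-512 processing of 'v' ... */
    (void) v;

    non_avx512_helper(aux);   /* VZEROUPPER emitted right before
                               * this call, once per invocation. */
}

Inlining the helper, or keeping the whole per-packet path inside
the avx512 translation unit, avoids the transition.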
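
And for point 8, a sketch of what I mean by sums of structure
offsets; the macro names are made up, but both sums are
compile-time constants, so there is no runtime penalty:

#include <stddef.h>     /* offsetof */
#include "packets.h"    /* struct eth_header, struct ip_header, ... */

/* Offset of the IPv4 source address in an untagged Eth/IPv4 packet. */
#define MFEX_IPV4_SRC_OFS \
    (sizeof(struct eth_header) + offsetof(struct ip_header, ip_src))

/* Offset of the UDP ports behind a plain 20-byte IPv4 header. */
#define MFEX_UDP_PORTS_OFS \
    (sizeof(struct eth_header) + IP_HEADER_LEN \
     + offsetof(struct udp_header, udp_src))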
What I tried:
Since the hash function itself is independent of the avx512
code, I assumed that the same optimization could be applied to
the generic code path. So I attempted to generalize the function,
move it to a common header and use it inside the generic
miniflow_extract(). See the quick'n'dirty patch I used for
ipv4/udp traffic below. That gave a great result: the generic
code became 5-9% faster in my VM-OVS-VM setup! And I'm
guessing that PHY-OVS-PHY should be even better.
The next thing I tried was to replace the mfex_5tuple_hash_ipv4()
call with the generalized function, also removing the key->hash
and key->len assignments from it and the surrounding code, because
I also reverted patch #3. At this point I didn't notice any
significant performance difference for the avx512 implementation
in comparison with the original patch set. If there is some
difference, the generalized function can likely be tweaked a bit.
Conclusions:
The same optimization can be applied efficiently, regardless of
the datapath/miniflow_extract implementation, by using the same
common code without any extra sacrifices, while significantly
improving the maintainability of the solution.
Maintainability can be achieved by using actual parsed packet
fields and header offsets, without any magic numbers. There is
no need for autovalidation if the hash function is the same in
all cases. And there is no need for custom fuzzing if we can
add checks to existing fuzzers for oss-fuzz and test-flow; a
rough sketch of such a check follows.
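
Something like this, perhaps (a rough sketch only;
'rss_hash_invalidate' is hypothetical, any way of clearing the
RSS-valid flag on the clone would do, since dp_packet_clone()
copies the hash along with the data):

/* After the optimized path has set the RSS hash on 'packet',
 * recompute it through the generic path on a clone and make
 * sure both paths agree. */
static void
check_hash_consistency(struct dp_packet *packet)
{
    struct dp_packet *clone = dp_packet_clone(packet);
    struct {
        struct miniflow mf;
        uint64_t buf[FLOW_U64S];
    } m;

    rss_hash_invalidate(clone);        /* Hypothetical. */
    miniflow_extract(clone, &m.mf);    /* Generic path re-hashes. */

    ovs_assert(dp_packet_get_rss_hash(clone)
               == dp_packet_get_rss_hash(packet));
    dp_packet_delete(clone);
}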
What do you think?
Below is the code snippet from my experiments (supposed to be
applied on top of the first patch of the set; only the ipv4/udp
case is modified):
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index ee0805ae6..8bee55d50 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -1085,6 +1085,38 @@ dp_packet_l4_checksum_bad(const struct dp_packet *p)
            DP_PACKET_OL_RX_L4_CKSUM_BAD;
 }
 
+static inline void ALWAYS_INLINE
+dp_packet_update_rss_hash_ipv4_tcp_udp(struct dp_packet *packet)
+{
+    if (dp_packet_rss_valid(packet)) {
+        return;
+    }
+
+    const uint8_t *pkt = dp_packet_data(packet);
+    const uint16_t l3_ofs = packet->l3_ofs;
+    const void *ipv4_src = &pkt[l3_ofs + offsetof(struct ip_header, ip_src)];
+    const void *ipv4_dst = &pkt[l3_ofs + offsetof(struct ip_header, ip_dst)];
+    const void *l4_ports = &pkt[packet->l4_ofs];
+    uint32_t ip_src, ip_dst, ports;
+    uint32_t hash = 0;
+
+    memcpy(&ip_src, ipv4_src, sizeof ip_src);
+    memcpy(&ip_dst, ipv4_dst, sizeof ip_dst);
+    memcpy(&ports, l4_ports, sizeof ports);
+
+    /* IPv4 Src and Dst. */
+    hash = hash_add(hash, ip_src);
+    hash = hash_add(hash, ip_dst);
+    /* IPv4 proto. */
+    hash = hash_add(hash,
+                    pkt[l3_ofs + offsetof(struct ip_header, ip_proto)]);
+    /* L4 ports. */
+    hash = hash_add(hash, ports);
+    hash = hash_finish(hash, 42);
+
+    dp_packet_set_rss_hash(packet, hash);
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/dpif-netdev-extract-avx512.c b/lib/dpif-netdev-extract-avx512.c
index 1e0d4e31a..341fc6b0d 100644
--- a/lib/dpif-netdev-extract-avx512.c
+++ b/lib/dpif-netdev-extract-avx512.c
@@ -681,10 +681,7 @@ mfex_avx512_process(struct dp_packet_batch *packets,
                                   UDP_HEADER_LEN)) {
                 continue;
             }
-
-            mfex_5tuple_hash_ipv4(packet, pkt, &keys[i],
-                                  profile->hash_pkt_offs);
-            keys[i].len = profile->hash_len;
+            dp_packet_update_rss_hash_ipv4_tcp_udp(packet);
         } break;
         default:
             break;
diff --git a/lib/flow.c b/lib/flow.c
index dd523c889..4eb9694d6 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -1027,6 +1027,9 @@ miniflow_extract(struct dp_packet *packet, struct miniflow *dst)
                 miniflow_push_be16(mf, tp_dst, udp->udp_dst);
                 miniflow_push_be16(mf, ct_tp_src, ct_tp_src);
                 miniflow_push_be16(mf, ct_tp_dst, ct_tp_dst);
+                if (dl_type == htons(ETH_TYPE_IP)) {
+                    dp_packet_update_rss_hash_ipv4_tcp_udp(packet);
+                }
             }
         } else if (OVS_LIKELY(nw_proto == IPPROTO_SCTP)) {
             if (OVS_LIKELY(size >= SCTP_HEADER_LEN)) {
---
Best regards, Ilya Maximets.