Initializing the dp_packet's metadata can be a hot spot, especially
for very simple pipelines. Therefore improving the code here can
sometimes make a difference.
Using memcpy instead of a plain assignment helps GCC and clang generate
faster code. Here's a comparison of the compiler-generated code (GCC 4.8)
with and without this commit.
BEFORE (assignment)           | AFTER (memcpy)
c8: add $0x8,%r8              | d8: mov (%rsi),%r8
    mov (%rcx),%r9            |     mov (%rdx),%rdi
    mov (%rbx),%r11d          |     add $0x1,%ecx
    mov %r10,%rcx             |     add $0x8,%rsi
    cmp %rsi,%r8              |     cmp -0x870(%rbp),%ecx
    lea 0x88(%r9),%rdi        |     mov %rdi,0x88(%r8)
    rep stos %rax,%es:(%rdi)  |     mov 0x8(%rdx),%rdi
    mov %r11d,0xb8(%r9)       |     lea 0x88(%r8),%rax
    mov %r8,%rcx              |     mov %rdi,0x90(%r8)
    jne c8                    |     mov 0x10(%rdx),%rdi
                              |     mov %rdi,0x98(%r8)
                              |     mov 0x18(%rdx),%rdi
                              |     mov %rdi,0xa0(%r8)
                              |     mov 0x20(%rdx),%r8
                              |     mov %r8,0x20(%rax)
                              |     mov 0x28(%rdx),%r8
                              |     mov %r8,0x28(%rax)
                              |     mov 0x30(%rdx),%r8
                              |     mov %r8,0x30(%rax)
                              |     jl d8
The old code uses a 'rep stos' and fetches the 'port_no' value from
the 'port' member at every iteration ('mov (%rbx),%r11d'), while the
new code uses a series of mov operations to accomplish everything.
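
As a rough standalone illustration of the two variants compared above
(not the actual OVS code), here is a minimal sketch. 'struct demo_md',
'struct demo_packet', 'DEMO_MD_INITIALIZER' and both function names are
hypothetical stand-ins; the real 'struct pkt_metadata' has a different
layout. Both functions leave the packets in the same state; whether the
generated code differs depends on the compiler version and flags.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct pkt_metadata; the real layout differs. */
struct demo_md {
    uint32_t recirc_id;
    uint32_t dp_hash;
    uint32_t skb_priority;
    uint32_t pkt_mark;
    uint32_t in_port;
    uint64_t tunnel[4];
};

/* Hypothetical stand-in for struct dp_packet. */
struct demo_packet {
    char payload[64];
    struct demo_md md;
};

/* Zeroes every member except 'in_port' (hypothetical helper macro). */
#define DEMO_MD_INITIALIZER(PORT) { .in_port = (PORT) }

/* "Before" variant: assign a freshly built value to every packet. */
static void
init_md_assign(struct demo_packet **packets, int cnt, const uint32_t *port_no)
{
    for (int i = 0; i < cnt; i++) {
        packets[i]->md = (struct demo_md) DEMO_MD_INITIALIZER(*port_no);
    }
}

/* "After" variant: build the template once, then memcpy it per packet. */
static void
init_md_memcpy(struct demo_packet **packets, int cnt, const uint32_t *port_no)
{
    const struct demo_md md = DEMO_MD_INITIALIZER(*port_no);

    for (int i = 0; i < cnt; i++) {
        memcpy(&packets[i]->md, &md, sizeof md);
    }
}
```

To compare what a given compiler does with the two loops, one option is
to build this file with optimization enabled and inspect the assembly
(for example with 'gcc -O2 -S').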
I can measure a throughput improvement of ~7% on a single flow phy-phy test
with 64-byte UDP packets.
The improvement has been observed on an Intel Xeon Sandy Bridge (2012)
and on an Intel Xeon Westmere (2010).
Signed-off-by: Daniele Di Proietto <[email protected]>
---
lib/dpif-netdev.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index f1d65f5..7d55997 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2507,13 +2507,16 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
     error = netdev_rxq_recv(rxq, packets, &cnt);
     cycles_count_end(pmd, PMD_CYCLES_POLLING);
     if (!error) {
+        const struct pkt_metadata md = PKT_METADATA_INITIALIZER(port->port_no);
         int i;
         *recirc_depth_get() = 0;
         /* XXX: initialize md in netdev implementation. */
         for (i = 0; i < cnt; i++) {
-            packets[i]->md = PKT_METADATA_INITIALIZER(port->port_no);
+            /* Use a memcpy instead of an assignment because it helps GCC and
+             * clang generate better code (even if the call gets inlined) */
+            memcpy(&packets[i]->md, &md, sizeof md);
         }
         cycles_count_start(pmd);
         dp_netdev_input(pmd, packets, cnt);
--
2.1.4