> Here are some more thoughts about loop unroll...
> In another mail [1], you are discussing manual loop unroll for
> rte_ipv4/ipv6_phdr_cksum().
> Perhaps the compiler already loop unrolls those.
> Check the assembler output for the existing code calling __rte_raw_cksum().
> If the compiler doesn't loop unroll __rte_raw_cksum() for those two
> functions, maybe you can help it by modifying __rte_raw_cksum(); try
> replacing the end pointer with an int counter, which will be compile time
> constant when called by rte_ipv4/ipv6_phdr_cksum().
>
> [1]:
> https://inbox.dpdk.org/dev/CAFn2buA5NzmzA0+t1_5auigvQTyT7Ne6RMVaPVU=sdc03nd...@mail.gmail.com/
>
> PS: I do the following when optimizing inline functions: Add non-inline
> functions calling the inline functions, and then use "objdump -S" to look at
> the generated code. E.g.:
>
> uint32_t review__rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> { return __rte_raw_cksum(buf, len, sum); }
>
> uint32_t review__rte_raw_cksum_len20(const void *buf, uint32_t sum)
> { return __rte_raw_cksum(buf, 20, sum); }
>
> uint32_t review__rte_raw_cksum_len8(const void *buf, uint32_t sum)
> { return __rte_raw_cksum(buf, 8, sum); }
>
https://godbolt.org/z/qr39hf76s

rte_ipv4_phdr_cksum() and rte_ipv6_phdr_cksum() are both fully unrolled
at -O2 or higher. Vectorization also happens, although clang chooses not
to vectorize the IPv4 case. Yay compilers :)
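For anyone who wants to repeat the check locally, here is a rough sketch of the counter-based loop and the review wrappers from the quoted mail. Note that sketch_raw_cksum() and the review_sketch_*() names are made up for illustration; this is a simplified stand-in, not the actual __rte_raw_cksum() from DPDK, and it does no folding of the 32-bit sum.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for __rte_raw_cksum() (NOT the DPDK code).
 * It adds up 16-bit words in host byte order using an integer counter
 * instead of an end pointer, so the trip count is a compile-time
 * constant whenever len is a constant at the call site.
 */
static inline uint32_t
sketch_raw_cksum(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	size_t words = len / 2;
	size_t i;

	for (i = 0; i < words; i++) {
		uint16_t w;

		/* memcpy avoids unaligned and aliasing problems */
		memcpy(&w, p + 2 * i, sizeof(w));
		sum += w;
	}

	/* odd length: copy the last byte into a zeroed 16-bit word */
	if (len & 1) {
		uint16_t w = 0;

		memcpy(&w, p + 2 * words, 1);
		sum += w;
	}

	return sum;
}

/* Non-inline wrappers, only there so the generated code can be inspected. */
uint32_t review_sketch_cksum(const void *buf, size_t len, uint32_t sum)
{ return sketch_raw_cksum(buf, len, sum); }

uint32_t review_sketch_cksum_len20(const void *buf, uint32_t sum)
{ return sketch_raw_cksum(buf, 20, sum); }

uint32_t review_sketch_cksum_len8(const void *buf, uint32_t sum)
{ return sketch_raw_cksum(buf, 8, sum); }
```

Building this with -O2 -g -c and running "objdump -S" on the object file should make it easy to compare the generic wrapper against the len20 and len8 ones. Whether the constant-length versions end up unrolled or vectorized will depend on the compiler and its version.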

