> Plus the benchmark was bogus anyway, and when I built a more specific
> harness -- actually comparing the TCP sequence number functions --
> SipHash was faster than MD5, even on register starved x86. So I think
> we're fine and this chapter of the discussion can come to a close, in
> order to move on to more interesting things.

Do we have to go through this?  No, the benchmark was *not* bogus.

Here are my results from *your* benchmark.  I can't reboot some of my test
machines, so I took net/core/secure_seq.c, lib/siphash.c, lib/md5.c and
include/linux/siphash.h straight out of your test tree.

Then I replaced the kernel #includes with the necessary typedefs
and #defines to make it compile in user-space.  (Voluminous but
straightforward.)  E.g.

#define __aligned(x) __attribute__((__aligned__(x)))
#define ____cacheline_aligned __aligned(64)
#define CONFIG_INET 1
#define IS_ENABLED(x) 1
#define ktime_get_real_ns() 0
#define sysctl_tcp_timestamps 0

... etc.

Then I modified your benchmark code into the appended code.  The
differences are:
* I didn't iterate 100K times, I timed the functions *once*.
* I saved the times in a buffer and printed them at the end of each
  iteration, so printf() wouldn't pollute the caches mid-measurement.
* Before every even-numbered iteration, I flushed the I-cache
  of everything from _init to _fini (i.e. all the non-library code).
  This cold-cache case is what is going to happen in the kernel.

In the results below, note that I did *not* re-flush between phases
of the test.  The effect of caching is clearly apparent in the tcpv4
results, where the tcpv6 code has already loaded the cache.

You can also see that the SipHash code benefits more from caching when
entered with a cold cache, as it iterates over the input words, while
the MD5 code is one big unrolled blob.

Order of computation is down the columns first, across second.

The P4 results were:
tcpv6 md5 cold:         4084    3488    3584    3584    3568
tcpv4 md5 cold:         1052     996     996    1060     996
tcpv6 siphash cold:     4080    3296    3312    3296    3312
tcpv4 siphash cold:     2968    2748    2972    2716    2716
tcpv6 md5 hot:           900     712     712     712     712
tcpv4 md5 hot:           632     672     672    672      672
tcpv6 siphash hot:      2484    2292    2340    2340    2340
tcpv4 siphash hot:      1660    1560    1564    2340    1564

SipHash actually wins slightly in the cold-cache case, because its
loop re-executes code that is already cached.  In the hot-cache case,
it loses horribly.

Core 2 duo:
tcpv6 md5 cold:         3396    2868    2964    3012    2832
tcpv4 md5 cold:         1368    1044    1320    1332    1308
tcpv6 siphash cold:     2940    2952    2916    2448    2604
tcpv4 siphash cold:     3192    2988    3576    3504    3624
tcpv6 md5 hot:          1116    1032     996    1008    1008
tcpv4 md5 hot:           936     936     936     936     936
tcpv6 siphash hot:      1200    1236    1236    1188    1188
tcpv4 siphash hot:       936     804     804     804     804

Pretty much a tie, honestly.

Ivy Bridge:
tcpv6 md5 cold:         6086    6136    6962    6358    6060
tcpv4 md5 cold:          816     732    1046    1054    1012
tcpv6 siphash cold:     3756    1886    2152    2390    2566
tcpv4 siphash cold:     3264    2108    3026    3120    3526
tcpv6 md5 hot:          1062     808     824     824     832
tcpv4 md5 hot:           730     730     740     748     748
tcpv6 siphash hot:       960     952     936    1112     926
tcpv4 siphash hot:       638     544     562     552     560

Modern processors *hate* cold caches.  But notice how md5 is *faster*
than SipHash on hot-cache IPv6.

Ivy Bridge, -m64:
tcpv6 md5 cold:         4680    3672    3956    3616    3525
tcpv4 md5 cold:         1066    1416    1179    1179    1134
tcpv6 siphash cold:      940    1258    1995    1609    2255
tcpv4 siphash cold:     1440    1269    1292    1870    1621
tcpv6 md5 hot:          1372    1111    1122    1088    1088
tcpv4 md5 hot:           997     997     997     997     998
tcpv6 siphash hot:       340     340     340     352     340
tcpv4 siphash hot:       227     238     238     238     238

Of course, when you compile -m64, SipHash is unbeatable.


Here's the modified benchmark() code.  The entire package is
a bit voluminous for the mailing list, but anyone is welcome to it.

static void clflush(void)
{
        extern char const _init, _fini;
        char const *p = &_init;

        while (p < &_fini) {
                asm("clflush %0" : : "m" (*p));
                p += 64;
        }
}

typedef uint32_t cycles_t;
static cycles_t get_cycles(void)
{
        uint32_t eax, edx;
        asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
        return eax;
}

static int benchmark(void)
{
        cycles_t start, finish;
        int i;
        u32 seq_number = 0;
        __be32 saddr6[4] = { 1, 4, 182, 393 };
        __be32 daddr6[4] = { 9192, 18288, 2222222, 0xffffff10 };
        __be32 saddr4 = 28888, daddr4 = 182112;
        __be16 sport = 22, dport = 41992;
        u32 tsoff;
        cycles_t result[4];

        printf("seq num benchmark\n");

        for (i = 0; i < 10; i++) {
                if ((i & 1) == 0)
                        clflush();

                start = get_cycles();
                seq_number += secure_tcpv6_sequence_number_md5(saddr6, daddr6,
                                sport, dport, &tsoff);
                finish = get_cycles();
                result[0] = finish - start;

                start = get_cycles();
                seq_number += secure_tcp_sequence_number_md5(saddr4, daddr4,
                                sport, dport, &tsoff);
                finish = get_cycles();
                result[1] = finish - start;

                start = get_cycles();
                seq_number += secure_tcpv6_sequence_number(saddr6, daddr6,
                                sport, dport, &tsoff);
                finish = get_cycles();
                result[2] = finish - start;

                start = get_cycles();
                seq_number += secure_tcp_sequence_number(saddr4, daddr4,
                                sport, dport, &tsoff);
                finish = get_cycles();
                result[3] = finish - start;

                printf("* Iteration %d results:\n", i);
                printf("secure_tcpv6_sequence_number_md5# cycles: %u\n",
                                result[0]);
                printf("secure_tcp_sequence_number_md5# cycles: %u\n",
                                result[1]);
                printf("secure_tcpv6_sequence_number_siphash# cycles: %u\n",
                                result[2]);
                printf("secure_tcp_sequence_number_siphash# cycles: %u\n",
                                result[3]);
                printf("benchmark result: %u\n", seq_number);
        }

        printf("benchmark result: %u\n", seq_number);
        return 0;
}
//device_initcall(benchmark);

int
main(void)
{
        memset(net_secret, 0xff, sizeof net_secret);
        memset(net_secret_md5, 0xff, sizeof net_secret_md5);
        return benchmark();
}