Oops, I forgot to attach the assembly generated by gcc-4.9.2. My bad. And doubly my bad: I was straight-up wrong; the slow version is much slower (1.2s vs. 0.3s with -Ofast). I had screwed up the conditionals that chose which branch to run. Well, I've gotta admit it when I'm wrong.... But I think the other points still stand.
On Tue, Dec 1, 2015 at 7:46 AM, Dan Cross <[email protected]> wrote:

> On Tue, Dec 1, 2015 at 12:35 AM, 'Davide Libenzi' via Akaros
> <[email protected]> wrote:
>
>> On Mon, Nov 30, 2015 at 8:07 PM, Dan Cross <[email protected]> wrote:
>>
>>> On Sun, Nov 29, 2015 at 8:04 PM, 'Davide Libenzi' via Akaros
>>> <[email protected]> wrote:
>>>
>>>> On Sun, Nov 29, 2015 at 2:28 PM, Dan Cross <[email protected]> wrote:
>>>>
>>>>> But very often these functions are on the boundary of a program; a
>>>>> place where performance doesn't matter that much.
>>>>
>>>> Agreed, in this case it does not matter much.
>>>> But that code has two issues: performance, and using macros where static
>>>> inlines are clearly the better choice (PBIT32(buf,
>>>> some_side_effect_or_slow_func(..)) anybody?).
>>>
>>> So in fairness, that code was originally written well before 'static
>>> inline' was a thing.
>>
>> They were certainly here in 2012 😉
>> TUESDAY, APRIL 03, 2012
>
> ...but Rob didn't write any macros in that blog post....
>
>>> On my Mac, using gcc-5.2, I get the exact same assembly code for both
>>> forms. See attached; the .s file was produced by "gcc-mp-5 -Os -S bench.c".
>>
>> Yes, *5.2* does that.
>> Barrett showed some code which looked like hell, plus flying sharks
>> armed with RPGs and AKs, while my GCC (4.8.2 - AKA 2014) generated code
>> which simply looked like hell.
>> So, which one is it? Was it meant for old compilers (like above), or was
>> it meant for compilers which showed up in 2015? ☺
>
> Let's run it through gcc-4.9.2 (see attached: gcc-mp-4.9 -std=c11 -fasm
> -Os -S bench.c). Yeah, the assembly is not as pretty, but did you
> actually measure the elapsed runtime?
> They appear to be about the same to me:
>
> : hurricane; time ./bench abcd fast
> t 1684234849000000000
>
> real 0m0.339s
> user 0m0.333s
> sys 0m0.003s
> : hurricane; time ./bench abcd slow
> t 1684234849000000000
>
> real 0m0.334s
> user 0m0.328s
> sys 0m0.003s
> : hurricane;
>
> Further, you're arguing for a technique based on hardware that didn't make
> this fast until pretty recently (I can't remember when unaligned access
> became fast on x86). Sure, the object code is a bit bigger (4 words instead
> of 2 bytes), so it takes up more space in the icache, but for something this
> small, I don't think it matters. Moral: measure, but only when it can be
> shown that it's important.
>
>>>> Even on ARM - ARM64, that is (the only thing that matters, eventually,
>>>> for Akaros) - a single load/store is faster than open coding.
>>>> Unaligned accesses faulting (or sucking) are a thing of the past.
>>>> Processors doing that are either dead, or being turned around with new
>>>> silicon versions.
>>>
>>> That's a dangerous assumption, and as there's clearly no harm in writing
>>> it the portable way, since I get the same output anyway, I don't see a
>>> point in making the assumption.
>>
>> Note that nobody was trying to push anything which wasn't portable. You
>> came up with the assembly thing.
>
> '*(uint32_t *)p;' isn't portable because of alignment issues (unless you
> can guarantee that p always points to properly aligned data). Sure, you can
> wrap that up in an 'ifdef' so that you don't compile it on a system where
> alignment is important, but the code itself is still inherently unportable.
> ifdef'ing it out or handwaving away platforms where it matters doesn't
> really change that. I'd rather just write one version of the code that's
> portable.
>
>>>>> I think the overarching point of Rob's post was that if a programmer
>>>>> feels like s/he needs to write something to deal with the endianness
>>>>> of the machine one is on, one's almost certainly going to be wrong.
>>>> Really? And who's this guy? Anyone I can recognize here? 😀
>>>
>>> Rob Pike? No, he's not one of the scientists in that picture (cool
>>> picture, by the way). But he is this guy:
>>> https://en.wikipedia.org/wiki/The_Unix_Programming_Environment
>>> https://en.wikipedia.org/wiki/The_Practice_of_Programming
>>
>> I will always take hard shots at the guys who assume that either
>> "other people will get it wrong" or, along the same lines, "other people
>> will fail because they failed".
>
> ...but he didn't fail at anything. His point is absolutely correct.
>
>>>>> But now one is resorting to tricks with conditional compilation (which
>>>>> is kind of Rob's point), which gives rise to a combinatorial explosion
>>>>> of options that I can't compile and test everywhere.
>>>>>
>>>>> So instead of LE && FAST_UNALIGNED one has
>>>>> /sys/src/libc/x86_64/accessdata.s, which has things like:
>>>>>
>>>>> .globl gets32le
>>>>> gets32le:
>>>>> movl (%rdi),%eax
>>>>> ret
>>>>>
>>>>> Then I just link against the library that's right for my platform. No
>>>>> #ifdefs in sight.
>>>>
>>>> Then you need an assembly file per architecture, which is arguably less
>>>> clear than C code, and more likely to be gotten wrong.
>>>
>>> What? You have assembly files anyway in most C libraries....
>>
>> Yes, like, in linux, 8 of them ☺
>
> Check out glibc:
>
> : chandra; find glibc-2.19 -name '*.[Ss]' | wc -l
> 2061
> : chandra;
>
>> One thing is tackling a port and having to code a few boot files in
>> assembly, and another is to rewrite a bunch of assembly code which could
>> have been as efficiently written in C.
>
> But more to the point, I think you're fixating on the details and missing
> the larger argument: one could write architecture-dependent C code, link it
> in as a separate compilation module, or even include it textually for
> inlining, without resorting to ifdefs.
>>>> The point of conditionals in C code is that you can have *one* C
>>>> implementation, above the arch/ code, which, given the proper definitions
>>>> (the whole two of them - which you never touch again, unless you are
>>>> adding a new arch/, which is not something that happens very often),
>>>> covers all the architectures.
>>>
>>> Incorrect: you have many C implementations, and which one gets selected
>>> for presentation to the compiler depends on the values of various lexical
>>> symbols in the preprocessor's environment. For this tiny example it may
>>> not matter, but for evidence that this is regularly abused, just have a
>>> look at glibc.... This has been a problem for literally decades:
>>> https://www.usenix.org/legacy/publications/library/proceedings/sa92/spencer.pdf
>>
>> Now think how it was with all the combinations being handled by assembly
>> functions.
>
> Not necessarily. They *can* be C; assembler is often preferable, but not
> strictly necessary.
>
>> If a system has 100 valid combinations, you have to handle those 100
>> combinations.
>
> Ah, but ifdefs don't just cover the *valid* combinations, and that's part
> of the problem with them. Ifdefs let you introduce a tweakable knob that
> creates a decision space much bigger than what's actually needed. If I
> restrict myself only to boolean expressions predicated on the existence or
> absence of a preprocessor symbol, then I have a number of combinations
> that's exponential in the number of terms; for anything non-trivial, that
> gets big fast. But probably only a handful of those combinations are
> actually meaningful, so the set I actually use is much smaller than the
> decision space I've created. A classic problem with preprocessor magic is
> what happens when I tweak the knobs to force a decision that isn't handled
> in the code. This makes things fragile, and really brittle to change.
> On the other hand, if I use separate compilation units, then I can provide
> exactly what I support and nothing more.
>
>> Either you do it with Makefile magic (makefiles which are driven by
>> configs themselves - they are just called $(FOO)), or you do it with C
>> pre-processing magic.
>
> Err, if by makefile magic you mean a directory name in a variable, then I
> guess so.... I think history has shown again and again that that's much
> cleaner than using the preprocessor. Plan 9 ran on a dozen architectures
> without a single #ifdef related to portability.
>
>>> But since the compiler's demonstrably smart enough to figure it out
>>> anyway, I don't see the point in NOT writing the completely portable
>>> code. Rob was right: it just doesn't matter.
>>
>> There seemed to be a disconnect about which compiler would fit that guy's
>> reasoning.
>
> Not really, no. The generated code here has been demonstrated to not make
> any difference.
>
>> Maybe one which, in honor of the pictured Schrödinger, is at the same
>> time both old and new ☺
>
> But can't the same thing be said of the assumptions you are making about
> the hardware? You could apply a single, clean technique and get the same
> performance as something that's specialized to each platform. There's a
> tradeoff there: complexity of implementation vs. optimal performance. But
> without the performance gain (which you don't get in this case) you just
> have more complexity; that's objectively inferior. The *only* argument is
> better code density, but even that's rendered moot in more recent versions
> of the compiler, and I'd have to see a compelling argument showing that it
> actually mattered before accepting more complexity for less object code.
>
> - Dan C.

--
You received this message because you are subscribed to the Google Groups
"Akaros" group. To unsubscribe from this group and stop receiving emails
from it, send an email to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <inttypes.h>

const int ITER = 1000000000;

uint32_t
slow(unsigned char *p)
{
	uint32_t t;

	/* Cast each byte up front so p[3] << 24 can't shift into the sign
	 * bit of a promoted int. */
	t = (uint32_t)p[0] << 0;
	t |= (uint32_t)p[1] << 8;
	t |= (uint32_t)p[2] << 16;
	t |= (uint32_t)p[3] << 24;
	return t;
}

uint32_t
fast(unsigned char *p)
{
	return *(uint32_t *)p;
}

int
main(int argc, char *argv[])
{
	unsigned char *cp;
	uint64_t t = 0;

	if (argc != 3) {
		fprintf(stderr, "usage: bench <data> <fast|slow>\n");
		return EXIT_FAILURE;
	}
	cp = (unsigned char *)argv[1];
	if (strcmp(argv[2], "slow") == 0)
		for (int i = 0; i < ITER; i++) {
			t += slow(cp);
			asm volatile ("" : : : "memory");	/* keep the load in the loop */
		}
	else
		for (int i = 0; i < ITER; i++) {
			t += fast(cp);
			asm volatile ("" : : : "memory");
		}
	printf("t %" PRIu64 "\n", t);
	return 0;
}
bench.s
