Oops, I forgot to attach the assembly generated by gcc-4.9.2. My bad. And
doubly my bad: I was straight up wrong; the slow version is much slower
(1.2s vs .3s with -Ofast). I had screwed up the conditionals that chose
which branch to run. Well, I gotta admit it if I'm wrong.... But I think
the other points still stand.

On Tue, Dec 1, 2015 at 7:46 AM, Dan Cross <[email protected]> wrote:

> On Tue, Dec 1, 2015 at 12:35 AM, 'Davide Libenzi' via Akaros <
> [email protected]> wrote:
>
>> On Mon, Nov 30, 2015 at 8:07 PM, Dan Cross <[email protected]> wrote:
>>
>>> On Sun, Nov 29, 2015 at 8:04 PM, 'Davide Libenzi' via Akaros <
>>> [email protected]> wrote:
>>>
>>>> On Sun, Nov 29, 2015 at 2:28 PM, Dan Cross <[email protected]> wrote:
>>>>
>>>>> But very often these functions are on the boundary of a program; a
>>>>> place where performance doesn't matter that much.
>>>>>
>>>>
>>>> Agreed, in this case it does not matter much.
>>>> But that code has two issues: performance, and using macros when static
>>>> inlines are clearly the better choice (PBIT32(buf,
>>>> some_side_effect_or_slow_func(..)), anybody?).
>>>>
>>>
>>> So in fairness, that code was originally written well before 'static
>>> inline' was a thing.
>>>
>>
>> They were certainly there in 2012 😉
>> TUESDAY, APRIL 03, 2012
>>
>
> ...but Rob didn't write any macros in that blog post....
>
>
>> On my Mac, using gcc-5.2, I get the exact same assembly code for both
>>> forms. See attached; the .s file was produced by, "gcc-mp-5 -Os -S bench.c".
>>>
>>
>> Yes, *5.2* does that.
>> Barrett posted some code which looked like hell, plus flying sharks
>> armed with RPGs and AKs, while my GCC (4.8.2 - AKA 2014) generated code
>> which simply looked like hell.
>> So, which one is it? Was it meant for old compilers (like above), or was
>> it meant for compilers which showed up in 2015? ☺
>>
>
> Let's run it through gcc-4.9.2 (see attached: gcc-mp-4.9 -std=c11 -fasm
> -Os -S bench.c). Yeah, the assembly is not as pretty, but did you actually
> measure the elapsed runtime? They appear to be about the same to me:
>
> : hurricane; time ./bench abcd fast
> t 1684234849000000000
>
> real 0m0.339s
> user 0m0.333s
> sys 0m0.003s
> : hurricane; time ./bench abcd slow
> t 1684234849000000000
>
> real 0m0.334s
> user 0m0.328s
> sys 0m0.003s
> : hurricane;
>
> Further, you're arguing for a technique based on hardware that didn't make
> this fast until pretty recently (I can't remember when unaligned access
> became fast in x86). Sure, the object code is a bit bigger (4 words instead
> of 2 bytes) so it takes up more space in icache, but for something this
> small, I don't think it matters. Moral: measure, but only when it can be
> shown that it's important.
>
> Even in ARM, ARM64 that is (the only thing that matters, eventually, for
>>>> Akaros), a single load/store is faster than open coding.
>>>> Unaligned accesses faulting (or being slow) are a thing of the past.
>>>> Processors doing that are either dead, or being turned around with new
>>>> silicon versions.
>>>>
>>>
>>> That's a dangerous assumption, and as there's clearly no harm in writing
>>> it the portable way since I get the same output anyway, I don't see a point
>>> in making the assumption.
>>>
>>
>> Note that nobody was trying to push anything which wasn't portable. You
>> came up with the assembly thing.
>>
>
> '*(uint32_t *)p;' isn't portable because of alignment issues (unless you
> can guarantee that p always points to properly aligned data). Sure, you can
> wrap that up in an 'ifdef' so that you don't compile it on a system where
> alignment is important, but the code itself is still inherently unportable.
> ifdef'ing it out or handwaving away platforms where it matters doesn't
> really change that. I'd rather just write one version of the code that's
> portable.
>
> I think the overarching point of Rob's post was that if a programmer feels
>>>>> like s/he needs to write something to deal with endianness of the machine
>>>>> one is on, one's almost certainly going to be wrong.
>>>>>
>>>>
>>>> Really? And who's this guy? Anyone I can recognize here?😀
>>>>
>>>
>>> Rob Pike? No, he's not one of the scientists in that picture (cool
>>> picture by the way). But he is this guy:
>>> https://en.wikipedia.org/wiki/The_Unix_Programming_Environment
>>> https://en.wikipedia.org/wiki/The_Practice_of_Programming
>>>
>>
>> I will always take hard shots at the guys who assume either that "other
>> people will get it wrong", or, along the same lines, that "other people
>> will fail because they failed".
>>
>
> ...but he didn't fail at anything. His point is absolutely correct.
>
> But now one is resorting to tricks with conditional compilation (which is
>>>>> kind of Rob's point), which gives rise to a combinatorial explosion of
>>>>> options that I can't compile and test everywhere.
>>>>>
>>>>
>>>> So instead of LE && FAST_UNALIGNED one has
>>>>> /sys/src/libc/x86_64/accessdata.s that has things like,
>>>>>
>>>>> # gets32le: 32-bit little-endian load
>>>>> .globl gets32le
>>>>>  gets32le:
>>>>>         movl (%rdi),%eax
>>>>>         ret
>>>>>
>>>>> Then I just link against the library that's right for my platform. No
>>>>> #ifdefs in sight.
>>>>>
>>>>
>>>> Then you need to have an assembly file, which is arguably less clear
>>>> than C code, and more likely to be gotten wrong.
>>>>
>>>
>>> What? You have assembly files anyway in most C libraries....
>>>
>>
>> Yes, like, in linux, 8 of them ☺
>>
>
> Check out glibc.
>
> : chandra; find glibc-2.19 -name '*.[Ss]' | wc -l
>     2061
> : chandra;
>
> It is one thing, when tackling a port, to have to code a few boot files in
>> assembly, and quite another to rewrite a bunch of assembly code which could
>> have been written just as efficiently in C.
>>
>
> But more to the point, I think you're fixating on the details and missing
> the larger argument: one could write architecture-dependent C code and link
> it in as a separate compilation unit, or even include it textually for
> inlining, without resorting to ifdefs.
>
> The point of conditionals in C code is that you can have *one* C
>>>> implementation, above the arch/ code, which, given the proper definitions
>>>> (the whole two of them - which you never touch again, unless you are adding
>>>> a new arch/, which is not something that happens very often), covers all the
>>>> architectures.
>>>>
>>>
>>> Incorrect: you have many C implementations, and which one gets selected
>>> for presentation to the compiler depends on the values of various lexical
>>> symbols in the preprocessor's environment. For this tiny example it may not
>>> matter, but for evidence that this is regularly abused, just have a look at
>>> glibc.... This has been a problem for literally decades:
>>> https://www.usenix.org/legacy/publications/library/proceedings/sa92/spencer.pdf
>>>
>>
>> Now think how it was with all the combinations being handled by assembly
>> functions.
>>
>
> Not necessarily. They *can* be C; assembler is often preferable, but not
> strictly necessary.
>
> If a system has 100 valid combinations, you have to handle those 100
>> combinations.
>>
>
> Ah, but ifdefs don't just cover the *valid* combinations, and that's part
> of the problem with them. Ifdefs let you introduce tweakable knobs that
> create a decision space much bigger than what's actually needed. Even if I
> restrict myself to boolean expressions predicated on the existence or
> absence of a preprocessor symbol, I have a number of combinations that's
> exponential in the number of terms; for anything non-trivial, that gets
> big fast. But probably only a handful of those combinations are actually
> meaningful, so the set I actually use is much smaller than the decision
> space I've created. A classic problem with preprocessor magic is what
> happens when I tweak the knobs to force a combination that isn't handled
> in the code. This makes things fragile and brittle to change.
>
> On the other hand, if I use separate compilation units then I can provide
> exactly what I support and nothing more.
>
> Either you do it with Makefile magic (makefiles, which are driven by
>> configs themselves - they are just called $(FOO)), or you do it with C
>> pre-processing magic.
>>
>
> Err, if by makefile magic you mean a directory name in a variable, then I
> guess so.... I think history has shown again and again that that's much
> cleaner than using the preprocessor. Plan 9 ran on a dozen architectures
> without a single #ifdef related to portability.
>
>
>> But since the compiler's demonstrably smart enough to figure it out
>>> anyway, I don't see the point in NOT writing the completely portable code.
>>> Rob was right: it just doesn't matter.
>>>
>>
>> There seemed to be a disconnect about which compiler would fit that guy's
>> reasoning.
>>
>
> Not really, no. The generated code here has been demonstrated to not make
> any difference.
>
>
>> Maybe one which, in honor of the pictured Schrödinger, is at the same
>> time old and new ☺
>>
>
> But can't that same thing be said of the assumptions you are making about
> hardware? You could apply a single, clean technique and get the same
> performance as something that's specialized to each platform. There's a
> tradeoff there: complexity of implementation vs. optimal performance. But
> without the performance gain (which you don't get in this case) you just
> have more complexity; that's objectively inferior. The *only* argument is
> better code density, but even that's rendered moot in more recent versions
> of the compiler, and I'd have to see a compelling argument that it actually
> mattered before accepting more complexity for less object code.
>
>         - Dan C.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <inttypes.h>

const int ITER = 1000000000;

/* Portable byte-at-a-time little-endian load. */
uint32_t
slow(unsigned char *p)
{
	uint32_t t = 0;

	t  = (uint32_t)p[0] <<  0;
	t |= (uint32_t)p[1] <<  8;
	t |= (uint32_t)p[2] << 16;
	t |= (uint32_t)p[3] << 24;

	return t;
}

/* Direct (possibly unaligned) load; assumes a little-endian host. */
uint32_t
fast(unsigned char *p)
{
	return *(uint32_t *)p;
}

int
main(int argc, char *argv[])
{
	unsigned char *cp;
	uint64_t t = 0;

	if (argc != 3) {
		fprintf(stderr, "usage: bench <data> <fast|slow>\n");
		return EXIT_FAILURE;
	}
	cp = (unsigned char *)argv[1];
	if (strcmp(argv[2], "slow") == 0)
		for (int i = 0; i < ITER; i++) {
			t += slow(cp);
			asm volatile ("" : : : "memory");
		}
	else
		for (int i = 0; i < ITER; i++) {
			t += fast(cp);
			asm volatile ("" : : : "memory");
		}
	printf("t %" PRIu64 "\n", t);

	return 0;
}

Attachment: bench.s
Description: Binary data
