On Tue, Dec 1, 2015 at 12:35 AM, 'Davide Libenzi' via Akaros <
[email protected]> wrote:

> On Mon, Nov 30, 2015 at 8:07 PM, Dan Cross <[email protected]> wrote:
>
>> On Sun, Nov 29, 2015 at 8:04 PM, 'Davide Libenzi' via Akaros <
>> [email protected]> wrote:
>>
>>> On Sun, Nov 29, 2015 at 2:28 PM, Dan Cross <[email protected]> wrote:
>>>
>>>> But very often these functions are on the boundary of a program; a
>>>> place where performance doesn't matter that much.
>>>>
>>>
>>> Agreed; in this case it does not matter much.
>>> But that code has two issues: performance, and using macros when static
>>> inlines are clearly the better choice (PBIT32(buf,
>>> some_side_effect_or_slow_func(..)), anybody?).
>>>
>>
>> So in fairness, that code was originally written well before 'static
>> inline' was a thing.
>>
>
> They were certainly here in 2012 😉
> TUESDAY, APRIL 03, 2012
>

...but Rob didn't write any macros in that blog post....


> On my Mac, using gcc-5.2, I get the exact same assembly code for both
>> forms. See attached; the .s file was produced by, "gcc-mp-5 -Os -S bench.c".
>>
>
> Yes, *5.2* does that.
> Barrett showed some code which looked like hell, plus flying sharks
> armed with RPGs and AKs, while my GCC (4.8.2 - AKA 2014) generated code
> which simply looked like hell.
> So, which one is it? Was it meant for old compilers (like above), or was
> it meant for compilers which showed up in 2015? ☺
>

Let's run it through gcc-4.9.2 (see attached: gcc-mp-4.9 -std=c11 -fasm -Os
-S bench.c). Yeah, the assembly is not as pretty, but did you actually
measure the elapsed runtime? They appear to be about the same to me:

: hurricane; time ./bench abcd fast
t 1684234849000000000

real 0m0.339s
user 0m0.333s
sys 0m0.003s
: hurricane; time ./bench abcd slow
t 1684234849000000000

real 0m0.334s
user 0m0.328s
sys 0m0.003s
: hurricane;

Further, you're arguing for a technique based on hardware that didn't make
this fast until pretty recently (I can't remember when unaligned access
became fast on x86). Sure, the object code is a bit bigger (4 words instead
of 2 bytes), so it takes up more space in icache, but for something this
small, I don't think it matters. Moral: measure, and optimize only when it
can be shown to be important.

>>> Even in ARM, ARM64 that is (the only thing that matters, eventually, for
>>> Akaros), a single load/store is faster than open coding.
>>> Unaligned faulting (or sucking) is a thing of the past. Processors
>>> doing that are either dead, or turning around with new silicon versions.
>>>
>>
>> That's a dangerous assumption, and as there's clearly no harm in writing
>> it the portable way since I get the same output anyway, I don't see a point
>> in making the assumption.
>>
>
> Note that nobody was trying to push anything which wasn't portable. You
> came up with the assembly thing.
>

'*(uint32_t *)p;' isn't portable because of alignment issues (unless you
can guarantee that p always points to properly aligned data). Sure, you can
wrap that up in an 'ifdef' so that you don't compile it on a system where
alignment is important, but the code itself is still inherently unportable.
ifdef'ing it out or handwaving away platforms where it matters doesn't
really change that. I'd rather just write one version of the code that's
portable.
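For completeness, there is a middle path worth noting: memcpy has no alignment requirement, and gcc and clang compile a fixed-size memcpy into a single load, so it avoids the alignment (and strict-aliasing) problems of the cast. It does not fix byte order, though: this sketch reads host order, so the byte-by-byte form is still the one to write when the wire format is fixed. The name here is mine, not from the thread.

```c
#include <stdint.h>
#include <string.h>

/* Reads 32 bits at any alignment without undefined behavior; the
 * compiler turns the fixed-size memcpy into one mov on x86-64.
 * Note: this yields HOST byte order, not little-endian per se. */
static inline uint32_t load32_host(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```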

>>>> I think the overarching point of Rob's post was that if a programmer feels
>>>> like s/he needs to write something to deal with endianness of the machine
>>>> one is on, one's almost certainly going to be wrong.
>>>>
>>>
>>> Really? And who's this guy? Anyone I can recognize here?😀
>>>
>>
>> Rob Pike? No, he's not one of the scientists in that picture (cool
>> picture by the way). But he is this guy:
>> https://en.wikipedia.org/wiki/The_Unix_Programming_Environment
>> https://en.wikipedia.org/wiki/The_Practice_of_Programming
>>
>
> I will always take hard shots at the guys who assume that either
> "other people will get it wrong", or, along the same lines, "other people
> will fail because they failed".
>

...but he didn't fail at anything. His point is absolutely correct.

>>>> But now one is resorting to tricks with conditional compilation (which is
>>>> kind of Rob's point), which gives rise to a combinatorial explosion of
>>>> options that I can't compile and test everywhere.
>>>>
>>>
>>>> So instead of LE && FAST_UNALIGNED one has
>>>> /sys/src/libc/x86_64/accessdata.s that has things like,
>>>>
>>>> .globl gets32le
>>>>  gets32le:
>>>>         movl (%rdi),%eax
>>>>         ret
>>>>
>>>> Then I just link against the library that's right for my platform. No
>>>> #ifdef's in sight.
>>>>
>>>
>>> Then you need to have one assembly file per architecture, which is
>>> arguably less clear than C code, and more likely to get wrong.
>>>
>>
>> What? You have assembly files anyway in most C libraries....
>>
>
> Yes, like, in linux, 8 of them ☺
>

Check out glibc.

: chandra; find glibc-2.19 -name '*.[Ss]' | wc -l
    2061
: chandra;

> One thing is tackling a port having to code a few boot files in assembly,
> and another one is to rewrite a bunch of assembly code, which could have
> been as efficiently written in C.
>

But more to the point, I think you're fixating on the details and missing
the larger argument: one could write architecture-dependent C code and link
it in as a separate compilation module, or even include it textually for
inlining, without resorting to ifdefs.
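To sketch what I mean (file names and layout are hypothetical, and both variants are shown in one listing with suffixed names; in a real tree each would be a file named access.c in its own arch directory, defining the same gets32le):

```c
#include <stdint.h>
#include <string.h>

/* access.h would declare the one interface:
 *     uint32_t gets32le(const uint8_t *p);
 */

/* x86_64/access.c: the machine is little-endian, and a fixed-size
 * memcpy compiles to a single unaligned-safe load. */
uint32_t gets32le_x86_64(const uint8_t *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* generic/access.c: byte-at-a-time, correct on any machine. */
uint32_t gets32le_generic(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

The build picks one directory per target; there are no preprocessor conditionals, and an unsupported combination fails at link time instead of silently compiling the wrong thing.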

>>> The point of conditionals in C code is that you can have *one* C
>>> implementation, above the arch/ code, which, given the proper definitions
>>> (the whole two of them - which you never touch again, unless you are adding
>>> a new arch/, which is not something that happens very often), covers
>>> all the architectures.
>>>
>>
>> Incorrect, you have many C implementations, and which one gets selected
>> for presentation to the compiler depends on the values of various lexical
>> symbols in the preprocessor's environment. For this tiny example it may not
>> matter, but for evidence that this is regularly abused, just have a look at
>> glibc.... This has been a problem for literally decades:
>> https://www.usenix.org/legacy/publications/library/proceedings/sa92/spencer.pdf
>>
>
> Now think how it was with all the combinations being handled by assembly
> functions.
>

Not necessarily. They *can* be C; assembler is often preferable, but not
strictly necessary.

> If a system has 100 valid combinations, you have to handle those 100
> combinations.
>

Ah, but ifdefs don't just cover the *valid* combinations, and that's part
of the problem with them: they let you introduce tweak-able knobs that
create a decision space much bigger than what's actually needed. If I
restrict myself to boolean expressions predicated on the existence or
absence of preprocessor symbols, I get a number of combinations that's
exponential in the number of symbols; for anything non-trivial, that gets
big fast. But probably only a handful of those combinations are actually
meaningful, so the set I actually use is much smaller than the decision
space I've created. A classic problem with preprocessor magic is what
happens when someone tweaks the knobs to force a combination that isn't
handled in the code. This makes things fragile and brittle in the face of
change.

On the other hand, if I use separate compilation units then I can provide
exactly what I support and nothing more.
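To make the knob problem concrete, here's a sketch (the CONFIG_* symbols are hypothetical): two independent knobs create four combinations, only some of which anyone ever builds, and a forced knob setting can quietly select a path the author never exercised.

```c
#include <stdint.h>

/* Two boolean knobs => four combinations, of which perhaps two are
 * ever built or tested.  Defining CONFIG_FAST_UNALIGNED without
 * CONFIG_LE quietly falls through to the generic path -- and if the
 * author forgot that case entirely, the failure mode is anyone's
 * guess. */
#if defined(CONFIG_LE) && defined(CONFIG_FAST_UNALIGNED)
static inline uint32_t gets32le(const uint8_t *p)
{
	return *(const uint32_t *)p;	/* LE + unaligned-tolerant only */
}
#else
static inline uint32_t gets32le(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
#endif
```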

> Either you do it with Makefile magic (makefiles, which are driven by
> configs themselves - they are just called $(FOO)), or you do it with C
> pre-processing magic.
>

Err, if by makefile magic you mean a directory name in a variable, then I
guess so.... I think history has shown again and again that that's much
cleaner than using the preprocessor. Plan 9 ran on a dozen architectures
without a single #ifdef related to portability.
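Concretely, the "directory name in a variable" pattern might look something like this (paths and variable names are hypothetical):

```make
# Pick the per-architecture implementation directory at build time.
# An unsupported target fails immediately (no $(ARCH) directory to
# find sources in) rather than falling into an untested #ifdef branch.
ARCH ?= x86_64

SRCS := main.c $(wildcard $(ARCH)/*.c) $(wildcard $(ARCH)/*.s)
OBJS := $(addsuffix .o,$(basename $(SRCS)))

bench: $(OBJS)
	$(CC) $(LDFLAGS) -o $@ $(OBJS)
```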


> But since the compiler's demonstrably smart enough to figure it out
>> anyway, I don't see the point in NOT writing the completely portable code.
>> Rob was right: it just doesn't matter.
>>
>
> There seemed to be a disconnect about which compiler would fit that guy's
> reasoning.
>

Not really, no. The generated code has been demonstrated to make no
difference here.


> Maybe one, which in honor of the pictured Schrödinger, is at the same time
> old and new ☺
>

But can't that same thing be said of the assumptions you are making about
hardware? You could apply a single, clean technique and get the same
performance as something that's specialized to each platform. There's a
tradeoff there: complexity of implementation vs. optimal performance. But
without the performance gain (which you don't get in this case) you just
have more complexity, and that's objectively inferior. The *only* remaining
argument is better code density, but even that's rendered moot in more
recent versions of the compiler, and I'd have to see a compelling argument
that it actually mattered before accepting more complexity for less object
code.

        - Dan C.

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.
