On Wednesday, 4 November 2015 at 01:14:31 UTC, Nicholas Wilson wrote:
Note that there are two different alignments:
  - to control padding between instances on the stack (arrays)
  - to control padding between members of a struct

align(64) // between instances (arrays)
struct foo
{
    align(16) short baz;  // between members
    align(1)  float quux;
}

Your 2.5x speedup is due to aligned vs. unaligned loads and stores, which has a really big effect for SIMD-type work. Basically, misaligned accesses are really slow. IIRC there was a blog post (or paper?) about someone on a microcontroller spending a vast amount of time in ONE misaligned integer assignment, because it caused traps and got the kernel involved. It's not quite as bad on x86, but still worth watching out for.
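(If anyone wants to poke at this locally, here is a minimal sketch of checking alignment in D. `isAligned` is a made-up helper, not a Phobos function, and exactly how the compiler honors `align(64)` for stack variables can vary.)

```d
// Sketch: checking what alignment a pointer actually has.
// `isAligned` is a hypothetical helper, not from the standard library.
bool isAligned(size_t alignment)(const(void)* p)
{
    return (cast(size_t) p & (alignment - 1)) == 0;
}

align(64) struct foo
{
    align(16) short baz;
    align(1)  float quux;
}

void main()
{
    import std.stdio : writeln;
    foo f;
    writeln(foo.alignof);           // declared alignment between instances
    writeln(isAligned!16(&f.baz));  // did this member land on 16 bytes?
}
```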

As to a less hacky solution, I'm not sure there is one.

Thanks for the reply. I did some more checking around and found that it was not really an alignment problem; it was caused by using the default init value of my type.

My starting type:
align(64) struct Phys
{
   float x, y, z, w;
   //More stuff.
} //Was 64 bytes in size at the time.

The above worked fine; it was fast and all. But after a while I wanted the data in a different format, so I started decoding positions and the other variables into separate arrays.

Something like this:
align(16) struct Pos { float x, y, z, w; }
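For context, the split looked roughly like this (all the names besides Pos and Phys are made up for illustration):

```d
// Array-of-structs: everything interleaved in one array.
align(64) struct Phys { float x, y, z, w; /* more stuff */ }
Phys[] objects;

// Struct-of-arrays: each decoded field gets its own array.
align(16) struct Pos { float x, y, z, w; }
Pos[] positions;     // decoded positions
float[] velocities;  // other decoded variables, one array each
```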

Counter to my limited knowledge of how CPUs work, this was much slower. Doing the same thing lots of times, touching less memory, with fewer branches, should in theory at least not be slower, right? So after I ruled out bottlenecks in the parser, I assumed there was an alignment problem, and I did my Aligner hack. This made the code run faster, so I assumed that was the cause... Naive! (There was a typo in the code I submitted to begin with: I used a = Align!(T).init and not a.value = T.init.)

The performance problem was actually caused by the line t = T.init, whether it was aligned or not. I solved the problem by changing the struct to look like this:
align(16) struct Pos
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

Basically, T.init gets explicit values. But this should be the same Pos.init as the default Pos.init, since float members default to float.nan anyway. So I really fail to understand how this could fix the problem. I guessed the compiler generates slightly different code if I do it this way, and that this slightly different code avoids some bottleneck in the CPU. But when I took a look at the assembly of the function, I could not find any difference in the generated code...
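One quick sanity check is that the two definitions really do produce a bit-identical Pos.init. A sketch that compares the raw bytes of two default-initialized instances:

```d
align(16) struct PosA { float x, y, z, w; }  // implicit init (float.nan)
align(16) struct PosB                        // explicit init
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void main()
{
    import std.stdio : writeln;
    PosA a;  // default-initialized to PosA.init
    PosB b;  // default-initialized to PosB.init
    auto rawA = (cast(const(ubyte)*) &a)[0 .. PosA.sizeof];
    auto rawB = (cast(const(ubyte)*) &b)[0 .. PosB.sizeof];
    writeln(rawA == rawB);  // both should be four float.nan patterns
}
```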

I don't really know where to go from here to figure out the underlying cause. Does anyone have any suggestions?
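In case it helps anyone reproduce it, this is the kind of minimal loop I'd time next, once with the implicit-init Pos and once with the explicit-init one (a sketch using std.datetime.stopwatch on newer compilers; the count is arbitrary):

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;

align(16) struct Pos { float x, y, z, w; }

void main()
{
    enum N = 10_000_000;
    auto buf = new Pos[](N);

    auto sw = StopWatch(AutoStart.yes);
    foreach (ref t; buf)
        t = Pos.init;  // the suspect assignment
    sw.stop();
    writeln(sw.peek.total!"msecs", " ms");
}
```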
