Right - we can always use calloc with a little padding and then hand out an aligned address inside the block. Perhaps this is worth benchmarking. It is still likely to be much faster than malloc + memset, because it should have significantly better cache behaviour, even though the zeroing is not free. The question is whether this cost is small enough.
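
For concreteness, the over-allocate-and-round-up idea could look roughly like the C sketch below. This is only an illustration under my own assumptions, not what Julia's allocator does: `aligned_block`, `aligned_calloc`, and `aligned_block_free` are hypothetical names, and the raw pointer is carried alongside the aligned one so that the block can still be freed.

```c
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 16  /* 16 bytes for SSE; 32 would suit AVX */

/* Hypothetical helper type: remembers the pointer calloc returned. */
typedef struct {
    void *raw;      /* what must eventually be passed to free() */
    void *aligned;  /* first ALIGNMENT-byte boundary at or after raw */
} aligned_block;

/* Allocate nbytes of zeroed memory with padding, then round the
 * address up to the next ALIGNMENT-byte boundary. */
static aligned_block aligned_calloc(size_t nbytes)
{
    aligned_block b = { NULL, NULL };
    if (nbytes > SIZE_MAX - (ALIGNMENT - 1))
        return b;  /* padded size would overflow */

    b.raw = calloc(1, nbytes + ALIGNMENT - 1);
    if (b.raw != NULL) {
        uintptr_t p = (uintptr_t)b.raw;
        p = (p + ALIGNMENT - 1) & ~(uintptr_t)(ALIGNMENT - 1);
        b.aligned = (void *)p;
    }
    return b;
}

/* Release the block through the original (unaligned) pointer. */
static void aligned_block_free(aligned_block b)
{
    free(b.raw);
}
```

Whether the zeroing is actually lazy here depends on the platform's calloc (typically only for large requests), which is exactly the part that would need benchmarking.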
-viral

> On 25-Nov-2014, at 9:45 am, Stefan Karpinski wrote:
>
> That's not the point – if you already have memory and have to fill it, then
> you're not in any position for the kernel to lazily zero it, so the alignment
> of arbitrary arrays is irrelevant. The point SGJ was making is that we want
> to allocate the memory using something calloc-like so that the kernel can do
> lazy zeroing for us, but we also need that memory to be 16-byte aligned, but
> there is no portable way to get 16-byte-aligned memory that the kernel will
> lazily zero for you. We can have lazy zeroing or 16-byte alignment but not
> both. This makes me wonder if we couldn't just allocate 15 bytes more than
> necessary and return the first address that is on a 16-byte boundary.
>
> On Mon, Nov 24, 2014 at 11:02 PM, Viral Shah wrote:
> To add to the point, you can also get non-aligned stuff with subarrays or
> results from a ccall.
>
> -viral
>
>
> On Tuesday, November 25, 2014 9:24:36 AM UTC+5:30, Simon Kornblith wrote:
> In general, arrays cannot be assumed to be 16-byte aligned because it's
> always possible to create one that isn't using pointer_to_array. However,
> from Intel's AVX introduction:
>
> Intel® AVX has relaxed some memory alignment requirements, so now Intel AVX
> by default allows unaligned access; however, this access may come at a
> performance slowdown, so the old rule of designing your data to be memory
> aligned is still good practice (16-byte aligned for 128-bit access and
> 32-byte aligned for 256-bit access).
>
> On Monday, November 24, 2014 10:01:45 PM UTC-5, Erik Schnetter wrote:
> On Mon, Nov 24, 2014 at 9:30 PM, Steven G. Johnson wrote:
> > Unfortunately, Julia allocates 16-byte aligned data by default (to help
> > SIMD code), and there is no calloc version of posix_memalign as far as
> > I know.
>
> The generated machine code I've seen does not make use of this. All
> the load/store instructions in vectorized or unrolled loops assume
> unaligned pointers. (Plus, with AVX one should align to 32 bytes
> instead.)
>
> -erik
>
> --
> Erik Schnetter
> http://www.perimeterinstitute.ca/personal/eschnetter/
