On Fri, Jun 06, 2008 at 10:28:34AM +0200, Jan Hubicka wrote:
> > 
> > ymm0 and xmm0 are the same register. xmm0 is the lower 128bit
> > of xmm0. I am not sure if we need separate XMM registers from
> > YMM registers.
> 
> 
> Yes, I know that xmm0 is lower part of ymm0.  I still think we ought to
> be able to support varargs that do save ymm0 registers only when ymm
> values are passed same way as we touch SSE only when SSE values are
> passed via EAX hint.

Which register do you propose for hint? The current psABI uses RAX
for XMM registers. We can't change it to AL and AH for YMM without
breaking backward compatibility.

> This way we will be able to support e.g. printf that has YMM printing %
> construct but don't need YMM enabled hardware when those are not used.
> 
> This is why I think extending EAX to contain information about amount of
> XMM values to save and in addition YMM values to save is sane.  Then old
> non-YMM aware varargs prologues will crash when YMM values are passed,
> but all other combinations will work.

I don't think it is necessary since -mavx will enable AVX code
generation for all SSE codes. Unless the function only uses integer,
it will crash on non-YMM aware hardware.  That is if there is one
SSE register is used, which is hinted in RAX, varargs prologue will
use AVX instructions to save it. We don't need another hint for AVX
instructions.

> > 
> > >
> > > I personally don't have much preferences over 1. or 2.. 1. seems
> > > relatively easy to implement too, or is packaging two 128bit values to
> > > single 256bit difficult in va_arg expansion?
> > >
> > 
> > Access to 256bit register as lower and upper 128bits needs 2
> > instructions. For store
> > 
> > vmovaps   %xmm7, -143(%rax)
> > vextractf128 $1, %ymm7, -15(%rax)
> > 
> > For load
> > 
> > vmovaps  -143(%rax),%xmm7
> > vinsert128 $1, -15(%rax),%ymm7,%ymm7
> > 
> > If we go beyond 256bit, we need more instructions to access
> > the full register. For 512bit, it will be split into lower 128bit,
> > middle 128bit and upper 256bit. 1024bit will have 4 parts.
> > 
> > For #2, only one instruction will be needed for 256bit and
> > beyond.
> 
> Yes, but we will still save half of stack space.  Well, I don't have
> much preferences here.  If it seems saner to simply save whole thing
> saving lower part twice, I am fine with that.

I was told that it wasn't very easy to get decent performance with
split access. I extended my proposal to include a 16bit bitmask to
indicate which YMM regisetrs should be saved. If the bit is 0,
we should only save the the lower 128bit in the original register
save area. Otherwise, we should only save the same whole YMM register.


H.J.
----
x86-64 psABI defines

typedef struct
{
  unsigned int gp_offset;
  unsigned int fp_offset;
  void *overflow_arg_area;
  void *reg_save_area;
} va_list[1];

for variable argument list. "va_list" is used to access variable argument
list:

void
bar (const char *format, va_list ap)
{
  if (va_arg (ap, int) != 0)
    abort ();
}

void
foo(char *fmt, ...)
{
  va_list ap;
  va_start (fmt, ap); 
  bar (fmt, ap);
  va_end (ap);
}

foo and bar may be compiled with different compilers. We have to keep
the current layout for va_list so that we can mix va_list codes compiled
with AVX and non-AVX compilers. We need to extend the variable argument
handling in the x86-64 psABI to support passing __m256/__m256d/__m256i
on the variable argument list. We propose 2 ways to extend the register
save area to add 256bit AVX registers support:

1. Extend the register save area to put upper 128bit at the end.
  Pros: 
    Aligned access.
    Save stack space if 256bit registers are used. 
  Cons 
    Split access. Require more split access beyond 256bit.

2. Extend the register save area to put full 265bit YMMs at the end.
The first DWORD after the register save area has the offset of the
extended array for YMM registers from the start of the register save
area. The next DWORD has the element size of the extended array.  The
next WORD encodes which YMM registers should be saved.  Unaligned access
will be used.

The             Offset  Register 
original        0       %rdi
register        8       %rsi
save            16      %rdx
area            24      %rcx
                32      %r8
                40      %r9
                48      %xmm0
                64      %xmm1
                ...
                288     %xmm15
Hints           304     320     offset from offset 0.
                308     32      size of element
                312     bitmask for used YMM registers
                314     Unused
Extended        320     %ymm0
array for       352     %ymm1
YMM             ...
registers       800     %ymm15

  Pros: 
    No split access.
    Easily extendable beyond 256bit.
    Limited unaligned access penalty if stack is aligned at 32byte.
  Cons:
    May require store both the lower 128bit and full 256bit register
    content. We may avoid saving the lower 128bit if correct type
    is required when accessing variable argument list, similar to int
    vs. double.
    Waste 272 byte on stack when 256bit registers are used.
    Unaligned load and store.

We should agree on one approach to ensure compatibility between
different compilers.

Reply via email to