Hi all,

I've run up against an RTL generation problem for which I'm struggling to come up with a good solution.

TL;DR: I need a way to express "the highpart of every lane in a vector", and "(subreg:V64SI (reg:V64DI 123) 4)" is *not* it.

I can insert temporary values and stick them together with unspecs, but that's ugly and doesn't give optimal register allocation.

Any help or suggestions would be appreciated.

Thanks

Andrew



Long version:

I have a scalar insn that looks like this:

  (set (subreg:SI (reg:DI 123) 4)
       (whatever:SI))

So it's trying to write to the highpart of a DImode register. This is a common thing because this architecture (amdgcn) forms DImode values from pairs of SImode registers. After reload this will be a simple register assignment.
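To make the scalar case concrete, here is a small Python model (not GCC code) of what the set above is meant to do, assuming a little-endian layout so that byte offset 4 is the high part. The function name set_highpart_di is made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def set_highpart_di(di_value, si_value):
    """Model of (set (subreg:SI (reg:DI) 4) si_value), little-endian assumed.

    The upper 32 bits are replaced; the lower 32 bits are preserved.
    """
    low = di_value & MASK32
    return ((si_value & MASK32) << 32) | low

old = 0x1111111122222222
new = set_highpart_di(old, 0xAABBCCDD)
print(hex(new))  # 0xaabbccdd22222222
```

Only the high half changes, which is why a plain register assignment to the second register of the pair is enough after reload.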

And I want to convert it to a vector instruction that does exactly the same thing, but 64 times in parallel:

  (set (subreg:V64SI (reg:V64DI 123) 4)
       (whatever:V64SI))

On amdgcn, V64DImode values are formed from pairs of V64SImode registers, such that all the low parts are in one register, and all the high parts are in the next register, so again, after reload this will be a simple register assignment.

Except it isn't, because the SUBREG actually means a group of 64 consecutive SImode elements starting at byte 4 (the high part of lane 0), which is not what I want. The compiler ends up writing the whole vector to the stack, and reloading some of it in an entirely broken bitwise reinterpretation of the data.
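The mismatch can be seen in a small Python sketch. It assumes the V64DI value is laid out lane by lane in little-endian memory order, which is how the byte-offset SUBREG views it, and compares that view with the per-lane high parts I actually want. The helper names are invented for the example.

```python
import struct

LANES = 64

def pack_v64di(lanes):
    """Lay out 64 DImode lanes consecutively, little-endian."""
    return struct.pack("<%dQ" % LANES, *lanes)

def subreg_v64si(v64di_bytes, byte_offset):
    """What a byte-offset subreg selects: 64 consecutive SImode elements."""
    chunk = v64di_bytes[byte_offset:byte_offset + 4 * LANES]
    return list(struct.unpack("<%dI" % LANES, chunk))

def lane_highparts(lanes):
    """What is wanted: the high 32 bits of every lane."""
    return [(x >> 32) & 0xFFFFFFFF for x in lanes]

# Lane i has high part 0x1000+i and low part 0x2000+i.
lanes = [((0x1000 + i) << 32) | (0x2000 + i) for i in range(LANES)]

got = subreg_v64si(pack_v64di(lanes), 4)
want = lane_highparts(lanes)

print([hex(x) for x in got[:4]])   # ['0x1000', '0x2001', '0x1001', '0x2002']
print([hex(x) for x in want[:4]])  # ['0x1000', '0x1001', '0x1002', '0x1003']
```

In this model the subreg view alternates high and low parts and stops halfway through the vector (its last element is the low part of lane 32), whereas the wanted value is one high part per lane.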

I have a temporary solution that looks like this:

  (set (reg:V64SI 789)
       (whatever:V64SI))

  (set (reg:V64DI 123)
       (unspec:V64DI
          [(reg:V64DI 123)
           (reg:V64SI 789)
           (const_int 1)]
          UNSPEC_VSUBREG))

This is logically correct, but there's really no way to ensure that the temporary register is allocated in the high part position (and amdgcn does require that the two parts be consecutive), so I end up with extra moves, plus of course the inscrutable UNSPEC. (It's easier for the low part, I think, where the hardreg number will match, but I didn't try to optimize that either.)
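For reference, the semantics I intend for the unspec can be modelled roughly as follows. This is a Python sketch of the intended meaning only, not of anything in the backend; the function name and the convention that part 0 means low and part 1 means high are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def vsubreg_insert(v64di, v64si, part):
    """Replace one 32-bit half of every DImode lane.

    part == 0 replaces the low halves, part == 1 the high halves
    (assumed convention). The other half of each lane is kept.
    """
    if len(v64di) != len(v64si):
        raise ValueError("lane counts differ")
    if part not in (0, 1):
        raise ValueError("part must be 0 or 1")
    shift = 32 * part
    keep = ~(MASK32 << shift) & MASK64
    return [(d & keep) | ((s & MASK32) << shift)
            for d, s in zip(v64di, v64si)]

old = [i for i in range(64)]              # low part i, high part 0
new_high = [0xA0 + i for i in range(64)]  # contents of the temporary
result = vsubreg_insert(old, new_high, 1)
print(hex(result[3]))  # 0xa300000003
```

The operation itself is trivial; the cost is that the temporary is a separate pseudo, so nothing ties it to the second register of the pair.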

Another solution might be to use a PARALLEL, to list the scalar operation for every lane individually, but that's just getting silly, and I don't think the compiler will actually make any more sense of it.

I think, ideally, there'd be a "VSUBREG" operator that works on vector lanes identically to how SUBREG does for scalars, or maybe the mode of the regular SUBREG could be configurable per architecture, but adding that will mean touching a lot of things and probably isn't practical.

Is there already a way to represent what I need so that LRA will substitute the high part register as I need it?

Thanks

Andrew
