2016-04-10 3:34 GMT+03:00 David Guillen Fandos <da...@davidgf.net>:
> On 07/04/16 09:09, Ilya Enkovich wrote:
>> 2016-04-07 0:49 GMT+03:00 David Guillen Fandos <da...@davidgf.net>:
>>>
>>> Thanks a lot Ilya!
>>>
>>> I managed to get it working. There were some bugs regarding register
>>> allocation that ended up promoting the class to be BLKmode instead of
>>> V4SFmode. I had to debug it a bit, which is tricky, but in the end I
>>> found my way through it.
>>>
>>> Just to finish this. Do you think from your experience that it is
>>> difficult to implement vector instructions that have variable sizes?
>>
>> Having implemented an instruction in one mode, you shouldn't have much
>> trouble extending it to other modes using mode iterators. There are a
>> lot of examples in GCC.
>>
>>> This
>>> particular VFU has 4, 3, 2 and 1 element operations with arbitrary
>>> swizzling. That is, we can load a V3SF and perform a dot product
>>> operation with another V3SF to get a V1SF for instance. Of course the
>>> elements might overlap, so if a vreg is A B C D we can have a 4 element
>>> vector ABCD or a pair of 3 element vregs ABC and BCD, the same logic
>>> applies to have 3 registers of V2SF type and so forth. It is very
>>> flexible. It also allows column and row arranging, so we can load 4
>>> vectors in a 4x4 matrix and multiply them with another matrix
>>> transposing them on the fly.
>>
>> Unfortunately GCC doesn't expect a vector to have a number of elements
>> that is not a power of two. Thus you can't write
>>
>> float var __attribute__ ((vector_size (12)));
>>
>> and expect it to get V3SF mode.
>>
>>
>> The target instruction set doesn't affect the way vector code is
>> represented in GIMPLE. It means complex instructions like matrix
>> multiplication don't have expressions with corresponding semantics and
>> can't simply be generated out of a single GIMPLE statement.
>>
>> You may still take advantage of your ISA when expanding vector code.
>> E.g. vec_extract_[lo|hi] may be expanded into a simple SUBREG in your
>> case. Advanced vector instructions may be generated by RTL optimizers.
>> E.g. combine may merge a few vector instructions into a single one.
>>
>>>
>>> I guess this is too difficult to expose to gcc, which is more used to
>>> intel SIMD stuff. In the past I wrote most of the kernels in assembly
>>> and wrapped them in C functions, but if you use classes and inline
>>> functions having gcc on your side helps a lot (register allocation and
>>> therefore fewer loads/stores to memory).
>>
>> There are instructions which are never generated by the compiler and
>> exist mostly to be used manually. The AES instruction set is a good
>> example of such instructions. Intrinsics (builtin functions) are a
>> better alternative to assembler code for manually writing vector code
>> with such instructions. Using intrinsics you get register allocation
>> and RTL optimizations working.
>>
>> Ilya
>>
>>>
>>> Thanks a lot for your help!
>>>
>>> David
>>>
>>>
>
> Cool, I wasn't aware of some of the things you mention.
> To be a bit more specific:
>
> - How would you define a template that takes 2 V4SF, calculates the dot
> product and outputs a SF that is a subreg of a V4SF? That is, the
> operation could be any of the four:
>
> r.x = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
>
> or
>
> r.y = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
>
> and so forth.
> The idea would be to tell gcc that a V4SF has 4 SF that it can address
> as subregs and define operations like the dot product one.

You can use vec_select to get the vector elements and compute the sum. Then
you can use vec_concat or vec_merge to build the resulting vector. I would
not expect GCC to generate this instruction automatically, though.

> It's a pain not to have V3SF though...

AVX-512 instructions use masks to operate on parts of a vector, and
vec_merge is used to describe that in patterns. It will probably be easier
to treat a V3SF instruction as a V4SF instruction with a mask applied.

Ilya

>
> Thanks a lot again!
> David
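
To illustrate the "V3SF as a masked V4SF" suggestion at the source level, here is a minimal C sketch using GCC's generic vector extension. It only shows the intended semantics, not a machine-description pattern. The names v4sf, dot4, dot3 and check_dot_examples are made up for this example, and whether the 16-byte type actually maps to V4SFmode depends on the target.

```c
#include <assert.h>

/* Four floats in 16 bytes. vector_size (12) would not give a V3SF type,
   so three-element operands are carried in a four-element vector. */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Full four-lane dot product: a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w. */
float dot4 (v4sf a, v4sf b)
{
  v4sf p = a * b;               /* lane-wise multiply */
  return p[0] + p[1] + p[2] + p[3];
}

/* Three-lane dot product on the same type. Lane 3 is forced to zero
   before the sum, mimicking a merge with zero under lane mask 0b0111. */
float dot3 (v4sf a, v4sf b)
{
  v4sf p = a * b;
  p[3] = 0.0f;                  /* masked-off lane contributes nothing */
  return p[0] + p[1] + p[2] + p[3];
}

/* Hypothetical self-check with small exact values. */
void check_dot_examples (void)
{
  v4sf a = { 1.0f, 2.0f, 3.0f, 4.0f };
  v4sf b = { 5.0f, 6.0f, 7.0f, 8.0f };
  assert (dot4 (a, b) == 70.0f);   /* 5 + 12 + 21 + 32 */
  assert (dot3 (a, b) == 38.0f);   /* 5 + 12 + 21 */
}
```

Wrapping such operations in inline functions or target builtins keeps the values in vector registers under the compiler's control, which is the advantage over hand-written assembly mentioned earlier in the thread.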