Giuseppe Bilotta <giuseppe.bilo...@gmail.com> writes:

> Hello,
>
> On Fri, Jun 5, 2015 at 2:22 PM, Francisco Jerez <curroje...@riseup.net> wrote:
>> Giuseppe Bilotta <giuseppe.bilo...@gmail.com> writes:
>>>
>>> Ok, scratch that. I was confused by the fact that Beignet reports a
>>> preferred work-group size multiple of 16. Intel IGPs support a
>>> _logical_ SIMD width of up to 32, but the _hardware_ SIMD width is
>>> just 4. So the question is whether we should report the _hardware_
>>> width here, or the maximum _logical_ width.
>>>
>> The physical SIMD width of any Intel GPU that ILO supports is, as far
>> as I'm aware, 8; however, the hardware can execute 16- and in some
>> cases 32-wide instructions by splitting them internally into
>> instructions of the native SIMD width.
>
> Well, according to the Gen7.5 and 8 manuals I found on Intel's site,
> it's actually 4, although with 2 FPUs. If the FPUs can execute
> different (and independent) instructions, then the "lower SIMD limit"
> would be 4, not 8, although in practice each execution unit has 8 PEs
> available.
>
That sounds roughly correct, but AFAIK before Gen8 there was only one
real FPU per EU; the other "pipe" was the special function unit, which
could also process some normal arithmetic instructions. That allowed,
in some (fairly restricted) cases, 8-wide execution of a single
instruction, or partial 4-wide execution of two instructions from
different threads at the same time. In Gen8 what used to be the math
pipe can in addition process general FPU instructions, allowing 8-wide
execution in more situations.

In any case you are unlikely to get close to full utilization of the EU
by doing 4-wide only, not only because of the cases you miss in which
you could issue a single instruction 8-wide, but because of the fixed
per-instruction overhead, which is (at least) 2 cycles regardless of
whether you are doing 4- or 8-wide -- we definitely don't want to
encourage applications to use a work-group size of four, because it's
inefficient.

> [snip]
>
>> As this cap is just a performance hint, I think it makes sense to
>> assume the best-case scenario, as Grigori has done. If the driver
>> later on decides it doesn't pay off to use the maximum SIMD width it
>> can always use less, but using more may be difficult if the
>> application didn't keep it in mind while choosing the workgroup
>> layout.
>
> OTOH, at least in OpenCL, this cap wouldn't be used 'raw' as a
> performance hint, since the actual value returned (the
> PREFERRED_WORK_GROUP_SIZE_MULTIPLE) is a kernel property rather than a
> device property, so it may be tuned at kernel compilation time,
> according to effective work-item SIMD usage.

At least the way it's implemented in this series, it's a per-device
property, and even though I see your point that it might be useful to
have a finer-grained value in some cases, I don't think it's worth
doing unless there is evidence that the unlikely over-alignment of the
work-group size will actually hurt performance for some application --
and there isn't at this point, because ILO doesn't currently support
OpenCL AFAIK.

> In this sense I think the cap itself should be a 'lower limit',
> i.e. the value under which the kernel simply cannot fully utilize the
> hardware.
>

Yeah, and a kernel using less than SIMD16 will most likely be unable to
fully utilize the hardware due to the pipeline stalls and issue
overhead, so I think it's the lower limit you're looking for.

> IOW, I believe that if a larger group size than the physical SIMD
> width is needed for a specific kernel to fully utilize the hardware,
> this should be handled higher up in the stack, not at the level of
> this cap,

I don't think it can be handled higher up than in the pipe driver; only
the pipe driver has the hardware-specific knowledge required to answer
the question of which is the best work-group multiple for some specific
device.

> since the value here is going to be manipulated _anyway_
> (e.g. a kernel written for float16 might even end up recommending a
> work-group size multiple of 1, as an extreme example).
>

On Intel hardware a kernel using 16-vectors would typically be run
scalarized with one SIMD channel per logical thread, so the driver
would still want the work-group size to be a multiple of 16.

> --
> Giuseppe "Oblomov" Bilotta
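For reference, this is roughly how the value ends up being consumed on
the application side, and why over-reporting is cheaper than
under-reporting: a host-side sketch (not code from this series; the
function name and all parameters are hypothetical) of an OpenCL
application querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and
rounding its work-group layout to it.

    #include <CL/cl.h>

    /* Hedged sketch: pick a 1D local size that is a multiple of the
     * kernel's preferred work-group size multiple, then pad the global
     * size so it divides evenly.  The kernel itself is expected to
     * bounds-check any extra work-items. */
    static cl_int
    enqueue_rounded(cl_command_queue queue, cl_kernel kernel,
                    cl_device_id device, size_t global_size)
    {
       size_t multiple = 1, max_wg = 1;

       clGetKernelWorkGroupInfo(kernel, device,
                                CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                sizeof(multiple), &multiple, NULL);
       clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                sizeof(max_wg), &max_wg, NULL);

       /* Largest multiple of the preferred value the kernel still allows. */
       size_t local = (max_wg / multiple) * multiple;
       if (local == 0)
          local = max_wg;

       /* Round the global size up to the next multiple of the local size. */
       size_t padded = ((global_size + local - 1) / local) * local;

       return clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &padded, &local,
                                     0, NULL, NULL);
    }

If the driver reports 16 here but later decides to run the kernel
SIMD8, a layout chosen this way still divides evenly; the reverse
(reporting 8 and later wanting 16) is what leaves the application with
an awkward work-group shape, which is the asymmetry being discussed
above.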