On Mon, May 4, 2020 at 5:09 PM Marek Olšák <mar...@gmail.com> wrote:
>
> 16-bit varyings only make sense if they are packed, i.e. we need to fit 2
> 16-bit 4D varyings into 1 vec4 slot to save memory for IO. Without that, AMD
> (and most others?) won't benefit from 16-bit IO much.
>
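For readers following along, the packing described above (two 16-bit 4D varyings sharing one vec4 slot) can be sketched in C. This is only an illustration, not Mesa or driver code: the `vec4_slot` type, the function names, and the choice of putting the first varying in the low halves are all assumptions, and the inputs are treated as raw fp16 bit patterns.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one vec4 IO slot: four 32-bit components. */
typedef struct {
    uint32_t w[4];
} vec4_slot;

static_assert(sizeof(vec4_slot) == 16, "one vec4 slot is 128 bits");

/* Pack two 4-component 16-bit varyings into a single slot.
 * Assumed layout: varying `a` in the low 16 bits of each component,
 * varying `b` in the high 16 bits. Values are raw fp16 bit patterns. */
static vec4_slot pack_2x16bit_vec4(const uint16_t a[4], const uint16_t b[4])
{
    vec4_slot s;
    for (int i = 0; i < 4; i++)
        s.w[i] = (uint32_t)a[i] | ((uint32_t)b[i] << 16);
    return s;
}

/* Inverse of pack_2x16bit_vec4: recover both 16-bit varyings. */
static void unpack_2x16bit_vec4(vec4_slot s, uint16_t a[4], uint16_t b[4])
{
    for (int i = 0; i < 4; i++) {
        a[i] = (uint16_t)(s.w[i] & 0xffffu);
        b[i] = (uint16_t)(s.w[i] >> 16);
    }
}
```

Without packing, the same two varyings would occupy two slots (256 bits); packed, they occupy one, which is the IO memory saving being discussed.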
I guess for !flat varyings that mostly makes sense if you are manually
interpolating in the fs?  We can, but don't have to, and it doesn't seem
like a benefit to do so.  Maybe it would be a win for flat varyings, but
that is unclear; we might win more from switching to the instruction we use
for interpolated varyings instead of the one that bypasses interpolation.
At least that seems to be what blob does on new gens.

> 16-bit uniforms would help everybody, because there is potential for uniform
> packing, saving memory (and cache lines).
>

it does mean futzing w/ uniforms before uploading.. I'm not sure (for us)
that is a win vs just using the hw builtin automagic fp32->fp16
push-constant conversion.. the push constant upload is pipelined with draws
afaict for newer gens, and from the shader standpoint, other than the
restrictions about which instructions can use const src and when, they are
basically free to load.. ie. loading cN.m as hcN.m is free.  So this might
also want to be a driver option?

> The other items are just for eliminating conversion instructions. We must
> have more vectorized 16-bit vec2 instructions than "conversion instructions +
> vec2 packing instructions" for mediump to pay off. We also don't get
> decreased register usage if we are not vectorized, so mediump is a tough sell
> at the moment.

we don't really have "vectorized fp16".. we have a sort of "vectorish" mode
where a scalar instruction can repeat, incrementing the dst register and
optionally incrementing individual src registers (ie. we can do .yyy or .yzw
swizzles but not others).  That is orthogonal to fp16 (but there may be
lower latency for fp16) and mostly seems to help by reducing the latency to
load src registers (since hw can load a non-incremented src register once
for each of the scalar instructions packed together).  Scalar 16b
instructions might be a win, but it is a bit more complicated to tease out
the instruction cycles vs the register load cost.
balancing register pressure vs "vectorish" instructions is a thing I'm still
working on.

But ignoring that, fp16 is a win for us because of register pressure.. ie. a
full-reg conflicts with two half-regs.  For sure, a lot of the gain involves
avoiding excessive conversions, but in a lot of common cases we can fold the
conversion into the alu instruction in the backend..

BR,
-R

>
> Marek
>
> On Mon, May 4, 2020 at 7:03 PM Rob Clark <robdcl...@gmail.com> wrote:
>>
>> On Mon, May 4, 2020 at 11:44 AM Marek Olšák <mar...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > This is the status of mediump support in Mesa. What I listed is what AMD
>> > GPUs can do. "Yes" means what Mesa supports.
>> >
>> > Feature                                              FP16 support  Int16 support
>> > ALU                                                  Yes           No
>> > Uniforms                                             No            No
>> > VS in                                                No            No
>> > VS out / FS in                                       No            No
>> > FS out                                               No            No
>> > TCS, TES, GS out / in                                No            No
>> > Sampler coordinates (only coord, derivs, lod, bias;  No            ---
>> >   not offset and compare)
>> > Image coordinates                                    ---           No
>> > Return value from samplers (incl. sampler buffers)   Yes           No
>> > Return value from image loads (incl. image buffers)  No            No
>> > Data source for image stores (incl. image buffers)   No            No
>> > If 16-bit sampler/image instructions are surrounded  No            No
>> >   by conversions, promote them to 32 bits
>> >
>> > Please let me know if you don't see the table correctly.
>> >
>> > I'd like to know if I can enable some of them using the existing FP16 CAP.
>> > The only drivers supporting FP16 are currently Freedreno and Panfrost.
>> >
>>
>> I think in general it should be ok.
>>
>> I think for ir3 we want 32b inputs/outputs for geom stages
>> (vs/hs/ds/gs).  For frag outs we use nir_lower_mediump_outputs.. maybe
>> this is a good approach to continue, to use a simple nir lowering pass
>> for cases where a shader stage can directly take 16b input/output.
>> For frag inputs we fold the narrowing conversion in to the varying
>> fetch instruction in backend.
>>
>> int16 would be pretty useful, for loop counters especially.. these can
>> have a long live-range and currently wastefully occupy a full 32b reg.
>>
>> Uniforms we haven't cared too much about, since we can (usually) read
>> a 32b uniform as a 16b and fold that directly into alu instructions..
>> we handle that in the backend.
>>
>> Pushing mediump support further would be great, and we can definitely
>> help if it ends up needing changes in freedreno backend.  The deqp
>> coverage in CI should give us pretty good confidence about whether or
>> not we are breaking things in the ir3 backend.
>>
>> BR,
>> -R
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev