On Mon, Apr 9, 2018 at 1:35 AM, Jason Ekstrand <ja...@jlekstrand.net> wrote:
> Rather lively discussion we've got going here...
> On Sun, Apr 8, 2018 at 3:23 PM, Rob Clark <robdcl...@gmail.com> wrote:
>> On Sun, Apr 8, 2018 at 5:54 PM, Bas Nieuwenhuizen
>> <b...@basnieuwenhuizen.nl> wrote:
>> > On Sun, Apr 8, 2018 at 11:40 PM, Rob Clark <robdcl...@gmail.com> wrote:
>> >> On Sun, Apr 8, 2018 at 5:20 PM, Bas Nieuwenhuizen
>> >> <b...@basnieuwenhuizen.nl> wrote:
>> >>>>>>>>>>> +
>> >>>>>>>>>>> +   /** The mode of the underlying variable */
>> >>>>>>>>>>> +   nir_variable_mode mode;
>> >>>>>>>>>>
>> >>>>>>>>>> In fact, it seems like deref->mode is unused outside of
>> >>>>>>>>>> nir_print and
>> >>>>>>>>>> nir_validate.. for logical addressing we can get the mode from
>> >>>>>>>>>> the
>> >>>>>>>>>> deref_var->var at the start of the chain, and deref->mode has
>> >>>>>>>>>> no
>> >>>>>>>>>> meaning for physical addressing (where the mode comes from the
>> >>>>>>>>>> pointer).
>> >>>>>>>>>>
>> >>>>>>>>>> So maybe just drop deref->mode?
>> >>>>>>>>>
>> >>>>>>>>> Isn't it still useful with logical addressing in case a var is
>> >>>>>>>>> not
>> >>>>>>>>> immediately available? (think VK_KHR_variable_pointers)
> Yes, the idea here is to basically be the NIR equivalent of storage classes.
> For cases where you can chase it all the way back to the variable, this is
> pointless.  For cases where you can't, this can be very useful.
>> >>>>>>>> not sure, maybe this should just also use fat-pointers like
>> >>>>>>>> physical
>> >>>>>>>> addressing does??
> I *think* (without knowing the future with perfect accuracy) that I'd like
> us to start moving away from fat pointers in SPIR-V -> NIR.  They've been
> working ok thus far but we already have had complaints from the radv guys
> that they don't want fat pointers for SLM and I'm not sure there that great
> of a fit for other things.  The big reason is that it's actually fairly
> painful to change from one fat pointer scheme to another if whatever the
> spirv_to_nir authors picked doesn't really fit your hardware.
> I'm not 100% sure what to do but the idea I've got at the moment is
> something like this:
>  1) For every mode (or storage class), the driver provides to spirv_to_nir a
> glsl_type which is the type it wants the pointer represented as.
>  2) Add a deref_to_pointer intrinsic to convert the deref to that type and
> use casts to convert the fat pointer to a deref again
> Why the deref_to_pointer intrinsic?  Because right now some annoying details
> about nir_intrinsic_info force us to have all drefs have a single component.
> If we didn't, then nir_intrinsic_store_var would have two variable-size
> sources which aren't the same size.  We could modify the rules and have a
> concept of a "deref source" and to nir_intrinsic_info and say that a deref
> source can be any size.  That would also work and may actually be a better
> plan.  I'm open to other suggestions as well.
> The key point, however, is (1) where we just let the driver choose the
> storage type of pointers and then they can have whatever lowering pass they
> want.  If that's a "standard" fat-pointers pass in nir_lower_io, so be it.
> If they want to translate directly into LLVM derefs, they can do that too, I
> suppose.
>> >>>>>>>>> Also I could see this being useful in physical addressing too to
>> >>>>>>>>> avoid
>> >>>>>>>>> all passes working with derefs needing to do the constant
>> >>>>>>>>> folding?
> Constant folding is cheap, so I'm not too worried about that.  The bigger
> thing that actual derefs gain us is the ability of the compiler to reason
> about derefs.  Suppose, for instance, that you had the following:
> struct S {
>    float a;
>    float b;
> };
> location(set = 0, binding = 0) Block {
>    S arr[10];
> } foo;
> void
> my_func(int i)
> {
>    foo.arr[i].a = make_a_value();
>    foo.arr[i].b = make_another_value();
>    use_value(foo.arr[i].a);
> }
> If everything is lowered to fat pointers, the compiler can't very easily
> tell that the second line of my_func() didn't stomp foo.arr[i].a and so it
> has to re-load the value in the third line.  With derefs, we can easily see
> that the second line didn't stomp it and just re-use the re-use the result
> of the make_a_value() call.  Yes, you may be able to tell the difference
> between foo.arr[i].a and foo.arr[i].b with some fancy analysis and
> arithmetic tracking but it'd be very painful to do.  I've given this
> substantial thought for almost two years now and haven't actually come up
> with a good way to do it without the information provided by derefs.
> The biggest thing that I think we will gain from deref instructions isn't
> fancy syntax sugar and better nir_printability of pointers.  It also isn't
> the removal of explicit of fat pointers in spirv_to_nir (thought that is
> also nice).  The biggest thing I'm going for is improving NIR's ability to
> optimize UBO, SSBO, and SLM access.  We're pretty good today for local
> variables (though there are a couple of known places for improvements) but
> we're utterly rubbish for anything that leaves the current shader
> invocation.  If we want competent compute performance (which we do!) we need
> to solve that problem.
>> >>>>>>>> The problem is that you don't necessarily know the type at
>> >>>>>>>> compile
>> >>>>>>>> time (and in the case where you do, you need to do constant
>> >>>>>>>> folding to
>> >>>>>>>> figure it out)
>> >>>>>>>
>> >>>>>>> So I have two considerations here
>> >>>>>>>
>> >>>>>>> 1) for vulkan you always know the mode, even when you don't know
>> >>>>>>> the var.
>> >>>>>>> 2)  In CL the mode can still get annotated in the source program
>> >>>>>>> (CL C
>> >>>>>>> non-generic pointers) in cases in which we cannot reasonably
>> >>>>>>> figure it
>> >>>>>>> out with just constant folding. In those cases the mode is extra
>> >>>>>>> information that you really lose.
>> >>>>>>
>> >>>>>> so, even in cl 1.x, you could do things like 'somefxn(foo ?
>> >>>>>> global_ptr
>> >>>>>> : local_ptr)'.. depending on how much we inline all the things,
>> >>>>>> that
>> >>>>>> might not get CF'd away.
>> >>>
>> >>> How does this even work btw? somefxn has a definition, and the
>> >>> definition specifies a mode for the argument right? (which is
>> >>> implicitly __private if the app does not specify anything?)
>> >>
>> >> iirc, the cl spec has an example something along these lines..
>> >>
>> >> it doesn't require *physical* storage for anything where you don't
>> >> know what the ptr type is, however.. so fat ptrs in ssa space works
>> >> out
>> >>
>> >>>>>
>> >>>>> But something like
>> >>>>> __constant int *ptr_value = ...;
>> >>>>> store ptr in complex data structure.
>> >>>>> __constant int* ptr2 = load from complex data structure.
>> >>>>>
>> >>>>> Without explicitly annotating ptr2 it is unlikely that constant
>> >>>>> folding would find that ptr2 is pointing to __constant address
>> >>>>> space.
>> >>>>> Hence removing the modes loses valuable information that you cannot
>> >>>>> get back by constant folding. However, if you have a pointer with
>> >>>>> unknown mode, we could have a special mode (or mode_all?) and you
>> >>>>> can
>> >>>>> use the uvec2 representation in that case?
> I've thought about using a mode of 0 for this or maybe letting you OR all
> the possible modes together since nir_variable_mode is, after all, a
> bitfield.  I don't have any huge preference at the moment.
>> >>>> hmm, I'm not really getting how deref->mode could magically have
>> >>>> information that fatptr.y doesn't have.. if the mode is known, vtn
>> >>>> could stash it in fatptr.y and everyone is happy?  If vtn doesn't
>> >>>> know
>> >>>> this, then I don't see how deref->mode helps..
>> >>>
>> >>> You mean insert it into the fatptr every time deref_cast is called?
>> >>>
>> >>> Wouldn't that blow up the IR size significantly for very little
>> >>> benefit?
>> >>
>> >> in an easy to clean up way, so meh?
>> >
>> > We can't clean it up if we want to keep the information. Also nir is
>> > pretty slow to compile already, so I'd like not to add a significant
>> > number of instruction for very little benefit.
> Really?  I mean, I can believe it, but do you have any actual numbers to
> back this up?  It's considerably faster than some other IRs we have in mesa
> though they are known to be pretty big pigs if we're honest.  I'm very open
> to any suggestions on how to improve compile times.  If NIR is fundamentally
> a pig, we should fix that.
> I don't think compile time should be the deciding factor in whether or not
> we use fat pointers as I doubt it will have a huge effect.  However, how
> much work it takes to get at information is a reasonable argument.
>> I guess I'm failing to see what information you'd loose.. or how there
>> would be a problem..
>> I'm not strictly against having nir_var_unknown as a possible value
>> for deref->mode, but I'm not really seeing how that helps anything,
>> other than adding extra complexity to the IR..
>> >>>>>>
>> >>>>>> I think I'm leaning towards using fat ptrs for the vk case, since I
>> >>>>>> guess that is a case where you could always expect
>> >>>>>> nir_src_as_const_value() to work, to get the variable mode.  If for
>> >>>>>> no
>> >>>>>> other reason than I guess these deref's, if the var is not known,
>> >>>>>> start w/ deref_cast, and it would be ugly for deref_cast to have to
>> >>>>>> work differently for compute vs vk.  But maybe Jason already has
>> >>>>>> some
>> >>>>>> thoughts about it?
> I don't see why they would be different.  Vulkan and CL are starting to
> converge.  In vulkan, you already can't always chase the pointer back to the
> variable VK_KHR_variable_pointers is used.  I don't think that CL is as
> special as you think it is.  The only thing special about CL is the fact
> that global and local are in the same address space.  Beyond that, all the
> problems we have to solve are fundamentally the same.
>> >>>>> I'd like to avoid fat pointers alltogether on AMD since we would not
>> >>>>> use it even for CL. a generic pointer is just a uint64_t for us,
>> >>>>> with
>> >>>>> no bitfield in there for the address space.
> I tend to agree with this.  (See above discussion).
>> >>>>> I think we may need to think a bit more about representation
>> >>>>> however,
>> >>>>> as e.g. for AMD a pointer is typically 64-bits (but we can do e.g.
>> >>>>> 32-bits for known workgroup pointers), the current deref
>> >>>>> instructions
>> >>>>> return 32-bit, and you want something like a uvec2 as pointer
>> >>>>> representation?
>> >>>>
>> >>>> afaiu, newer AMD (and NV) hw can remap shared/private into a single
>> >>>> global address space..  But I guess that is an easy subset of the
>> >>>> harder case where drivers need to use different instructions.. so a
>> >>>> pretty simple lowering pass run before lower_io could remap things
>> >>>> that use fatptrs into something that ignores fatptr.y.  Then opt
>> >>>> passes make fatptr.y go away.  So both AMD and hw that doesn't have a
>> >>>> flat address space are happy.
>> >>>
>> >>> But then you run into other issues, like how are you going to stuff a
>> >>> 64-bit fatptr.x + a ?-bit fatptr.y into a 64-bit value for Physical64
>> >>> addressing? Also this means we have to track to the sources back to
>> >>> the cast/var any time we want to do anything at all with any deref
>> >>> which seems less efficient to me than just stuffing the deref in
>> >>> there.
>> >>
>> >> so fat ptrs only have to exist in ssa space, not be stored to
>> >> something with a physically defined size..
>> >
>> > how does storing __generic pointers work then? those still need the
>> > fat bit for your hw right?
>> for hw that can't map everything into a single flat address space, you
>> *can't* store a fat pointer.. oh well, it doesn't mean that that hw
>> can't implement lesser ocl versions (since this is defn not required
>> for ocl 1.x and I don't *think* it is required for 2.x, but I haven't
>> checked the spec.. I guess if intel supports 2.x then it isn't
>> required..)
> You can still carve up the address space and put local memory at some
> absurdly high address and do
> if (ptr > 0xffffffffffff0000)
>    store_local(ptr & 0xffff)
> else
>    store_global(ptr)

That might be an option if we need to physically store a fat pointer
(but I'm not convinced we need to).  But as a general solution it is a
horrible idea.  You either loose information (since when was TMI a
problem for compilers?) or you get to teach constant folding that even
though it doesn't know the value of 'a' it does know the value of 'a &
0xffffffffffff0000', which is insanity.

And simply not optimizing away the global/local/private turns every
load/store into something that looks like:

   cmps.f.lt p0.x, c4.x, r0.z
   ; delay 6 instructions so branch can see the value in p0.x
   br !p0.x, #3
   ldg.u32 r0.w, r0.y, 1
   jump #11
   (jp)cmps.f.lt p0.x, c4.y, r0.z
   (rpt5)nop   ; delay 6 instruction slots
   br !p0.x, #8
   ldl.u32 r0.w, r0.y, 1
   jump #3
   ; private is really just global with a driver provided buffer
   ; and compiler computed offset
   (jp)add.s r0.y, c10.x   ; add base address of buffer
   (rpt3)nop               ; alu results not immediately avail and no
other instr to fill the empty slots
   add.s r0.y, ... offset calculated from local_id
   (rpt5)nop               ; 6 slots for result from alu avail to mem
   ldg.u32 r0.w, r0.y, 1
   (jp)(sy)... and now we have a result..

which is simply not an option!  And the result is actually worse since
we end up with 64b pointer math lowered to 32b instructions!

tbh, I *really* don't see the problem with fat pointers, but if
someone else doesn't want deref_cast to use fat pointers, then I must
insist on a different intrinsic which does, and a compiler option for
vtn to choose which to emit.


>> >>
>> >> As far as tracking things to the head of the chain of deref
>> >> instructions, that is a matter of a simple helper or two.  Not like
>> >> the chain of deref's is going to be 1000's of instructions..
> Yes and no (mostly no) once you start throwing in phis, selects, and loading
> pointers from variables.  The first two you can explain away by just saying,
> "follow the phi/select sources" but the moment you load a pointer out of a
> variable, you have to do something.  Sure, you could just drop .y and stomp
> it to the known storage class, but that hardly seems ideal.
>> >>> Also, what would the something which ignores fatptr.y be? I'd assume
>> >>> that would be the normal deref based stuff, but requiring fatptr
>> >>> contradicts that?
>> >>
>> >> if you have a flat address space, maybe a pass (or option for
>> >> lower_io) to just convert everything to load/store_global (since
>> >> essentially what these GPUs are doing is remapping shared/private into
>> >> the global address space)
>> >
>> > but I'd like to only do that if we really don't know the mode
>> > statically since it is somewhat slower/less flexible. (also radv is
>> > unlikely to use nir_lower_io for a lot of this stuff since we can
>> > lower derefs into LLVM GEPs directly)
>> I mean, we control all the src code, we can add more options to lower_io.
>> >
>> > Hence if we want the cases where we know the mode statically we need
>> > to not lower the fatptr, but then we have the whole fatptr mess.
>> >
>> well, fatptr mess is unavoidable.. I mostly just fail to see how a
>> more general solution (ie. storing the variable mode in ssa) is worse
>> than having to deal with both cases (in ssa or in instr).  The in ssa
>> case is something that is easy to recover since anything that does
>> pointer math has to split out the .y component and fuse it back in
>> after the pointer math.  (And if you have a flat address space, I
>> guess I'm failing to see why you care about the mode in the first
>> place.)
> Yes, fat pointers are inevitable for some hardware at some stage in the
> compile.  That does not, however, mean that we want fat pointers straight
> out of SPIR-V.
mesa-dev mailing list

Reply via email to