Jason Ekstrand <ja...@jlekstrand.net> writes: > On Fri, Apr 3, 2015 at 8:37 AM, Francisco Jerez <curroje...@riseup.net> wrote: >> Jason Ekstrand <ja...@jlekstrand.net> writes: >> >>> On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez <curroje...@riseup.net> >>> wrote: >>>> Jason Ekstrand <ja...@jlekstrand.net> writes: >>>> >>>>> On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez <curroje...@riseup.net> >>>>> wrote: >>>>>> Jason Ekstrand <ja...@jlekstrand.net> writes: >>>>>> >>>>>>> Instead of the complicated and broken-by-design pile of heuristics we >>>>>>> had >>>>>>> before, we now have a straightforward lowering: >>>>>>> >>>>>>> 1) All header sources are copied directly using force_writemask_all >>>>>>> and, >>>>>>> since they are guaranteed to be a single register, there are no >>>>>>> force_sechalf issues. >>>>>>> >>>>>>> 2) All non-header sources are copied using the exact same force_sechalf >>>>>>> and saturate modifiers as the LOAD_PAYLOAD operation itself. >>>>>>> >>>>>>> 3) In order to accommodate older gens that need interleaved colors, >>>>>>> lower_load_payload detects when the destination is a COMPR4 register >>>>>>> and automatically interleaves the non-header sources. The >>>>>>> lower_load_payload pass does the right thing here regardless of >>>>>>> whether >>>>>>> or not the hardware actually supports COMPR4. >>>>>> >>>>>> I had a quick glance at the series and it seems to be going in the right >>>>>> direction. >>>>>> >>>>>> One thing I honestly don't like is the ad-hoc and IMHO premature >>>>>> treatment of payload headers, it still feels like the LOAD_PAYLOAD >>>>>> instruction has more complex semantics than necessary and the benefit is >>>>>> unclear to me. >>>>>> >>>>>> I suppose that your motivation was to avoid setting force_writemask_all >>>>>> in LOAD_PAYLOAD instructions with header. The optimizer should be able >>>>>> to cope with those flags and get rid of them from the resulting moves >>>>>> where they are redundant, and if it's not able to it looks like >>>>>> something that should be fixed anyway. The explicit handling of headers >>>>>> is responsible for much of the churn in this series and is likely to >>>>>> complicate users of LOAD_PAYLOAD and optimization passes that have to >>>>>> manipulate them. >>>>> >>>>> Avoiding force_writemask_all is only half of the motivation and the >>>>> small half at that. A header source, more properly defined, is a >>>>> single physical register that, conceptually, applies to all channels. >>>>> Effectively, a header source (I should have stated this clearly) has >>>>> two properties: >>>>> >>>>> 1) It has force_writemask_all set >>>>> 2) It is exactly one physical hardware register. >>>>> >>>>> This second property is the more important of the two. Most of the >>>>> disaster of the previous LOAD_PAYLOAD implementation was that we did a >>>>> pile of guesswork and had a ill-conceved "effective width" think in >>>>> order to figure out how big the register actually was. Making the >>>>> user specify which sources are header sources eliminates that >>>>> guesswork. It also has the nice side-effect that we can do the right >>>>> force_writemask_all and we can properly handle COMPR4 for the the >>>>> user. > > Ok, Allow me to be a bit more explicit as to what all we need to keep track > of: > > 1) How big is the source register for real. Even immediates can end > up being two registers in the copy. > 2) Do we want force_writemask_all? > 3) If not, do we want force_sechalf? > 4) On g45 and gen5, we want to use COMPR4 for interlaced movs > 5) When lowering, we want to use 16-wide moves when possible in SIMD16 > > With the patch series I sent, all of this is explicit except for > COMPR4 which is, admittedly, kind of magic. "Which of these sources > are headers?" is a reasonable question for the caller to answer. It > knows explicitly and it would take the LOAD_PAYLOAD helper some work > to guess it correctly. Another option would be to guess that based on > exec sizes but then the caller has to know not to pass in the wrong > register type or the guess will be wrong. I like explicit. > I'm not advocating LOAD_PAYLOAD to perform any guesswork, I'm advocating it to be more stupid -- Let it do as much as it can sensibly do with the information it already has, and no more.
>>>> Yeah, true, but this seems like the least orthogonal and most annoying >>>> to use solution for this problem, it forces the caller to provide >>>> redundant information, it takes into account the saturate flag on some >>>> arguments and not others, it shuffles sources with respect to the >>>> specified order when COMPR4 is set, but only for the first four >>>> non-header sources. I think any of the following solutions would be >>>> better-behaved than the current approach: >>> >>> I don't know that saying which sources are headers is really >>> redundant. It's explicit which is what we want. Yes, the COMPR4 >>> thing is a bit magical but we have to do COMPR4 in lower_load_payload >>> so we have to have some way of doing it and this method puts the >>> interleving code in one place instead of two. >>> >> Well, at least with the previous approach LOAD_PAYLOAD had consistent >> (if broken) semantics across its arguments, and regardless of COMPR4 >> being used or not, which IMHO is preferable to the modest code saving. > > To be clear, I don't really like the way I did COMPR4 either. I just > couldn't come up with anything better. > I don't see how the new handling of COMPR4 is more "explicit". It magically changes how the specified registers are laid out in the payload and an optimization pass dealing with LOAD_PAYLOAD must be aware of that. Your previous approach even though not beautiful was transparent in the sense that the rest of the compiler didn't have to care about LOAD_PAYLOAD using COMPR4 or not, because it was nothing more than an implementation detail with no effect on the semantics of the instruction. >>>> 1/ Use the source width to determine the size of each copy. This would >>>> imply that the source width carries semantic information and hence >>>> would have to be left alone by e.g. copy propagation. >>> >>> That's what do now and it's terrible. The "effective width" field was >>> basically a width that gets kept. >>> >> Couldn't you just prevent copy propagation from propagating a source of >> different width into a LOAD_PAYLOAD? You would still be able to get rid >> of the metadata guess and effective_width, but you may have to re-run >> copy propagate after lower_load_payload() for the case you missed any >> optimization oportunities. > > Yes, we could prevent copy propagation from messing up exec sizes. > That solves 1 but 2-5 still require terrible heuristics. Setting > force_writemask_all sort-of solves 2 and 3. Splitting the LOAD_PAYLOAD monster into two separate instructions (proposal 3) solves this too, without setting force_writemask_all. > COMPR4 would have to have a hueristic again which is too complicated. I wouldn't call that a heuristic, the rule is completely deterministic: If 'src[i+4] == offset(src[i], 1)' enable COMPR4 and skip emitting src[i+4] explicitly. > The 16-wide stuff is still a problem. > How so? With solution 1 you emit a 16-wide copy iff the source is 16-wide. With solution 3 you emit a 16-wide copy iff exec_size is 16. With solution 2 you emit a 16-wide copy if exec_size is 16 or, as an additional optimization without effect on the semantics of the instruction, if exec_size is 8, force_writemask_all is set and 'src[i+1] == offset(src[i], 1)'. >>>> 2/ Use the instruction exec size and flags to determine the properties >>>> of *all* copies. This means that if a header is present the exec >>>> size would necessarily have to be 8 and the halves of a 16-wide >>>> register would have to be specified separately, which sounds annoying >>>> at first but in practice wouldn't necessarily be because it could be >>>> handled by the LOAD_PAYLOAD() helper based on the argument widths >>>> without running into problems with optimization passes changing the >>>> meaning of the instruction. The semantics of the instruction itself >>>> would be as stupid as possible, but the implementation could still >>>> trivially recognise 16-wide and COMPR4 copies using the exact same >>>> mechanism you are using now. >>> >>> Yes, that might work. I'll try and take a swing at it today. It will >>> *hopefully* have less code churn than the solution in this series >>> because the magic will still happen, just in a different place. Of >>> course, this solution also requires that we lower everything with >>> force_writemask_all and *hopefully* we can get rid of those and set >>> force_sechalf appropriately in optimization passes. > > Thinking about this some more, I'm not really a fan. As with your > first suggestion, we can force_writemask_all and the heuristic for > COMPR4 is pretty simple. However, it still has the problem of > properly handling SIMD16. > I think I've answered to that already in the last paragraph. >>>> 3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead >>>> simple, say COLLECT and LOAD_HEADER. "COLLECT dst, src0, ..., srcn" >>>> would be equivalent to the LOAD_PAYLOAD instruction described in 2, >>>> but it would only be used to load full-width non-header sources of >>>> the payload, so you would avoid the need to split 16-wide registers >>>> in half. "LOAD_HEADER dst, header, payload" would handle the >>>> asymmetric requirements of prepending a header, like using 8-wide >>>> instead of exec_size-wide copies and setting force_writemask_all. >>>> You could use mlen to specify the size of payload as is usual for >>>> instructions taking a payload source. >>> >>> I thought about doing something like that but it seems more convoluted >>> than would be at all worth it. >>> >> It's not so obvious to me that 3 would necessarily be more difficult >> than getting 2 right, but yeah it would definitely be more effort than 1. > > The problem with this solution is that headers can be different sizes. > Texturing instructions use 1-reg headers but fb writes a header that's > between 0 and 4 registers. Also, we don't want to have to look at the > size of a vgrf to figure out what the instruction does so we have to > have some way of encoding the number of registers read in both > arguments into the instruction. And why is overloading mlen better > than renaming and overloading has_header? > I'm not against renaming header_present to header_size in principle, I suggested using mlen because you'd likely need a way to specify both the size of the header and the size of the payload. There are alternatives like using immediate sources but I don't have a very strong preference. > Swinging back around to a more general discussion... > > 1) Any of the solutions currently on the table will take care of this. > I'm not a big fan of making copy-prop be LOAD_PAYLOAD-aware, but my > solutions (which I'll call 0) and 2 and 3 handle it ok. > > 2+3) My solution gives us this explicitly for free. The other option, > which you keep suggesting, is to force_writemask_all the universe. > This works, but uses more than needed which, at the moment, is ok. > However, if we want to clean it up, optimization passes will need to > be aware of it and it's not clear to me how we would handle it. > It's simple: an instruction with force_writemask_all set reading from a register assigned without it is equivalent to the same instruction without force_writemask_all. Also an instruction with force_writemask_all set writing to a register which is never used as send-from-GRF source or by other instructions with the flag set is equivalent to the same instruction without force_writemask_all. But at this point this all seems like a premature optimization to me. > 4) My solution is, admittedly, a hack but it's a predictable hack. > Also, my hack does work for all of the messages we care about and I > don't think we'll be adding new messages for gen <= 5 so I'm not > terribly concerned about it. Having a heuristic at lowering time > would be nicer because the user can simply say what registers they > want where and lowering does the right thing. However, if we have to > pick between COMPR4 being nice and other things being nice, I'd rather > leave COMPR4 a hack. > I don't see why that's a choice we have to make. You could just leave the old-style handling of COMPR4 alone. > 5) This one is tricky. CSE can generate payload copies that don't get > eaten up by register coalesce. I don't remember exactly how this > happens but I do know that I've seen it in the wild. When this > happens, you can end up with a copy payload that copies an odd number > of registers. If you use the simple heuristic of just use SIMD16 > moves until you can't and then use a SIMD8, you end up moving things > that aren't pairs as pairs. Yeah, of course, with the current approach and my proposal 1, CSE would have to preserve the widths of the sources of the original LOAD_PAYLOAD (or use force_writemask_all), otherwise you end up reading stuff across channels. > Sure, if you set force_writemask_all, the code is correct. However, > it's possible that register coalesce could have cleaned up part of the > payload but now it can't because the moves are off-center. This > problem is *why* I wrote the nasty pile of heuristics that we have > now. None of your suggestions offer a solution to this > Actually this seems less a problem with 2 or 3 because the copies are homogeneous (i.e. they are exec_size-wide). With 2 you'd just split up the temporary in 8-wide sources and assume that the LOAD_PAYLOAD will be lowered to the optimal sequence of SIMD16/SIMD8 copies. With 3 you could even use a LOAD_HEADER with an empty header. In both cases register coalesce would have to learn how to treat those as multiple register copies. > where as having the user specify the header size gives it to you for > free. If you consider "for free" checking if the instruction is a LOAD_PAYLOAD and in that case mimicking its header/payload structure in the new LOAD_PAYLOAD, sure. But it doesn't seem conceptually different to the solution you could have applied now. > All in all, I'm now even more convinced that we want the user of > LOAD_PAYLOAD to just tell us what sources are headers and what aren't. > I think it's also conceptually cleaner because the header sources are > different from the rest of the payload. > --Jason > > P.S. For whatever it's worth, it would also allow us to split > LOAD_PAYLOAD instructions from SIMD16 to SIMD8 correctly by copying > the headers directly and splitting the non-headers in half. I doubt that's something we want to do at the IR level.
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev