> I agree with you that doing these kinds of optimizations is a difficult
> task, but I am trying to focus my proposal on emulating branches and
> loops for older hardware that don't have branching instructions rather
> than performing global optimizations on the TGSI code. I don't think
> most of the loop optimizations you listed are even possible on hardware
> without branching instructions.
Yes, that's possible. In fact, if you unroll loops, those optimizations can be done after loop unrolling. This does not however necessarily change things: while you can e.g. avoid loop-invariant code motion, you still need common subexpression elimination to remove the multiple redundant copies of the loop-invariant code generated by unrolling. Also, even loop unrolling needs to find the number of iterations, which at the very least requires simple constant folding, and potentially a whole suite of complex optimizations to work in all possible cases.

Some of the challenges of this were mentioned in a previous thread, as well as LLVM-related issues.

>> (2) Write a LLVM->TGSI backend, restricted to programs without any control
>> flow
>
> I think (2) is probably the closest to what I am proposing, and it is
> something I can take a look at.

Note that this means an _input_ program without control flow, that is, a control flow graph with a single basic block. Once you have more than one basic block, you need to convert the CFG from an arbitrary graph to something made of structured loops and conditionals.

The problem here is that GPUs often use a "SIMT" approach. This means that the GPU internally works like an SSE CPU with vector registers (but often much wider, with up to 32 elements or even more). However, this is hidden from the programmer by putting the variables related to several pixels in the vector, and making you think everything is a scalar or just a 4-component vector.

This works fine as long as there is no control flow; however, when you reach a conditional jump, some pixels may want to take one path and some others another path. The solution is to have an "execution mask" and not write to any pixels not in the execution mask.
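To make the execution-mask idea concrete, here is a minimal sketch in plain Python (all names are illustrative, not from any real driver): both branches run for all lanes, and the per-pixel mask decides which result each lane actually keeps.

```python
# Scalar emulation of SIMT divergent branching (purely illustrative).
# Each list index is one "lane" (pixel) of the hardware vector.

def run_divergent_branch(mask, then_fn, else_fn, values):
    """Execute an if/else for a whole vector of lanes at once."""
    then_results = [then_fn(v) for v in values]  # run under the "then" mask
    else_results = [else_fn(v) for v in values]  # run under the inverted mask
    # Writes to lanes outside the active mask are suppressed, which is
    # equivalent to selecting per lane between the two results:
    return [t if m else e
            for m, t, e in zip(mask, then_results, else_results)]

mask = [True, False, True, False]         # per-pixel branch condition
out = run_divergent_branch(mask,
                           lambda x: x * 2,    # "then" branch
                           lambda x: x + 100,  # "else" branch
                           [1, 2, 3, 4])
print(out)  # [2, 102, 6, 104]
```

Note that when the mask is all-true or all-false, real hardware with branching can skip the dead branch entirely; the mask machinery is only needed when the pixels diverge.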
When an if/else/endif structure is encountered, if the pixels all take the same path, things work like on CPUs; if that is not the case, both branches are executed with the appropriate execution masks, and things continue normally after the endif.

The problem here is that this needs a structured if/else/endif formulation as opposed to arbitrary gotos. However, LLVM and most optimizers work on an arbitrary-goto formulation, which needs to be converted to a structured approach.

The above all applies to GPUs with hardware control flow. However, even without it, you have the same issue of reconstructing if/else/endif blocks, since you need to do basically the same thing in software, using the if condition to choose between results computed by the branches.

Converting a control flow graph to a structured program is always possible, but doing it well requires some thought. In particular, you need to be careful not to break DDX instructions, which operate on a 2x2 block of pixels, and will thus behave differently if some of the other pixels in the block have diverged away due to control flow modifications. This may require making sure control flow optimizations do not duplicate them, and possibly other issues.

Using an ad-hoc optimizer does indeed sidestep the issue, but only as long as you don't try to do non-trivial control flow optimizations or changes. In that case, those may be best expressed on an arbitrary control flow graph (e.g. the issue with converting "continue" to if/end), and at that point you would need to add that logic anyway.

At any rate, I'm not sure whether this is suitable for your GSoC project or not. My impression is that using an existing compiler would prove to be more widely useful and longer lasting, especially considering that we are moving towards applications and hardware with very complex shader support (consider the CUDA/OpenCL shaders and the very generic GPU shading capabilities).
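The DDX caveat mentioned above can be sketched as a toy model (not any real hardware's semantics): DDX approximates a screen-space derivative by subtracting values of horizontally adjacent pixels within a 2x2 quad, so every lane silently depends on a neighbour lane's value.

```python
# Toy model of DDX over a 2x2 quad (illustrative only).
# Lanes are laid out as [top-left, top-right, bottom-left, bottom-right].

def ddx(quad, lane):
    """Horizontal derivative: right pixel minus left pixel of this row."""
    left = lane & ~1            # index of the left pixel in this lane's row
    return quad[left + 1] - quad[left]

quad = [1.0, 4.0, 2.0, 8.0]
print(ddx(quad, 0))  # 3.0  (top row:    4.0 - 1.0)
print(ddx(quad, 3))  # 6.0  (bottom row: 8.0 - 2.0)

# If a control-flow transform duplicates the DDX into a divergent branch,
# the neighbour lane may never have computed its input for that copy:
quad_diverged = [1.0, None, 2.0, 8.0]   # lane 1 took the other path
# ddx(quad_diverged, 0) would now read None -- a garbage derivative.
```

This is why control-flow optimizations must be careful not to move or duplicate derivative instructions across points where lanes of the same quad can diverge.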
An ad-hoc TGSI optimizer will probably prove unsuitable for efficient code generation for, say, scientific applications using OpenCL, and would need to be replaced later. So my personal impression (which could be wrong) is that using an existing optimizer, while possibly requiring a higher initial investment, should have much better payoffs in the long run, by making everything beyond the initial TGSI->LLVM->TGSI work either already done or easier to do.

From a coding perspective, you lose the "design and write everything myself from scratch" aspect, but you gain experience with a complex and real-world compiler, and are able to write more complex optimizations and transforms thanks to a well-developed infrastructure that allows expressing them easily.

Furthermore, using a real compiler would hopefully result in seeing your work produce very good code in all cases, while an ad-hoc optimizer would improve the current situation, but most likely the resulting code would still be blatantly suboptimal. Another advantage would presumably be seeing the work used indefinitely and built upon for projects such as OpenCL/compute shaders support.

It may be more or less time consuming, depending on the level of sophistication of the ad-hoc optimizer.

By the way, it would be interesting to know what people who are working on related things think about this (CCed them). In particular, Zack Rusin has worked extensively with LLVM and, I think, a prototype OpenCL implementation. Also, PathScale is interested in GPU code generation and may contribute something based on Open64 and its IR, WHIRL. However, I'm not sure whether this could work as a general optimizing framework, or instead just as a backend code generator for some drivers (e.g. nv50). In particular, it may be possible to use LLVM to do architecture-independent optimizations and then convert to WHIRL if such a backend is available for the targeted GPU.
BTW, LLVM seems to me superior to Open64 as an easy-to-use framework for flexibly running existing optimization passes and writing your own (due to the unified IR and existing wide adoption for that purpose), so we may want to have it even if an Open64-based GPU backend were to become available; however, I might be wrong on this.

The way I see it, this is a fundamental Mesa/Gallium issue, and should really be solved in a lasting way. See the previous thread for more detailed discussion of the technical issues of an LLVM-based implementation.

Again, I'm not sure whether this is appropriate for this GSoC project, but it seemed quite worthwhile to raise the issue, since if I'm correct, using an existing optimizer (LLVM is the default candidate here) could produce better results and avoid ad-hoc work that would be scrapped later.

I may consider doing this myself, either as a GSoC proposal if still possible, or otherwise, if no one else does it first and time permits (the latter is the major problem here...)

_______________________________________________
Mesa3d-dev mailing list
Mesa3d-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mesa3d-dev