> I agree with you that doing these kinds of optimizations is a difficult
> task, but I am trying to focus my proposal on emulating branches and
> loops for older hardware that don't have branching instructions rather
> than performing global optimizations on the TGSI code.  I don't think
> most of the loop optimizations you listed are even possible on hardware
> without branching instructions.

Yes, that's possible: in fact, if you unroll the loops, those
optimizations can then be done on the unrolled code.

However, this does not necessarily change things much, since while
you can e.g. avoid loop-invariant code motion, you still need common
subexpression elimination to remove the multiple redundant copies of
the loop-invariant code generated by unrolling.
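
To illustrate with a minimal sketch in plain C (made-up function and
variable names, not TGSI): full unrolling simply duplicates the
loop-invariant expression, and CSE is what removes the redundancy
afterwards.

/* Sketch (plain C, hypothetical names): "base * scale" is the
 * loop-invariant expression that loop-invariant code motion would
 * normally hoist out of the loop. */
float accumulate(const float in[4], float base, float scale)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        sum += base * scale + in[i];
    return sum;
}

/* After full unrolling the invariant expression is just duplicated;
 * common subexpression elimination is still needed to compute it once. */
float accumulate_unrolled(const float in[4], float base, float scale)
{
    float sum = 0.0f;
    sum += base * scale + in[0];
    sum += base * scale + in[1];
    sum += base * scale + in[2];
    sum += base * scale + in[3];
    return sum;
}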

Also, even loop unrolling needs to find the number of iterations,
which at the very least requires simple constant folding, and
potentially a whole suite of complex optimizations to work in all
possible cases.
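
As a rough example (hypothetical C, not real shader code), even this
trivial trip count is only known after the constant expression in the
bound is folded; bounds built from less obvious expressions need
correspondingly more analysis before unrolling is possible at all.

/* Sketch: "tile * levels" must be folded to the constant 8 before the
 * loop can be fully unrolled. */
void scale_block(float dst[8], const float src[8])
{
    const int tile = 4;
    const int levels = 2;
    for (int i = 0; i < tile * levels; i++)  /* needs constant folding */
        dst[i] = src[i] * 0.5f;
}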

Some of the challenges of this, as well as LLVM-related issues, were
mentioned in a previous thread.

>> (2) Write a LLVM->TGSI backend, restricted to programs without any control 
>> flow
>
> I think (2) is probably the closest to what I am proposing, and it is
> something I can take a look at.

Note that this means an _input_ program without control flow, that
is, a control flow graph with a single basic block.

Once you have more than one basic block, you need to convert the CFG
from an arbitrary graph to something made of structured loops and
conditionals.

The problem here is that GPUs often use a "SIMT" approach.
This means that the GPU internally works like an SSE CPU with vector
registers (but often much wider, with up to 32 elements or even more).
However, this is hidden from the programmer by putting the variables
related to several pixels in the vector, making you think
everything is a scalar or just a 4-component vector.

This works fine as long as there is no control flow; however when you
reach a conditional jump, some pixels may want to take one path and
some others another path.
The solution is to have an "execution mask" and not write to any
pixels that are not in the execution mask.

When an if/else/endif structure is encountered, if the pixels all
take the same path, things work as on a CPU; if that is not the case,
both branches are executed with the appropriate execution masks, and
things continue normally after the endif.
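
A rough scalar model of this, in plain C with a made-up lane count
(real hardware details differ), might look like the following; note
that when "taken" equals the full mask or is empty, the hardware can
skip the other branch entirely, which is the all-pixels-agree case
above.

#include <stdint.h>

#define LANES 8   /* real hardware often has 32 lanes or more */

/* Sketch of if/else/endif under an execution mask: each branch is
 * executed once, and its writes only land in lanes whose bit is set. */
void simt_if_else(const float x[LANES], float out[LANES])
{
    uint32_t exec  = (1u << LANES) - 1;   /* all lanes active at the if */
    uint32_t taken = 0;

    for (int lane = 0; lane < LANES; lane++)
        if (x[lane] > 0.0f)
            taken |= 1u << lane;          /* lanes taking the then path */

    /* "then" branch, masked to the taken lanes */
    for (int lane = 0; lane < LANES; lane++)
        if (exec & taken & (1u << lane))
            out[lane] = x[lane] * 2.0f;

    /* "else" branch, masked to the remaining lanes */
    for (int lane = 0; lane < LANES; lane++)
        if (exec & ~taken & (1u << lane))
            out[lane] = -x[lane];

    /* "endif": the full mask is restored and execution continues */
}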

The problem here is that this needs a structured if/else/endif
formulation as opposed to arbitrary gotos.

However, LLVM and most optimizers work with an arbitrary-goto formulation,
which needs to be converted to a structured approach.
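
As a small illustration (C, not actual optimizer output), here is the
same trivial computation in the arbitrary-goto form an optimizer's
CFG naturally corresponds to, and in the structured form that maps
onto IF/ELSE/ENDIF; real CFGs are of course much harder to
re-structure than this diamond.

/* Arbitrary-goto form, roughly how a generic optimizer sees it. */
float f_goto(float x)
{
    float r;
    if (x <= 0.0f)
        goto else_block;
    r = x * 2.0f;
    goto endif;
else_block:
    r = -x;
endif:
    return r;
}

/* Structured form that maps directly onto IF/ELSE/ENDIF. */
float f_structured(float x)
{
    float r;
    if (x > 0.0f)
        r = x * 2.0f;
    else
        r = -x;
    return r;
}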

The above all applies to GPUs with hardware control flow.
However, even without it, you have the same issue of reconstructing
if/else/endif blocks, since you basically need to do the same thing
in software, using the if condition to choose between the results
computed by the two branches.
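
For instance (a sketch, not any driver's actual lowering), an if/else
computing either x * 2 or -x could be emulated on such hardware
roughly like this:

/* Branch emulation without hardware control flow: both sides are
 * computed unconditionally and the condition selects the result, much
 * as a CMP/select-based lowering would do. */
float f_selected(float x)
{
    float then_val = x * 2.0f;                 /* "then" result */
    float else_val = -x;                       /* "else" result */
    float cond     = (x > 0.0f) ? 1.0f : 0.0f;
    return cond * then_val + (1.0f - cond) * else_val;
}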

Converting a control flow graph to a structured program is always
possible, but doing it well requires some thought.
In particular, you need to be careful not to break DDX instructions,
which operate on a 2x2 block of pixels and will thus behave
differently if some of the other pixels have diverged due to
control flow modifications.
This may require making sure control flow optimizations do not
duplicate them, and there may be other issues as well.
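
A rough model of why DDX is sensitive to this (plain C, simplified
quad layout; real hardware differs): the derivative needs the value
from a neighbouring pixel of the 2x2 quad, so every pixel of the quad
must have computed that value consistently, whatever branch each
pixel logically took.

/* Sketch: DDX as the horizontal difference within a 2x2 quad laid out
 *   0 1
 *   2 3
 * The result for one lane depends on another lane's value, which is
 * why duplicating or moving DDX across divergent control flow changes
 * its behaviour. */
float ddx_quad(const float quad[4], int lane)
{
    int row = lane & ~1;   /* 0 for lanes 0/1, 2 for lanes 2/3 */
    return quad[row + 1] - quad[row];
}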

Using an ad-hoc optimizer does indeed sidestep the issue, but only as
long as you don't try to do non-trivial control flow optimizations or
changes.
In that case, those may be best expressed on an arbitrary control flow
graph (e.g. the issue with converting "continue" to if/end), and at
this point you would need to add that logic anyway.


At any rate, I'm not sure whether this is suitable for your GSoC project or not.

My impression is that using an existing compiler would prove to be
more widely useful and more long lasting, especially considering that
we are moving towards applications and hardware with very complex
shader support (consider the CUDA/OpenCL shaders and the very generic
GPU shading capabilities).

An ad-hoc TGSI optimizer will probably prove unsuitable for efficient
code generation for, say, scientific applications using OpenCL, and
would need to be later replaced.

So my personal impression (which could be wrong) is that using an
existing optimizer, while possibly requiring a higher initial
investment, should have much better payoffs in the long run, since
everything beyond the initial TGSI->LLVM->TGSI work would be either
already done or easier to do.

From a coding perspective, you lose the "design and write everything
myself from scratch" aspect, but you gain experience with a complex,
real-world compiler, and are able to write more complex optimizations
and transforms thanks to a well-developed infrastructure that lets
you express them easily.

Furthermore, using a real compiler would hopefully result in your
work producing very good code in all cases, while an ad-hoc
optimizer would improve the current situation, but most likely the
resulting code would still be blatantly suboptimal.

Another advantage would presumably be seeing the work used
indefinitely and built upon for projects such as OpenCL/compute
shader support.

It may be more or less time consuming, depending on the level of
sophistication of the ad-hoc optimizer.

By the way, it would be interesting to know what people who are
working on related things think about this (CCed them).
In particular, Zack Rusin has worked extensively with LLVM and I think
a prototype OpenCL implementation.
Also, PathScale is interested in GPU code generation and may
contribute something based on Open64 and its IR, WHIRL.
However, I'm not sure whether this could work as a general optimizing
framework, or instead just as a backend code generator for some
drivers (e.g. nv50).
In particular, it may be possible to use LLVM to do architecture
independent optimizations and then convert to WHIRL if such a backend
is available for the targeted GPU.
BTW, LLVM seems to me superior to Open64 as an easy-to-use framework
for flexibly running existing optimization passes and writing your own
(due to the unified IR and its existing wide adoption for this
purpose), so we may want to have it even if an Open64-based GPU
backend were to become available; however, I might be wrong on this.
The way I see it, it is a fundamental Mesa/Gallium issue, and should
really be solved in a lasting way.

See the previous thread for more detailed discussion of the technical
issues of an LLVM-based implementation.

Again, not sure whether this is appropriate for this GSoC project, but
it seemed quite worthwhile to raise this issue, since if I'm correct,
using an existing optimizer (LLVM is the default candidate here) could
produce better results and avoid ad-hoc work that would be scrapped
later.

I may consider doing this myself, either as a GSoC proposal if that
is still possible, or otherwise, if no one else does it first and
time permits (the latter being the major problem here...).
