> The other option would be to
> work some magic trying to fuse threads into vector ops when we
> see multiple scalar instructions, but this seems overly complex.
My understanding is that this is effectively how much modern shader hardware works[1]. While you have many threads, many (or most) of them are executing the same code fragment. As long as your vector unit has per-lane enable selects, a single instruction stream driving the vector unit effectively executes a set of "scalar" threads in parallel. For example, an if-then-else block is implemented by executing both the "then" and "else" clauses as a single linear block and using the per-lane enable toggles to discard the unwanted results.

It's no coincidence that the functionality exposed by OpenGL/OpenCL fits this technique much better than the arbitrary application code you might find on a general-purpose CPU.

Paul

[1] I'm pretty sure at least ATI and Larrabee do this. Not sure about nVidia.

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
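The masked if-then-else described above can be sketched as a small Python simulation. This is a toy model under assumed parameters (8 lanes, and the hypothetical helpers `vop`, `scalar_thread` and `run_if_then_else`), not a description of any vendor's actual hardware or ISA. Both clauses are issued across every lane, and a per-lane enable mask decides which lanes keep each result.

```python
# Toy model of a SIMD unit with per-lane enable bits.
# Assumption: 8 lanes, each lane standing in for one "scalar" thread
# that runs the same code fragment on its own data.
LANES = 8


def vop(dest, mask, fn, *srcs):
    """Apply fn lane-wise across the source vectors.

    Only lanes whose enable bit is set write the new value; disabled
    lanes keep their old contents, i.e. the result is discarded.
    """
    return [fn(*args) if enabled else old
            for old, enabled, args in zip(dest, mask, zip(*srcs))]


def scalar_thread(x):
    """The per-thread program as a programmer would write it."""
    if x > 4:
        return x * 2
    return x - 1


def run_if_then_else(xs):
    """Run scalar_thread on all lanes at once with one instruction stream."""
    assert len(xs) == LANES
    mask = [x > 4 for x in xs]            # compare sets the per-lane condition
    not_mask = [not m for m in mask]      # inverted mask for the "else" clause
    y = [0] * LANES
    y = vop(y, mask, lambda x: x * 2, xs)      # "then" clause, issued to all lanes
    y = vop(y, not_mask, lambda x: x - 1, xs)  # "else" clause, issued to all lanes
    return y


if __name__ == "__main__":
    print(run_if_then_else(list(range(LANES))))  # [-1, 0, 1, 2, 3, 10, 12, 14]
```

Because both clauses always run over the whole vector in this sketch, a divergent branch costs the sum of the two clauses; that is the price paid for driving many "scalar" threads from a single instruction stream.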
