I use ... and UTF-32 for most internal processing that I write myself. Let people say UTF-32 is wasteful if they want; I don't tend to store huge amounts of text in memory at once, so the overhead is much less important than one code unit per character.
For performance-critical applications, on the other hand, you need whichever UTF gives you the right balance of speed and average storage size for your data.
If you have very large amounts of data, you'll be sensitive to cache overruns, enough so that UTF-32 may be disqualified from the start. I have encountered systems for which that was true.
If your 'per character' operations are based on parsing for ASCII symbols, e.g. HTML parsing, then both UTF-8 and UTF-16 let you process the data directly, without needing to worry about the longer sequences. For such tasks, some processors may work faster in 32-bit chunks.
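The UTF-8 case can be sketched like this (function name is mine): every byte of a multi-byte UTF-8 sequence has its high bit set, so a plain byte comparison against an ASCII delimiter such as '<' can never match in the middle of a longer sequence, and no decoding is needed.

```c
#include <stddef.h>

/* Find the next '<' in a UTF-8 buffer without decoding anything.
 * Continuation and lead bytes of multi-byte sequences are all >= 0x80,
 * so they can never compare equal to an ASCII delimiter. */
static const char *find_tag_open(const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (s[i] == '<')
            return s + i;
    return NULL;
}
```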
However, many 'inner loop' algorithms, such as copy, can be implemented using native machine words, handling multiple characters, or parts of characters, at once, independent of the UTF.
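The copy case is the simplest illustration: a bulk copy operates on raw bytes (and, inside memcpy, typically whole machine words) without ever locating character boundaries, so the same call serves UTF-8, UTF-16, and UTF-32 buffers alike. A trivial sketch:

```c
#include <string.h>
#include <stddef.h>

/* Copy text as raw bytes; no knowledge of the encoding is needed,
 * because no character boundary ever has to be found. */
static void copy_text(void *dst, const void *src, size_t n_bytes)
{
    memcpy(dst, src, n_bytes);
}
```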
And even in those situations, the savings had better not be offset by cache limitations.
A simplistic model of the 'cost' of UTF-16 over UTF-32 would consider:
1) 1 extra test per character (to see whether it's a surrogate)
2) special handling every 100 to 1000 characters (say 10 instructions)
3) additional cost of accessing 16-bit registers (per character)
4) reduction in cache misses (each the equivalent of many instructions)
5) reduction in disk access (each the equivalent of many, many instructions)
For many operations, e.g. string length, both 1 and 2 are no-ops, so you need to apply a reduction factor based on the mix of operations you actually perform, say 50%-75%.
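When an operation does have to respect character boundaries, items 1 and 2 look like this for UTF-16: one range test per code unit, plus a rare extra step whenever a high surrogate turns up (a sketch; the function name is mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Count code points in a UTF-16 buffer of n code units. */
static size_t utf16_codepoint_count(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count++;
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)  /* item 1: one test per unit */
            i++;                               /* item 2: skip the low half */
    }
    return count;
}
```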
For many processors, item 3 is not an issue.
For 4 and 5, the multiplier is somewhere in the 100s or 1000s per occurrence, depending on the architecture. Their relative weight depends not only on cache sizes, but also on how many other instructions per character are performed. For text-scanning operations over large data sets, their cost does predominate.
Given this little model and some additional assumptions about your own project(s), you should be able to determine the 'nicest' UTF for your own performance-critical case.
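To make the model concrete, here is a toy instantiation; every constant is an assumption of mine, to be replaced with figures measured on your own workload:

```c
/* Items 1 and 2: extra CPU cost of UTF-16 surrogate handling,
 * at one test per character plus ~10 instructions per surrogate pair. */
static double utf16_extra_cpu(double chars, double surrogate_rate)
{
    return chars * 1.0 + chars * surrogate_rate * 10.0;
}

/* Items 4 and 5: extra memory-system cost of UTF-32, which doubles
 * the bytes for BMP-heavy text; assume some extra cache/disk misses
 * per character, each worth many cycles. */
static double utf32_extra_memory(double chars, double miss_cost,
                                 double extra_miss_rate)
{
    return chars * extra_miss_rate * miss_cost;
}
```

With, say, 1M characters, one surrogate per 300 characters, 300-cycle misses, and one extra miss per 16 characters, the memory term dwarfs the surrogate term, which is exactly why large-data scanning tends to disqualify UTF-32.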
A./