> > I guess people will complain soon enough if this causes horrible performance regressions in vectorized code.
Not having looked at your patch in great detail, surely what we don't want is a situation where two constant permutations are converted into one generic permute. Based on a quick read of your patch I couldn't work out whether that can happen. Two constant permutes may well be cheaper than a single generic permute, since the generic permute would likely take more registers and indeed produce a load from the constant pool for the new mask. Have you looked at any examples in that space?

I certainly wouldn't like to see a sequence of interleaves / transposes change into a generic permute operation on Neon, as that would be far more expensive. This surely needs more testing than just this bit before going in.

regards,
Ramana
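
For illustration only, here is a minimal sketch of the concern about merging constant permutes. It is not taken from the patch: the helper names `apply_mask` and `compose`, and the two 8-lane masks, are hypothetical. It shows that two regular masks (an interleave of the low and high halves, then a rotate by three lanes) fold into a single mask with no obvious regular structure, which is the kind of mask a target might have to load from a constant pool.

```python
def apply_mask(v, mask):
    """Single-input permute: result[i] = v[mask[i]]."""
    return [v[m] for m in mask]


def compose(first, second):
    """Return one mask equivalent to applying `first`, then `second`."""
    return [first[m] for m in second]


# Hypothetical 8-lane masks, chosen only for illustration.
INTERLEAVE = [0, 4, 1, 5, 2, 6, 3, 7]  # interleave low half with high half
ROTATE_3 = [3, 4, 5, 6, 7, 0, 1, 2]    # rotate lanes left by three

COMBINED = compose(INTERLEAVE, ROTATE_3)

if __name__ == "__main__":
    v = list(range(10, 18))
    two_step = apply_mask(apply_mask(v, INTERLEAVE), ROTATE_3)
    one_step = apply_mask(v, COMBINED)
    print("combined mask:", COMBINED)
    print("results match:", two_step == one_step)
```

Whether the single combined permute or the two original ones are cheaper depends on the target, so the sketch only shows that the merged mask can lose the regular pattern of its inputs.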