Gary,

Aside from some minor differences in how the end of the second operand is determined, SRST and SRSTU do pretty much the same thing, and it doesn't take the CPU any longer to compare two bytes than one. So the only explanation I can think of for the difference in performance is that your operands are not equivalently aligned, and the SRSTU case is experiencing cache-miss or page-fault delays that don't occur with SRST.
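To illustrate what I mean by "pretty much the same thing", here is a rough C model of the two searches. It is only a sketch: the function names are made up, the real instructions work through registers and condition codes (including the CPU-determined partial completion), and I am assuming the two-byte search simply stops when fewer than two bytes remain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an SRST-style search: scan [start, end) one byte
   at a time and return the address of the first match, or NULL. */
const uint8_t *search_byte(const uint8_t *start, const uint8_t *end, uint8_t ch)
{
    for (const uint8_t *p = start; p < end; p++) {
        if (*p == ch)
            return p;
    }
    return NULL;
}

/* Hypothetical model of an SRSTU-style search: scan [start, end) two bytes
   at a time, comparing each big-endian two-byte character against ch.
   Assumption: a trailing odd byte is never examined. */
const uint8_t *search_halfword(const uint8_t *start, const uint8_t *end, uint16_t ch)
{
    for (const uint8_t *p = start; end - p >= 2; p += 2) {
        uint16_t cur = (uint16_t)((p[0] << 8) | p[1]);
        if (cur == ch)
            return p;
    }
    return NULL;
}
```

The loops differ only in the step size and the end test, which is why I would look at operand placement rather than the instructions themselves.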
I also agree with Ed Jaffe ... if you have the flexibility to use the vector instructions, string searches can be spiffed up quite a bit.
