neilconway commented on issue #20294:
URL: https://github.com/apache/datafusion/issues/20294#issuecomment-3886061938

   • The previous ASCII fast path avoids `string::find` because it is slow. 
Indeed, it is, but mainly because it has a lot of per-call 
overhead. The previous fast-path algorithm (slide a window using `[T]::windows` 
and check each window for an exact match) has no per-call overhead, but 
the algorithm itself is slow.
   • `memchr::memchr` is considerably faster because of SIMD and other chicanery. 
It also has very low per-call overhead (unlike `string::find` or 
`memmem::find`).
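
To make the comparison concrete, here is a minimal sketch of the two search shapes discussed above: the sliding-window scan, and a scan for the needle's first byte followed by a check of the remaining bytes. The function names are hypothetical, and `Iterator::position` stands in for `memchr::memchr` so that the sketch has no dependencies; it is not the code from the actual change, and it says nothing about how that change uses `memchr` internally.

```rust
/// Sliding-window search: compare every window of `needle.len()` bytes.
/// (Hypothetical helper illustrating the previous fast path's approach.)
fn find_windows(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` panics, so handle the empty needle up front.
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// First-byte scan plus verification.
/// (Hypothetical helper; the `position` call is a stand-in for
/// `memchr::memchr(first, &haystack[start..])`.)
fn find_first_byte(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let (&first, rest) = match needle.split_first() {
        Some(parts) => parts,
        None => return Some(0), // empty needle matches at offset 0
    };
    let mut start = 0;
    while start + needle.len() <= haystack.len() {
        // Jump to the next occurrence of the needle's first byte.
        let pos = start + haystack[start..].iter().position(|&b| b == first)?;
        let tail = pos + 1;
        // Verify that the rest of the needle follows the candidate byte.
        if haystack.len() - tail >= rest.len() && &haystack[tail..tail + rest.len()] == rest {
            return Some(pos);
        }
        start = pos + 1;
    }
    None
}
```

Both helpers return the same byte offsets; the difference is only in how many positions need a full comparison, which is where a vectorized byte scan would pay off.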
   
   ### Possible Further Improvements
   
   • For the common case that the needle is a scalar, we could explore building 
a `memmem::Finder` for the needle once and then applying it to the haystack.
   • For UTF-8 inputs, we could still use `memchr` or a similar byte-based 
search, but once we find a match, we'd need to iterate over `char_indices` to 
find the right character (not byte) offset. This would be a clear win when the 
needle isn't found in the haystack, but the net performance change for 
successful matches would be less clear.
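
As a rough illustration of the second idea, the sketch below does a byte-based search and then walks `char_indices` to turn the byte offset into a character position. The function name `strpos_utf8` is hypothetical, `str::find` stands in for a `memchr`-based search, and the return convention (1-based position, 0 when not found) is assumed to follow SQL `strpos`.

```rust
/// Hypothetical sketch: byte search first, then byte-offset to
/// character-offset conversion.
fn strpos_utf8(haystack: &str, needle: &str) -> usize {
    // Byte-based search; `str::find` is a stand-in for a memchr-style scan.
    let byte_pos = match haystack.find(needle) {
        Some(pos) => pos,
        // No match: no character counting is needed at all.
        None => return 0,
    };
    // Count the characters that start before the match. This is the extra
    // work that successful matches pay for on non-ASCII input.
    let chars_before = haystack
        .char_indices()
        .take_while(|&(i, _)| i < byte_pos)
        .count();
    chars_before + 1 // assumed 1-based result
}
```

The not-found path returns without touching `char_indices`, which is why that case would be a clear win. For the scalar-needle idea, the `memchr` crate's `memmem::Finder::new(needle)` could be built once and its `find` method called per haystack, though that is not shown here.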


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

