On Tue, 26 May 2026 at 21:54, Andriy Dorokhin <[email protected]> wrote: > I'm a newbie and would like to explore whether Boyer-Moore-Horspool > substring searching could improve performance for LIKE queries. > > This topic has come up a few times previously on pgsql-hackers. > As I understand it, the main concerns raised previously were: > * BMH is not universally faster, especially for short patterns
I don't recall the numbers now, but there were certainly regressions from using B-M-H for shorter search patterns. You can see how the overhead of that has been minimised by looking at text_position_setup(). You can see that the skip table array will increasingly share elements between multiple characters that match after applying the skiptablemask as the search length becomes smaller. That was done to save having to initialise the entire skip table array, which was too costly for smaller searches. Unfortunately, I didn't post benchmark results onto the mailing list for this. > * preprocessing costs may outweigh benefits > * LIKE semantics become more complicated when '%' and '_' appear > internally > * rewriting LIKE into strpos() is not always appropriate > * SIMD/vectorized approaches may outperform BMH on modern CPUs It certainly would be good to revisit this. I didn't ever look at SIMD for this purpose. It was 18 years ago that I worked on the B-M-H patch. It's likely overdue for someone to check what we could make better using more modern methods. > Since PostgreSQL already uses a Boyer-Moore-like approach in strpos(), I > did some benchmarks: > https://medium.com/@andreumdorokhinum/does-postgres-need-the-boyer-moore-horspool-search-algorithm-for-like-operator-00b43e4b115c > > Of course, these are only preliminary measurements. > > What should I focus next? I think it would be good to go over which sort of LIKE patterns you could make this work for and invent a design that works for those patterns. For example, we know that we could use it for patterns such as '%word%', but more complex patterns are also possible. For example, '%word1%word2%' could be done. There would need to be an additional check that 'word2' comes after 'word1', or only begin the search after the end of 'word1'. You likely would also want to support leading and trailing searches as it might surprise users if '%word1%word2%' was fast, but if you removed the leading or trailing %, it suddenly fell back to the slow search. So I suspect you'd want to have 3 search types; leading, contains, and trailing, and then generate a search key by preprocessing the search pattern, turning that into something like an array of structs on what the search should do. You'd need to be able to feed search results from previous structs into following structs so that the search for 'word2' knows that it must only start the search at a certain point. I think if you have the ability to start searches at a certain point, then you can maybe do '_' searches too. > * short-pattern regressions? Certainly, you will need to check for this. You'd obviously need to write the code first and then you might have to take measures to mitigate regressions on smaller strings. > * UTF-8 / multibyte behavior? I don't know about that part. I only ever did this for single-byte chars. I believe Heikki did the multi-byte part. I don't think it's unreasonable to start focusing on single-byte first. > At the moment I'm mainly interested in understanding whether the > community believes this is still worth exploring, and if not, what the > primary blockers are likely to be. I believe it's more useful to speed up LIKE than it ever was to speed up POSITION(). You're not going to get any pushback from anyone who thinks "LIKE is fast enough already". The key here is the balance between introduced code complexity and maintainability vs observed performance benefits. For example, if you add 1000 lines of code and we get a 1% speedup, then you'll probably fail to get this accepted. You should aim to make that ratio as favourable as possible. I.e, simple as possible and well-documented code that gives a meaningful performance improvement. If I were doing this today, I'd probably start by figuring out which patterns we can support, then write a benchmarking script which aims to test a decent amount of text and patterns of various lengths. I expect testing text lengths starting with 1 character, then double that in powers of 2 up to maybe 2048. Maybe you can have a small, medium and long search key within each of those and then code up several tests within each. For example, "[no] match, leading", "[no] match within", "[no] match trailing", and then aggregate the results by { search text len, key_len }. You should be able to graph those results showing the performance increase as a percentage (or regression). You'll find having that script very useful as you tweak the patch to iron out regressions for smaller searches. David
