Hello.

At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinn...@iki.fi> wrote in <3173d989-bc1c-fc8a-3b69-f24246f73...@iki.fi>
> Attached is a patch to speed up text_position_setup/next(), in some
> common cases with multibyte encodings.
>
> text_position_next() uses the Boyer-Moore-Horspool search algorithm,
> with a skip table. Currently, with a multi-byte encoding, we first
> convert both input strings to arrays of wchars, and run the algorithm
> on those arrays.
>
> This patch removes the mb->wchar conversion, and instead runs the
> single-byte version of the algorithm directly on the inputs, even with
> multi-byte encodings. That avoids the mb->wchar conversion altogether
> when there is no match. Even when there is a match, we don't need to
> convert the whole input string to wchars. It's enough to count the
> characters up to the match, using pg_mblen(). And when
> text_position_setup/next() are used as part of the split_part() or
> replace() functions, we're not actually even interested in the
> character position, so we can skip that too.

Sounds reasonable. A match that falls in the middle of a multibyte
character is simply ignored, so the conversion is indeed useless.
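For illustration, here is a minimal sketch of a byte-oriented
Boyer-Moore-Horspool search. This is my own sketch, not code from the
patch; the function name and shape are made up.

#include <stddef.h>
#include <string.h>

/*
 * Byte-oriented Boyer-Moore-Horspool: returns the byte offset of the
 * first occurrence of needle in haystack, or -1 if there is no match.
 */
static long
bmh_search(const unsigned char *haystack, size_t hlen,
           const unsigned char *needle, size_t nlen)
{
    size_t      skip[256];
    size_t      i;

    if (nlen == 0 || nlen > hlen)
        return -1;

    /* Bytes that don't occur in the needle let us skip its whole length. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;

    /* Bytes that do occur in the needle (except the last) skip less. */
    for (i = 0; i < nlen - 1; i++)
        skip[needle[i]] = nlen - 1 - i;

    /* Align the needle's last byte; on mismatch, advance by the table. */
    for (i = 0; i + nlen <= hlen; i += skip[haystack[i + nlen - 1]])
    {
        if (memcmp(haystack + i, needle, nlen) == 0)
            return (long) i;
    }
    return -1;
}

On a multibyte haystack this can return an offset that points into the
middle of a character, which is exactly the false-positive case
discussed below.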
> Avoiding the large allocation is nice, too. That was actually why I
> started to look into this: a customer complained that text_position
> fails with strings larger than 256 MB.

> Using the byte-oriented search on multibyte strings might return false
> positives, though. To make that work, after finding a match, we verify
> that the match doesn't fall in the middle of a multi-byte
> character. However, as an important special case, that cannot happen
> with UTF-8, because it has the property that the byte sequence of one
> character never appears as part of another character. I think some
> other encodings might have the same property, but I wasn't sure.

Yes. A match can fall in the middle of a character for at least EUC_JP,
where the same byte value can appear at arbitrary positions within a
character. Maybe we can just hard-code that special case to UTF-8 as a
first step. (I'm not sure a second step will ever happen.)

> This is a win in most cases. One case is slower: calling position()
> with a large haystack input, where the match is near the end of the
> string. Counting the characters up to that point is slower than the
> mb->wchar conversion. I think we could avoid that regression too, if
> we had a more optimized function to count characters. Currently, the
> code calls pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(),
> the similar loop through all the characters is a tight loop within the
> function.
>
> Thoughts?

Couldn't we count the characters while searching for a match?
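Something along these lines, perhaps. This is only a rough sketch of
the idea, not code from the patch; CountState and advance_and_count()
are names I made up. It resumes counting from the last known character
boundary using pg_mblen(), and as a side effect verifies that the
candidate match does not begin in the middle of a multibyte character:

#include "postgres.h"
#include "mb/pg_wchar.h"

/* Hypothetical bookkeeping for an incremental character count. */
typedef struct
{
    const char *boundary;   /* last known character boundary */
    int         nchars;     /* number of characters before boundary */
} CountState;

/*
 * Usage: initialize boundary to the start of the haystack and nchars
 * to zero, then call this for each candidate match in ascending order.
 * Returns false if the match begins inside a multibyte character.
 */
static bool
advance_and_count(CountState *state, const char *match)
{
    const char *p = state->boundary;
    int         nchars = state->nchars;

    while (p < match)
    {
        p += pg_mblen(p);       /* byte length of the character at p */
        nchars++;
    }

    state->boundary = p;
    state->nchars = nchars;

    /* If we overshot, the match is not on a character boundary. */
    return (p == match);
}

The bytes between successive matches are then scanned only once,
instead of being re-counted from the beginning of the string for every
match.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center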