Hello.

At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinn...@iki.fi> wrote in <3173d989-bc1c-fc8a-3b69-f24246f73...@iki.fi>
> Attached is a patch to speed up text_position_setup/next(), in some
> common cases with multibyte encodings.
>
> text_position_next() uses the Boyer-Moore-Horspool search algorithm,
> with a skip table. Currently, with a multi-byte encoding, we first
> convert both input strings to arrays of wchars, and run the algorithm
> on those arrays.
>
> This patch removes the mb->wchar conversion, and instead runs the
> single-byte version of the algorithm directly on the inputs, even with
> multi-byte encodings. That avoids the mb->wchar conversion altogether
> when there is no match. Even when there is a match, we don't need to
> convert the whole input string to wchars. It's enough to count the
> characters up to the match, using pg_mblen(). And when
> text_position_setup/next() are used as part of the split_part() or
> replace() functions, we're not actually even interested in the
> character position, so we can skip that too.

Sounds reasonable. A match that falls in the middle of a multibyte
character is simply ignored, so the conversion is indeed useless.
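For illustration, here is a minimal sketch of a byte-oriented
Boyer-Moore-Horspool search. This is my own sketch, not code from the
patch; the function name and shape are made up.

#include <stddef.h>
#include <string.h>

/*
 * Byte-oriented Boyer-Moore-Horspool: returns the byte offset of the
 * first occurrence of needle in haystack, or -1 if there is no match.
 */
static long
bmh_search(const unsigned char *haystack, size_t hlen,
           const unsigned char *needle, size_t nlen)
{
    size_t      skip[256];
    size_t      i;

    if (nlen == 0 || nlen > hlen)
        return -1;

    /* Bytes that don't occur in the needle let us skip its whole length. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;

    /* Bytes that do occur in the needle (except the last) skip less. */
    for (i = 0; i < nlen - 1; i++)
        skip[needle[i]] = nlen - 1 - i;

    /* Align the needle's last byte; on mismatch, advance by the table. */
    for (i = 0; i + nlen <= hlen; i += skip[haystack[i + nlen - 1]])
    {
        if (memcmp(haystack + i, needle, nlen) == 0)
            return (long) i;
    }
    return -1;
}

On a multibyte haystack this can return an offset that points into the
middle of a character, which is exactly the false-positive case
discussed below.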
> Avoiding the large allocation is nice, too. That was actually why I
> started to look into this: a customer complained that text_position
> fails with strings larger than 256 MB.

> Using the byte-oriented search on multibyte strings might return false
> positives, though. To make that work, after finding a match, we verify
> that the match doesn't fall in the middle of a multi-byte
> character. However, as an important special case, that cannot happen
> with UTF-8, because it has the property that the byte sequence of one
> character never appears as part of another character. I think some
> other encodings might have the same property, but I wasn't sure.

Yes. A match can fall in the middle of a character for at least EUC_JP,
where the same byte value can appear at arbitrary positions within a
character. Maybe we can just hard-code that special case to UTF-8 as a
first step. (I'm not sure a second step will ever happen.)

> This is a win in most cases. One case is slower: calling position()
> with a large haystack input, where the match is near the end of the
> string. Counting the characters up to that point is slower than the
> mb->wchar conversion. I think we could avoid that regression too, if
> we had a more optimized function to count characters. Currently, the
> code calls pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(),
> the similar loop through all the characters is a tight loop within the
> function.
>
> Thoughts?

Couldn't we count the characters while searching for a match?
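Something along these lines, perhaps. This is only a rough sketch of
the idea, not code from the patch; CountState and advance_and_count()
are names I made up. It resumes counting from the last known character
boundary using pg_mblen(), and as a side effect verifies that the
candidate match does not begin in the middle of a multibyte character:

#include "postgres.h"
#include "mb/pg_wchar.h"

/* Hypothetical bookkeeping for an incremental character count. */
typedef struct
{
    const char *boundary;   /* last known character boundary */
    int         nchars;     /* number of characters before boundary */
} CountState;

/*
 * Usage: initialize boundary to the start of the haystack and nchars
 * to zero, then call this for each candidate match in ascending order.
 * Returns false if the match begins inside a multibyte character.
 */
static bool
advance_and_count(CountState *state, const char *match)
{
    const char *p = state->boundary;
    int         nchars = state->nchars;

    while (p < match)
    {
        p += pg_mblen(p);       /* byte length of the character at p */
        nchars++;
    }

    state->boundary = p;
    state->nchars = nchars;

    /* If we overshot, the match is not on a character boundary. */
    return (p == match);
}

The bytes between successive matches are then scanned only once,
instead of being re-counted from the beginning of the string for every
match.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center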