On 13/02/2021 03:31, John Naylor wrote:
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
 >
 > I also tested the fallback implementation from the simdjson library
 > (included in the patch, if you uncomment it in simdjson-glue.c):
 >
 >   mixed | ascii
 > -------+-------
 >     447 |    46
 > (1 row)
 >
 > I think we should at least try to adopt that. At a high level, it looks
 > pretty similar to your patch: you load the data 8 bytes at a time and
 > check if they are all ASCII. If there are any non-ASCII chars, you check
 > the bytes one by one, otherwise you load the next 8 bytes. Your patch
 > should be able to achieve the same performance, if done right. I don't
 > think the simdjson code forbids \0 bytes, so that will add a few cycles,
 > but still.

Attached is a patch that does roughly what simdjson fallback did, except I use straight tests on the bytes and only calculate code points in assertion builds. In the course of doing this, I found that my earlier concerns about putting the ascii check in a static inline function were due to my suboptimal loop implementation. I had assumed that if the chunked ascii check failed, it had to check all those bytes one at a time. As it turns out, that's a waste of the branch predictor. In the v2 patch, we do the chunked ascii check every time we loop. With that, I can also confirm the claim in the Lemire paper that it's better to do the check on 16-byte chunks:
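For readers following along without the patch, the loop shape described above might look roughly like the sketch below. This is not the code from the v2 patch: the names STRIDE, chunk_is_ascii(), utf8_seq_len() and verify_utf8() are made up for illustration, and the byte-wise verifier is a plain textbook UTF-8 check. The point is only that the chunked ASCII test is retried on every iteration, so a single multibyte character does not force the rest of the chunk through the slow path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIDE 16				/* hypothetical name; 16 bytes per chunk */

/* True if the STRIDE bytes at s are all ASCII and none of them is \0. */
static inline bool
chunk_is_ascii(const unsigned char *s)
{
	uint64_t	half1;
	uint64_t	half2;
	const uint64_t highbits = UINT64_C(0x8080808080808080);

	/* memcpy keeps the loads well-defined for any alignment */
	memcpy(&half1, s, sizeof(half1));
	memcpy(&half2, s + sizeof(half1), sizeof(half2));

	/* any high bit set means a non-ASCII byte */
	if ((half1 | half2) & highbits)
		return false;

	/*
	 * All bytes are <= 0x7f here, so adding 0x7f to each byte cannot carry
	 * into its neighbour and sets the high bit exactly when the byte is
	 * non-zero.
	 */
	half1 += UINT64_C(0x7f7f7f7f7f7f7f7f);
	half2 += UINT64_C(0x7f7f7f7f7f7f7f7f);
	return ((half1 & half2) & highbits) == highbits;
}

/* Length of the valid UTF-8 sequence at s, or -1 (straight byte tests). */
static int
utf8_seq_len(const unsigned char *s, size_t len)
{
	unsigned char c = s[0];
	int			n;

	if (c == 0)
		return -1;
	if (c < 0x80)
		return 1;
	if (c >= 0xc2 && c <= 0xdf)
		n = 2;
	else if (c >= 0xe0 && c <= 0xef)
		n = 3;
	else if (c >= 0xf0 && c <= 0xf4)
		n = 4;
	else
		return -1;
	if ((size_t) n > len)
		return -1;
	for (int i = 1; i < n; i++)
		if ((s[i] & 0xc0) != 0x80)
			return -1;
	/* reject overlong forms, surrogates, and values above U+10FFFF */
	if ((c == 0xe0 && s[1] < 0xa0) || (c == 0xed && s[1] > 0x9f) ||
		(c == 0xf0 && s[1] < 0x90) || (c == 0xf4 && s[1] > 0x8f))
		return -1;
	return n;
}

/* Returns the number of leading bytes of s that are valid UTF-8. */
static size_t
verify_utf8(const unsigned char *s, size_t len)
{
	const unsigned char *start = s;

	while (len > 0)
	{
		int			l;

		/* try the chunked check on every iteration, not once per chunk */
		if (len >= STRIDE && chunk_is_ascii(s))
		{
			s += STRIDE;
			len -= STRIDE;
			continue;
		}

		/* slow path: verify one character, then retry the fast path */
		l = utf8_seq_len(s, len);
		if (l < 0)
			break;
		assert(l >= 1 && (size_t) l <= len);
		s += l;
		len -= l;
	}
	return (size_t) (s - start);
}
```

Changing STRIDE to 8 would also require dropping the second load in chunk_is_ascii(); the sketch hard-codes two 8-byte halves for simplicity.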

(MacOS, Clang 10)

master:

  chinese | mixed | ascii
---------+-------+-------
     1081 |   761 |   366

v2 patch, with 16-byte stride:

  chinese | mixed | ascii
---------+-------+-------
      806 |   474 |    83

patch but with 8-byte stride:

  chinese | mixed | ascii
---------+-------+-------
      792 |   490 |   105

I also included the fast path in all other multibyte encodings, and that is also pretty good performance-wise.

Cool.

It regresses from master on pure multibyte input, but that case is still faster than PG13, which I simulated by reverting 6c5576075b0f9 and b80e10638e3:

I thought the "chinese" numbers above are pure multibyte input, and it seems to do well on that. Where does it regress? In multibyte encodings other than UTF-8? How bad is the regression?

I tested this on my first generation Raspberry Pi (chipmunk). I had to tweak it a bit to make it compile, since the SSE autodetection code was not finished yet. And I used generate_series(1, 1000) instead of generate_series(1, 10000) in the test script (mbverifystr-speed.sql) because this system is so slow.

master:

 mixed | ascii
-------+-------
  1310 |  1041
(1 row)

v2-add-portability-stub-and-new-fallback.patch:

 mixed | ascii
-------+-------
  2979 |   910
(1 row)

I'm guessing that's because the unaligned access in check_ascii() is expensive on this platform.
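To illustrate what avoiding the unaligned access could look like, here is a rough sketch that only does 8-byte loads at 8-byte-aligned addresses and handles the leading and trailing bytes one at a time. The function names are hypothetical, it is not taken from the patch, it ignores \0 bytes for brevity, and I have not measured whether this shape actually helps on this machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if p is aligned for an 8-byte load. */
static inline bool
is_aligned_8(const unsigned char *p)
{
	return ((uintptr_t) p % sizeof(uint64_t)) == 0;
}

/*
 * Count the leading bytes of s that have the high bit clear, doing
 * word-sized loads only at aligned addresses.
 */
static size_t
count_leading_ascii(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	/* byte-wise prologue until the pointer is aligned */
	while (i < len && !is_aligned_8(s + i))
	{
		if (s[i] & 0x80)
			return i;
		i++;
	}

	/* aligned 8-byte chunks */
	while (len - i >= sizeof(uint64_t))
	{
		uint64_t	chunk;

		memcpy(&chunk, s + i, sizeof(chunk));
		if (chunk & UINT64_C(0x8080808080808080))
			break;
		i += sizeof(chunk);
	}

	/* byte-wise tail, or the chunk that contained a non-ASCII byte */
	while (i < len && !(s[i] & 0x80))
		i++;
	return i;
}
```

The extra prologue costs a few branches per call, so for short strings it could easily be slower than the unaligned version on platforms where unaligned loads are cheap.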

- Heikki

