On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>  mixed | ascii
> -------+-------
>    447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar to your patch: you load the data 8 bytes at a time, check if
> they are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.
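For readers following along, the scheme described in the quoted text can be sketched roughly as below. This is only an illustration under stated assumptions, not code from the patch or from simdjson: the function name ascii_prefix_len is hypothetical, and the zero-byte test is one well-known bit trick chosen for the example, not necessarily what either implementation does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): return the number of leading
 * bytes in s[0..len) that are non-zero ASCII, checking 8 bytes at a time.
 */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t      i = 0;

    while (len - i >= sizeof(uint64_t))
    {
        uint64_t    chunk;

        /* memcpy() avoids unaligned-access and aliasing problems */
        memcpy(&chunk, s + i, sizeof(chunk));

        /* any high bit set means at least one non-ASCII byte */
        if (chunk & high_bits)
            break;

        /*
         * All bytes are now <= 0x7f, so adding 0x7f to each byte cannot
         * carry into its neighbor. The high bit of a byte ends up set
         * exactly when that byte was non-zero.
         */
        if (((chunk + UINT64_C(0x7f7f7f7f7f7f7f7f)) & high_bits) != high_bits)
            break;

        i += sizeof(chunk);
    }

    /* slow path: finish (or locate the offending byte) one byte at a time */
    while (i < len && s[i] != 0 && s[i] < 0x80)
        i++;

    return i;
}
```

A caller would presumably hand the remainder of the input, starting at the returned offset, to the regular multibyte verification. Dropping the second test in the loop corresponds to the "remove zero-byte check" variant.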
That fallback is very similar to my "inline C" case upthread, and they both
actually check 16 bytes at a time (the comment is wrong in the patch you
shared). I can work back and show how the performance changes with each
difference (just MacOS, clang 10 here):

master

 mixed | ascii
-------+-------
   757 |   366

v1, but using memcpy()

 mixed | ascii
-------+-------
   601 |   129

remove zero-byte check:

 mixed | ascii
-------+-------
   588 |    93

inline ascii fastpath into pg_utf8_verifystr()

 mixed | ascii
-------+-------
   595 |    71

use 16-byte stride

 mixed | ascii
-------+-------
   652 |    49

With this cpu/compiler, v1 is fastest on the mixed input all else being
equal.

Maybe there's a smarter way to check for zeros in C. Or maybe be more
careful about cache -- running memchr() on the whole input first might not
be the best thing to do.

--
John Naylor
EDB: http://www.enterprisedb.com