In perl.git, the branch smoke-me/khw-dfa has been created <https://perl5.git.perl.org/perl.git/commitdiff/4495efbf31f351740b82500ee9017c954506c7aa?hp=0000000000000000000000000000000000000000>
        at  4495efbf31f351740b82500ee9017c954506c7aa (commit)

- Log -----------------------------------------------------------------
commit 4495efbf31f351740b82500ee9017c954506c7aa
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Jul 1 16:00:41 2018 -0600

    Make isC9_STRICT_UTF8_CHAR() an inline dfa

    This replaces a complicated trie with a dfa.  This should cut down the
    number of conditionals encountered in parsing many code points.

commit de2aa591cbecf9ede3ec9a270efdfef9641926d2
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Jun 27 22:01:53 2018 -0600

    Make isSTRICT_UTF8_CHAR() an inline function

    It was a macro that used a trie.  This changes to use the dfa
    constructed in previous commits.  I didn't bother with taking
    measurements.  A dfa should have fewer conditionals for many code
    points.

commit 9d19b998b8703ead8851838cde6fb568f3df4e53
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Jun 27 21:52:47 2018 -0600

    Inline dfa for translating from UTF-8

    This commit inlines the simple portion of the dfa that translates from
    UTF-8 to code points, used in functions like utf8_to_uvchr_buf.

    This dfa has been changed in previous commits so that it is small, and
    punts on any problematic input, plus 18% of the Hangul syllable code
    points.  (These still come out faster than blead.)  The smallness
    allows it to be inlined, adding <2000 total bytes to the perl text
    space.

    The inlined part never calls anything that needs thread context, so
    that parameter can be removed.  I decided to remove it also from the
    Perl_utf8_to_uvchr_buf() and Perl_utf8n_to_uvchr_error() functions.
    There is a small risk that someone is actually using those functions
    instead of the documented macros utf8_to_uvchr_buf() and
    utf8n_to_uvchr_error().  If so, this can be added back in.

    Perl_utf8_to_uvchr_msgs() is entirely removed, but the macro
    utf8_to_uvchr_msgs() which is the normal interface to it is retained
    unchanged, and it is marked as unstable anyway.
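For readers unfamiliar with the technique, here is a minimal sketch of a table-driven dfa that decodes one UTF-8 character and punts (returns 0) on anything it does not accept. It is not the table from these commits: the names byte_class(), next_state, lead_mask and decode_one() are hypothetical, the sketch rejects only malformed input, overlongs, surrogates and code points above U+10FFFF (it does not single out non-characters), and it classifies bytes with a chain of comparisons where a real implementation would presumably use a 256-entry lookup table.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte classes: every input byte falls into exactly one of these. */
enum { CL_ASCII, CL_80_8F, CL_90_9F, CL_A0_BF, CL_ILL, CL_LEAD2,
       CL_E0, CL_LEAD3, CL_ED, CL_F0, CL_LEAD4, CL_F4, N_CLASSES };

/* States.  ST_REJECT is 0 so that every table entry left unspecified
 * below defaults to "reject".  ST_ACCEPT doubles as the start state. */
enum { ST_REJECT, ST_ACCEPT, ST_TAIL1, ST_TAIL2, ST_TAIL3,
       ST_E0, ST_ED, ST_F0, ST_F4, N_STATES };

/* Hypothetical helper; a lookup table would avoid these branches. */
static int byte_class(uint8_t b)
{
    if (b < 0x80) return CL_ASCII;
    if (b < 0x90) return CL_80_8F;
    if (b < 0xA0) return CL_90_9F;
    if (b < 0xC0) return CL_A0_BF;
    if (b < 0xC2) return CL_ILL;      /* C0, C1: always overlong */
    if (b < 0xE0) return CL_LEAD2;
    if (b == 0xE0) return CL_E0;      /* next byte must be A0..BF */
    if (b == 0xED) return CL_ED;      /* next byte must be 80..9F */
    if (b < 0xF0) return CL_LEAD3;
    if (b == 0xF0) return CL_F0;      /* next byte must be 90..BF */
    if (b < 0xF4) return CL_LEAD4;
    if (b == 0xF4) return CL_F4;      /* next byte must be 80..8F */
    return CL_ILL;                    /* F5..FF */
}

static const uint8_t next_state[N_STATES][N_CLASSES] = {
    [ST_ACCEPT] = { [CL_ASCII] = ST_ACCEPT, [CL_LEAD2] = ST_TAIL1,
                    [CL_E0] = ST_E0, [CL_LEAD3] = ST_TAIL2,
                    [CL_ED] = ST_ED, [CL_F0] = ST_F0,
                    [CL_LEAD4] = ST_TAIL3, [CL_F4] = ST_F4 },
    [ST_TAIL1] = { [CL_80_8F] = ST_ACCEPT, [CL_90_9F] = ST_ACCEPT,
                   [CL_A0_BF] = ST_ACCEPT },
    [ST_TAIL2] = { [CL_80_8F] = ST_TAIL1, [CL_90_9F] = ST_TAIL1,
                   [CL_A0_BF] = ST_TAIL1 },
    [ST_TAIL3] = { [CL_80_8F] = ST_TAIL2, [CL_90_9F] = ST_TAIL2,
                   [CL_A0_BF] = ST_TAIL2 },
    [ST_E0] = { [CL_A0_BF] = ST_TAIL1 },                       /* no overlongs  */
    [ST_ED] = { [CL_80_8F] = ST_TAIL1, [CL_90_9F] = ST_TAIL1 },/* no surrogates */
    [ST_F0] = { [CL_90_9F] = ST_TAIL2, [CL_A0_BF] = ST_TAIL2 },/* no overlongs  */
    [ST_F4] = { [CL_80_8F] = ST_TAIL2 },                       /* <= U+10FFFF   */
};

/* Payload bits carried by a start byte, indexed by its class. */
static const uint8_t lead_mask[N_CLASSES] = {
    [CL_ASCII] = 0x7F, [CL_LEAD2] = 0x1F,
    [CL_E0] = 0x0F, [CL_LEAD3] = 0x0F, [CL_ED] = 0x0F,
    [CL_F0] = 0x07, [CL_LEAD4] = 0x07, [CL_F4] = 0x07,
};

/* Decodes one character from [s, end).  Returns the number of bytes
 * consumed and stores the code point in *cp, or returns 0 if the dfa
 * rejects or the input ends mid-character; a caller would then fall
 * back to a slower, more thorough routine. */
static size_t decode_one(const uint8_t *s, const uint8_t *end, uint32_t *cp)
{
    const uint8_t *p = s;
    unsigned state = ST_ACCEPT;
    uint32_t uv = 0;

    while (p < end) {
        int cl = byte_class(*p);
        if (state == ST_ACCEPT)
            uv = (uint32_t)(*p & lead_mask[cl]);      /* start byte   */
        else
            uv = (uv << 6) | (uint32_t)(*p & 0x3F);   /* continuation */
        state = next_state[state][cl];
        p++;
        if (state == ST_ACCEPT) { *cp = uv; return (size_t)(p - s); }
        if (state == ST_REJECT) return 0;
    }
    return 0;
}
```

Each input byte costs one classification and one table lookup, and the only branches in the loop are the accept and reject tests, which is the sense in which a dfa can need fewer conditionals than a trie of nested comparisons.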
    This change decreases the number of conditional branches in the Perl
    statement

        my $a = ord("\x{foo}")

    where foo is a non-problematic code point by about 11%, except for
    ASCII characters, where it is 4%, and those Hangul syllables mentioned
    above, where it is 7%.

    Problematic code points fare much worse here than in blead.  These are
    the surrogates, non-characters, and non-Unicode code points.  We don't
    care very much about the speed of handling these code points, which
    are mostly considered illegal by Unicode anyway.

    The percentage decrease is higher for just the function itself, as the
    measured Perl statement has unchanged overhead.

    Here are the annotated benchmarks:

    Key:
        Ir   Instruction read
        Dr   Data read
        Dw   Data write
        COND conditional branches
        IND  indirect branches
        _m   branch predict miss
        _m1  level 1 cache miss
        _mm  last cache (e.g. L3) miss
        -    indeterminate percentage (e.g. 1/0)

    The numbers represent raw counts per loop iteration.

    translate_utf8_to_uv_007f

        my $a = ord("\x{007f}")

              blead    dfa Ratio %
              -----  ----- -------
        Ir    395.0  370.0   106.8
        Dr    122.0  115.0   106.1
        Dw     71.0   61.0   116.4
        COND   49.0   47.0   104.3
        IND     5.0    5.0   100.0

    In all the measurements, the indirect numbers were all zeros and
    unchanged, and are omitted in this message.

    translate_utf8_to_uv_07ff

        my $a = ord("\x{07ff}")

              blead    dfa Ratio %
              -----  ----- -------
        Ir    438.0  390.0   112.3
        Dr    128.0  118.0   108.5
        Dw     71.0   61.0   116.4
        COND   57.0   51.0   111.8
        IND     5.0    5.0   100.0

    translate_utf8_to_uv_cfff

        my $a = ord("\x{cfff}")

    This is the highest Hangul syllable that gets the full reduction.
              blead    dfa Ratio %
              -----  ----- -------
        Ir    457.0  410.0   111.5
        Dr    131.0  121.0   108.3
        Dw     71.0   61.0   116.4
        COND   61.0   55.0   110.9
        IND     5.0    5.0   100.0

    translate_utf8_to_uv_d000

        my $a = ord("\x{d000}")

    This is the lowest affected Hangul syllable

              blead    dfa Ratio %
              -----  ----- -------
        Ir    457.0  443.0   103.2
        Dr    131.0  132.0    99.2
        Dw     71.0   71.0   100.0
        COND   61.0   57.0   107.0
        IND     5.0    5.0   100.0

    translate_utf8_to_uv_d7ff

        my $a = ord("\x{d7ff}")

    This is the highest affected Hangul syllable

              blead    dfa Ratio %
              -----  ----- -------
        Ir    457.0  443.0   103.2
        Dr    131.0  132.0    99.2
        Dw     71.0   71.0   100.0
        COND   61.0   57.0   107.0
        IND     5.0    5.0   100.0

    translate_utf8_to_uv_d800

        my $a = ord("\x{d800}")

    This is a surrogate, showing much worse performance, but we don't care

              blead    dfa Ratio %
              -----  ----- -------
        Ir    457.0  515.0    88.7
        Dr    131.0  134.0    97.8
        Dw     71.0   73.0    97.3
        COND   61.0   75.0    81.3
        IND     5.0    5.0   100.0

    translate_utf8_to_uv_fdd0

        my $a = ord("\x{fdd0}")

    This is a non-char, showing much worse performance, but we don't care

              blead    dfa Ratio %
              -----  ----- -------
        Ir    457.0  548.0    83.4
        Dr    131.0  139.0    94.2
        Dw     71.0   73.0    97.3
        COND   61.0   81.0    75.3
        IND     5.0    5.0   100.0

    translate_utf8_to_uv_fffd

        my $a = ord("\x{fffd}")

              blead    dfa Ratio %
              -----  ----- -------
        Ir    457.0  410.0   111.5
        Dr    131.0  121.0   108.3
        Dw     71.0   61.0   116.4
        COND   61.0   55.0   110.9
        IND     5.0    5.0   100.0

    translate_utf8_to_uv_ffff

        my $a = ord("\x{ffff}")

    This is another non-char, showing much worse performance, but we don't
    care

              blead    dfa Ratio %
              -----  ----- -------
        Ir    457.0  548.0    83.4
        Dr    131.0  139.0    94.2
        Dw     71.0   73.0    97.3
        COND   61.0   81.0    75.3
        IND     5.0    5.0   100.0

    translate_utf8_to_uv_1fffd

        my $a = ord("\x{1fffd}")

              blead    dfa Ratio %
              -----  ----- -------
        Ir    476.0  430.0   110.7
        Dr    134.0  124.0   108.1
        Dw     71.0   61.0   116.4
        COND   65.0   59.0   110.2
        IND     5.0    5.0   100.0

    translate_utf8_to_uv_10fffd

        my $a = ord("\x{10fffd}")

              blead    dfa Ratio %
              -----  ----- -------
        Ir    476.0  430.0   110.7
        Dr    134.0  124.0   108.1
        Dw     71.0   61.0   116.4
        COND   65.0   59.0   110.2
        IND     5.0    5.0   100.0

    translate_utf8_to_uv_110000

        my $a = ord("\x{110000}")

    This is a non-Unicode code point, showing much worse performance, but
    we don't care

              blead    dfa Ratio %
              -----  ----- -------
        Ir    476.0  544.0    87.5
        Dr    134.0  137.0    97.8
        Dw     71.0   73.0    97.3
        COND   65.0   81.0    80.2
        IND     5.0    5.0   100.0

commit 0637ed2cfc1d49aaafc4db9d57d7721e8ab492a0
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Jun 27 21:28:15 2018 -0600

    utf8.c: Avoid unnecessary work xlating utf8 to uv

    This moves the code for the dfa that does the translation of
    non-problematic characters to earlier in the function to avoid work
    that only needs to be done if the dfa rejects the input.  For example,
    calculating how long the sequence needs to be is no longer done unless
    the dfa rejects.

    Since the dfa always accepts an invariant if the allowed length is
    non-zero, the code that tests for those specifically can be removed.

commit f0c8c77ab9ae7e0a7831cb62fa7593c435c72362
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Jun 27 18:19:54 2018 -0600

    Use strict dfa to translate from UTF-8 to code point

    With this commit, if a sequence passes the dfa, the result can be
    returned immediately.  Previously, some rare, potentially problematic
    sequences could pass and would then need further checking, so that
    checking always had to be done.  So this speeds up the general case.

commit 37de8f890719ccbfbbd6cdd977fd74aadc260c6a
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Jun 27 18:08:12 2018 -0600

    Add dfa for strict translation from UTF-8

commit 3dc960ea30f42e63176f2a6ddecbd7f691099651
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Jul 1 13:48:34 2018 -0600

    Fix outdated docs for isUTF8_char()

    It doesn't accept non-negative code points that don't fit in an IV

commit df02d1b1b97896f2def792f8effb91eef2e3f263
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 25 19:11:46 2018 -0600

    Make isUTF8_char() an inline function

    It was a macro that used a trie.  This changes to use the dfa
    constructed in previous commits.  I didn't bother with taking
    measurements.
    A dfa should require fewer conditionals to be executed for many code
    points.

commit 9aee7ce99074258cdabfd8964a478733b32e4f9d
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 25 17:01:30 2018 -0600

    Extend dfa for translation of UTF-8 to EBCDIC

    This commit changes to use a dfa for translating from UTF-8 on EBCDIC
    platforms.  This makes for fewer #ifdefs, and I realized while I was
    working on the dfa, that it wasn't difficult to do for EBCDIC.

commit b6cae1a151072e56beec689928bbcebdcc9285f1
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 25 16:49:22 2018 -0600

    Make UTF-8 dfa table an EXTCONST

    This will allow it to be used inline.

commit ea54e1eb56150c605498664c3caea864d678dd61
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 25 16:16:04 2018 -0600

    Rename dfa table for UTF-8

    This is in preparation for having additional dfa tables.  This names
    this one to reflect its specific purpose.

commit 2d44a56534141a021a6dcaf0ac324a1332b9d6d8
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 25 16:05:39 2018 -0600

    Change dfa table for Perl extended UTF-8

    This restructures the dfa table for translating UTF-8 into U32 to
    handle higher code points.  In doing so, I rationalized the numbering
    scheme for nodes and byte types.  This makes it easier to see the
    patterns in the table.

commit d3d766310cd38d25c772fc95e6edc6afaa3a02fc
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Jul 1 13:12:16 2018 -0600

    regen/ebcdic.pl: Add capability to generate a dfa table

    This kind of table is used for the dfa for translating or verifying
    UTF-8.

commit 3e65235e2f1a04ef197657d9beaaff587413f340
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Jul 1 12:43:59 2018 -0600

    regen/ebcdic.pl: Add declaration of generated tables

    This adds code to declare and define the tables only under DOINIT, and
    otherwise to just declare them.  This allows the includer to not have
    to deal with them at all.
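The "define under DOINIT, otherwise just declare" idea in the commit above can be sketched roughly as follows. The macro and table names (SKETCH_DOINIT, SKETCH_EXTCONST, SKETCH_INIT, sketch_table) are hypothetical stand-ins chosen for illustration, not the identifiers Perl actually generates, and the single-file layout is only there to make the sketch compile on its own.

```c
/* Exactly one translation unit would define this before including the
 * generated header; every other includer leaves it undefined. */
#define SKETCH_DOINIT

/* --- what a generated header might contain (assumed layout) --- */
#ifdef SKETCH_DOINIT
   /* Defining unit: emit the storage and its initializer. */
#  define SKETCH_EXTCONST   const
#  define SKETCH_INIT(...)  = __VA_ARGS__
#else
   /* Everyone else: declaration only, no initializer. */
#  define SKETCH_EXTCONST   extern const
#  define SKETCH_INIT(...)
#endif

/* The header states the table once; the macros decide whether this
 * line is a definition or merely an extern declaration. */
SKETCH_EXTCONST unsigned char sketch_table[4] SKETCH_INIT({ 1, 2, 3, 5 });
/* --- end of header sketch --- */
```

With this shape the includer never has to write its own extern declarations or guard the initializer; it just includes the header.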
commit c4306170c58ed3b027437608fc39051dcfbd1bd2
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Jun 10 12:18:44 2018 -0600

    ebcdic_tables.h: Add comments

commit 5ad011e5b726e0725b0921ba2937da2ce79cdca0
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Jun 14 13:35:39 2018 -0600

    utf8.c: Change expression to be EBCDIC friendly

    This actually does two things:

    1) it adds macros that evaluate to no extra code on ASCII platforms,
       but allow things to work under EBCDIC; and
    2) it changes to use a ternary conditional.  This may not change
       anything, or it may cause the compiler to generate slightly smaller
       code at the expense of an extra addition instruction.  I am moving
       to inlining this code, and want to make it smaller to enable that
       to happen.

commit e118c58aa39a4cb523e1bbe1b75f9ec97379e32c
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 11 12:58:25 2018 -0600

    utf8.h: Add assert for utf8n_to_uvchr_buf()

    The Perl_utf8n_to_uvchr_buf() version of this function has an assert;
    this adds it as well to the macro that bypasses the function.

commit 376461bff4e33e75e117b8f62f99fe98de68812f
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 11 13:26:24 2018 -0600

    perl.h: Add parens around macro arguments

    Arguments used within macros need to be parenthesized in case they are
    called with an expression.  This commit changes
    _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG() to do that.

commit 20d7f6fd3371de318bc239f5e239b2575db999a5
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 11 13:28:53 2018 -0600

    regexec.c: Call macro with correct args.

    The second argument to this macro is a pointer to the end, as opposed
    to a length.

commit 2133a50fa6ca651a3a472b73f30a926e6e6c5c7e
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Dec 26 18:09:08 2017 -0700

    XXX don't push, khw customization for bench.pl

-----------------------------------------------------------------------

--
Perl5 Master Repository