[perl.git] branch smoke-me/khw-dfa created. v5.29.0-41-g4495efbf31

Karl Williamson Sun, 01 Jul 2018 15:14:04 -0700

In perl.git, the branch smoke-me/khw-dfa has been created

<https://perl5.git.perl.org/perl.git/commitdiff/4495efbf31f351740b82500ee9017c954506c7aa?hp=0000000000000000000000000000000000000000>


        at  4495efbf31f351740b82500ee9017c954506c7aa (commit)

- Log -----------------------------------------------------------------
commit 4495efbf31f351740b82500ee9017c954506c7aa
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Jul 1 16:00:41 2018 -0600

    Make isC9_STRICT_UTF8_CHAR() an inline dfa
    
    This replaces a complicated trie with a dfa.  This should cut down the
    number of conditionals encountered in parsing many code points.

commit de2aa591cbecf9ede3ec9a270efdfef9641926d2
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Jun 27 22:01:53 2018 -0600

    Make isSTRICT_UTF8_CHAR() an inline function
    
    It was a macro that used a trie.  This changes it to use the dfa
    constructed in previous commits.  I didn't bother taking
    measurements.  A dfa should have fewer conditionals for many code
    points.

commit 9d19b998b8703ead8851838cde6fb568f3df4e53
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Jun 27 21:52:47 2018 -0600

    Inline dfa for translating from UTF-8
    
    This commit inlines the simple portion of the dfa that translates from
    UTF-8 to code points, used in functions like utf8_to_uvchr_buf.
    
    This dfa has been changed in previous commits so that it is small, and
    punts on any problematic input, plus 18% of the Hangul syllable code
    points.  (These still come out faster than blead.)  The smallness allows
    it to be inlined, adding <2000 total bytes to the perl text space.
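
    As a rough illustration of the idea (not the code in this branch), a
    table-driven dfa decoder might look like the sketch below.  The names
    byte_class, lead_mask, next_state and decode_one_sketch() are
    hypothetical.  The tables here reject only overlongs, surrogates and
    code points above 0x10FFFF; unlike the dfa described above, the sketch
    does not punt on non-characters or on any Hangul syllables.  A return
    value of 0 stands for "punt to the slow path".

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: every name and table below is an illustrative assumption. */
#define R16(x) x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x

/* Maps each possible byte to one of 12 byte classes. */
static const uint8_t byte_class[256] = {
    R16(0), R16(0), R16(0), R16(0), R16(0), R16(0), R16(0), R16(0), /* 00..7F */
    R16(1),                                                     /* 80..8F */
    R16(2),                                                     /* 90..9F */
    R16(3), R16(3),                                             /* A0..BF */
    11, 11, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,           /* C0..CF */
    R16(4),                                                     /* D0..DF */
    5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 6, 6,             /* E0..EF */
    8, 9, 9, 9, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11  /* F0..FF */
};

/* Payload bits kept from a start byte, indexed by byte class. */
static const uint8_t lead_mask[12] = {
    0x7F, 0, 0, 0, 0x1F, 0x0F, 0x0F, 0x0F, 0x07, 0x07, 0x07, 0
};

enum { ACCEPT = 0, REJECT = 1 };

/* next_state[state][class]; states 2..8 are "inside a sequence". */
static const uint8_t next_state[9][12] = {
    /* class:        0  1  2  3  4  5  6  7  8  9 10 11 */
    /* 0 start  */ { 0, 1, 1, 1, 2, 4, 3, 5, 6, 7, 8, 1 },
    /* 1 reject */ { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 },
    /* 2 need 1 */ { 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 },
    /* 3 need 2 */ { 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1 },
    /* 4 E0     */ { 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1 },
    /* 5 ED     */ { 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1 },
    /* 6 F0     */ { 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1 },
    /* 7 F1..F3 */ { 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1 },
    /* 8 F4     */ { 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }
};

/* Decodes one character from [s, end).  Returns the number of bytes
 * consumed and stores the code point in *cp, or returns 0 to punt. */
static size_t decode_one_sketch(const uint8_t *s, const uint8_t *end,
                                uint32_t *cp)
{
    const uint8_t *p = s;
    uint32_t value = 0;
    unsigned state = ACCEPT;

    while (p < end) {
        const uint8_t b = *p++;
        const unsigned cls = byte_class[b];

        /* The first byte contributes its masked payload; each later
         * byte contributes its low six bits. */
        value = (state == ACCEPT) ? (uint32_t)(b & lead_mask[cls])
                                  : (value << 6) | (b & 0x3Fu);
        state = next_state[state][cls];
        if (state == ACCEPT) { *cp = value; return (size_t)(p - s); }
        if (state == REJECT) break;
    }
    return 0;   /* rejected or truncated input: let a slow path decide */
}
```

    With these tables the bytes EC BF BF decode to U+CFFF in three table
    steps, while ED A0 80 (the surrogate U+D800) makes the function
    return 0 so that a slower, more careful routine can take over.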
    
    The inlined part never calls anything that needs thread context, so that
    parameter can be removed.  I decided to remove it also from the
    Perl_utf8_to_uvchr_buf() and Perl_utf8n_to_uvchr_error() functions.
    There is a small risk that someone is actually using those functions
    instead of the documented macros utf8_to_uvchr_buf() and
    utf8n_to_uvchr_error().  If so, this can be added back in.
    
    Perl_utf8_to_uvchr_msgs() is entirely removed, but the macro
    utf8_to_uvchr_msgs() which is the normal interface to it is retained
    unchanged, and it is marked as unstable anyway.
    
    This change decreases the number of conditional branches in the Perl
    statement
    
        my $a = ord("\x{foo}")
    
    where foo is a non-problematic code point by about 11%, except for
    ASCII characters, where it is 4%, and those Hangul syllables mentioned
    above, where it is 7%.  Problematic code points fare much worse here
    than in blead.  These are the surrogates, non-characters, and
    non-Unicode code points.  We don't care very much about the speed of
    handling these code points, which are mostly considered illegal by
    Unicode anyway.
    
    The percentage decrease is higher for just the function itself, as
    the measured Perl statement has unchanged overhead.
    
    Here are the annotated benchmarks:
    
    Key:
        Ir   Instruction read
        Dr   Data read
        Dw   Data write
        COND conditional branches
        IND  indirect branches
        _m   branch predict miss
        _m1  level 1 cache miss
        _mm  last cache (e.g. L3) miss
        -    indeterminate percentage (e.g. 1/0)
    
    The numbers represent raw counts per loop iteration.
    
    translate_utf8_to_uv_007f
    my $a = ord("\x{007f}")
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 395.0 370.0   106.8
        Dr 122.0 115.0   106.1
        Dw  71.0  61.0   116.4
      COND  49.0  47.0   104.3
       IND   5.0   5.0   100.0
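
    The Ratio % column appears to be the blead count divided by the dfa
    count, times 100, so values above 100 mean the dfa branch does less
    work.  This is inferred from the numbers rather than stated anywhere;
    the helper name below is hypothetical.

```c
/* Assumed relationship, inferred from the tables: 100 * blead / dfa. */
static double ratio_percent_sketch(double blead, double dfa)
{
    return 100.0 * blead / dfa;
}
```

    For the Ir row above, ratio_percent_sketch(395.0, 370.0) is about
    106.76, which rounds to the 106.8 shown; for COND, 49.0 and 47.0 give
    about 104.26, shown as 104.3.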
    
    In all the measurements, the indirect numbers were all zeros and
    unchanged, and are omitted in this message.
    
    translate_utf8_to_uv_07ff
    my $a = ord("\x{07ff}")
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 438.0 390.0   112.3
        Dr 128.0 118.0   108.5
        Dw  71.0  61.0   116.4
      COND  57.0  51.0   111.8
       IND   5.0   5.0   100.0
    
    translate_utf8_to_uv_cfff
    my $a = ord("\x{cfff}")
    
    This is the highest Hangul syllable that gets the full reduction.
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 457.0 410.0   111.5
        Dr 131.0 121.0   108.3
        Dw  71.0  61.0   116.4
      COND  61.0  55.0   110.9
       IND   5.0   5.0   100.0
    
    translate_utf8_to_uv_d000
    my $a = ord("\x{d000}")
    
    This is the lowest affected Hangul syllable
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 457.0 443.0   103.2
        Dr 131.0 132.0    99.2
        Dw  71.0  71.0   100.0
      COND  61.0  57.0   107.0
       IND   5.0   5.0   100.0
    
    translate_utf8_to_uv_d7ff
    my $a = ord("\x{d7ff}")
    
    This is the highest affected Hangul syllable
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 457.0 443.0   103.2
        Dr 131.0 132.0    99.2
        Dw  71.0  71.0   100.0
      COND  61.0  57.0   107.0
       IND   5.0   5.0   100.0
    
    translate_utf8_to_uv_d800
    my $a = ord("\x{d800}")
    
    This is a surrogate, showing much worse performance, but we don't care
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 457.0 515.0    88.7
        Dr 131.0 134.0    97.8
        Dw  71.0  73.0    97.3
      COND  61.0  75.0    81.3
       IND   5.0   5.0   100.0
    
    translate_utf8_to_uv_fdd0
    my $a = ord("\x{fdd0}")
    
    This is a non-char, showing much worse performance, but we don't care
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 457.0 548.0    83.4
        Dr 131.0 139.0    94.2
        Dw  71.0  73.0    97.3
      COND  61.0  81.0    75.3
       IND   5.0   5.0   100.0
    
    translate_utf8_to_uv_fffd
    my $a = ord("\x{fffd}")
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 457.0 410.0   111.5
        Dr 131.0 121.0   108.3
        Dw  71.0  61.0   116.4
      COND  61.0  55.0   110.9
       IND   5.0   5.0   100.0
    
    translate_utf8_to_uv_ffff
    my $a = ord("\x{ffff}")
    
    This is another non-char, showing much worse performance, but we don't
    care
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 457.0 548.0    83.4
        Dr 131.0 139.0    94.2
        Dw  71.0  73.0    97.3
      COND  61.0  81.0    75.3
       IND   5.0   5.0   100.0
    
    translate_utf8_to_uv_1fffd
    my $a = ord("\x{1fffd}")
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 476.0 430.0   110.7
        Dr 134.0 124.0   108.1
        Dw  71.0  61.0   116.4
      COND  65.0  59.0   110.2
       IND   5.0   5.0   100.0
    
    translate_utf8_to_uv_10fffd
    my $a = ord("\x{10fffd}")
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 476.0 430.0   110.7
        Dr 134.0 124.0   108.1
        Dw  71.0  61.0   116.4
      COND  65.0  59.0   110.2
       IND   5.0   5.0   100.0
    
    translate_utf8_to_uv_110000
    my $a = ord("\x{110000}")
    
    This is a non-Unicode code point, showing much worse performance, but we
    don't care
    
           blead   dfa Ratio %
           ----- ----- -------
        Ir 476.0 544.0    87.5
        Dr 134.0 137.0    97.8
        Dw  71.0  73.0    97.3
      COND  65.0  81.0    80.2
       IND   5.0   5.0   100.0

commit 0637ed2cfc1d49aaafc4db9d57d7721e8ab492a0
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Jun 27 21:28:15 2018 -0600

    utf8.c: Avoid unnecessary work xlating utf8 to uv
    
    This moves the code for the dfa that does the translation of
    non-problematic characters to earlier in the function to avoid work that
    only needs to be done if the dfa rejects the input.  For example,
    calculating how long the sequence needs to be is no longer done
    unless the dfa rejects.
    
    Since the dfa always accepts an invariant if the allowed length is
    non-zero, the code that tests for those specifically can be removed.

commit f0c8c77ab9ae7e0a7831cb62fa7593c435c72362
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Jun 27 18:19:54 2018 -0600

    Use strict dfa to translate from UTF-8 to code point
    
    With this commit, if a sequence passes the dfa, the result can be
    returned immediately.  Previously, some rare, potentially problematic
    sequences could pass the dfa and then needed further checking, so
    that checking had to be done always.  This speeds up the general case.

commit 37de8f890719ccbfbbd6cdd977fd74aadc260c6a
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Jun 27 18:08:12 2018 -0600

    Add dfa for strict translation from UTF-8

commit 3dc960ea30f42e63176f2a6ddecbd7f691099651
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Jul 1 13:48:34 2018 -0600

    Fix outdated docs for isUTF8_char()
    
    It doesn't accept non-negative code points that don't fit in an IV

commit df02d1b1b97896f2def792f8effb91eef2e3f263
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 25 19:11:46 2018 -0600

    Make isUTF8_char() an inline function
    
    It was a macro that used a trie.  This changes it to use the dfa
    constructed in previous commits.  I didn't bother taking
    measurements.  A dfa should require fewer conditionals to be executed
    for many code points.

commit 9aee7ce99074258cdabfd8964a478733b32e4f9d
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 25 17:01:30 2018 -0600

    Extend dfa for translation of UTF-8 to EBCDIC
    
    This commit switches to using a dfa for translating from UTF-8 on
    EBCDIC platforms.  This makes for fewer #ifdefs, and I realized while
    working on the dfa that it wasn't difficult to do for EBCDIC.

commit b6cae1a151072e56beec689928bbcebdcc9285f1
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 25 16:49:22 2018 -0600

    Make UTF-8 dfa table an EXTCONST
    
    This will allow it to be used inline.

commit ea54e1eb56150c605498664c3caea864d678dd61
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 25 16:16:04 2018 -0600

    Rename dfa table for UTF-8
    
    This is in preparation for having additional dfa tables.  This names
    this one to reflect its specific purpose.

commit 2d44a56534141a021a6dcaf0ac324a1332b9d6d8
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 25 16:05:39 2018 -0600

    Change dfa table for Perl extended UTF-8
    
    This restructures the dfa table for translating UTF-8 into U32 to handle
    higher code points.  In doing so, I rationalized the numbering scheme
    for nodes and byte types.  This makes it easier to see the patterns in
    the table.

commit d3d766310cd38d25c772fc95e6edc6afaa3a02fc
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Jul 1 13:12:16 2018 -0600

    regen/ebcdic.pl: Add capability to generate a dfa table
    
    This kind of table is used for the dfa for translating or verifying
    UTF-8.

commit 3e65235e2f1a04ef197657d9beaaff587413f340
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Jul 1 12:43:59 2018 -0600

    regen/ebcdic.pl: Add declaration of generated tables
    
    This adds code to declare and define the tables only under DOINIT, and
    otherwise to just declare them.  This allows the includer to not have to
    deal with them at all.
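
    The declare-versus-define split can be sketched roughly as follows.
    DOINIT_SKETCH, EXTCONST_SKETCH and sketch_table are hypothetical
    stand-ins, not the generated names; in real use only one translation
    unit would define the DOINIT-style symbol before including the header.

```c
/* Sketch only: pretend this translation unit is the one that defines. */
#define DOINIT_SKETCH

#ifdef DOINIT_SKETCH
#  define EXTCONST_SKETCH const          /* declare and define here */
#else
#  define EXTCONST_SKETCH extern const   /* just declare */
#endif

#ifdef DOINIT_SKETCH
EXTCONST_SKETCH unsigned char sketch_table[4] = { 0, 1, 1, 2 };
#else
EXTCONST_SKETCH unsigned char sketch_table[4];
#endif

/* Any includer can read the table without caring which case applied. */
static unsigned char sketch_lookup(unsigned idx)
{
    return sketch_table[idx & 3u];
}
```

    Keeping both branches next to the table means the file that includes
    it needs no knowledge of where the storage actually lives.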

commit c4306170c58ed3b027437608fc39051dcfbd1bd2
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Jun 10 12:18:44 2018 -0600

    ebcdic_tables.h: Add comments

commit 5ad011e5b726e0725b0921ba2937da2ce79cdca0
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Jun 14 13:35:39 2018 -0600

    utf8.c: Change expression to be EBCDIC friendly
    
    This actually does two things: 1) it adds macros that evaluate to no
    extra code on ASCII platforms, but allow things to work under EBCDIC;
    and 2) it changes the expression to use a ternary conditional.  This
    may not change anything, or it may cause the compiler to generate
    slightly smaller code at the expense of an extra addition instruction.
    I am moving to inlining this code, and want to make it smaller to
    enable that to happen.

commit e118c58aa39a4cb523e1bbe1b75f9ec97379e32c
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 11 12:58:25 2018 -0600

    utf8.h: Add assert for utf8n_to_uvchr_buf()
    
    The Perl_utf8n_to_uvchr_buf() version of this function has an assert;
    this adds it as well to the macro that bypasses the function.
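
    One common way to put an assert into an expression-style macro is the
    comma operator, sketched below.  FIRST_BYTE_SKETCH is a hypothetical
    macro for illustration and is not how utf8.h necessarily spells it.

```c
#include <assert.h>

/* Hypothetical macro: the assert is evaluated first, then the comma
 * operator yields the value of the second operand. */
#define FIRST_BYTE_SKETCH(s, e) (assert((s) < (e)), *(s))

static unsigned char first_byte_of(const unsigned char *s,
                                   const unsigned char *e)
{
    return FIRST_BYTE_SKETCH(s, e);
}
```

    When NDEBUG is defined the assert expands to a void no-op, so the
    macro costs nothing in non-debugging builds.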

commit 376461bff4e33e75e117b8f62f99fe98de68812f
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 11 13:26:24 2018 -0600

    perl.h: Add parens around macro arguments
    
    Arguments used within macros need to be parenthesized in case the
    macro is called with an expression.  This commit changes
    _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG() to do that.
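
    A minimal sketch of the hazard, using two hypothetical macros that
    are unrelated to the one changed here:

```c
/* Without parentheses the argument is pasted in as-is, so operator
 * precedence can regroup it. */
#define TWICE_UNSAFE_SKETCH(x) (x * 2)
#define TWICE_SAFE_SKETCH(x)   ((x) * 2)

/* Expands to (1 + 2 * 2). */
static int unsafe_result(void) { return TWICE_UNSAFE_SKETCH(1 + 2); }

/* Expands to ((1 + 2) * 2). */
static int safe_result(void)   { return TWICE_SAFE_SKETCH(1 + 2); }
```

    Here unsafe_result() yields 5 while safe_result() yields the
    intended 6.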

commit 20d7f6fd3371de318bc239f5e239b2575db999a5
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Jun 11 13:28:53 2018 -0600

    regexec.c: Call macro with correct args.
    
    The second argument to this macro is a pointer to the end, as opposed to
    a length.
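
    The distinction can be sketched with a hypothetical helper that, like
    the macro in question, takes an end pointer as its second argument:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts bytes with the high bit set in [s, end). */
static size_t count_high_bytes(const unsigned char *s,
                               const unsigned char *end)
{
    size_t n = 0;

    while (s < end) {
        if (*s++ & 0x80) n++;
    }
    return n;
}

static size_t count_high_bytes_in_string(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    const size_t len = strlen(str);

    /* Correct: pass the end pointer, s + len, and not len itself. */
    return count_high_bytes(s, s + len);
}
```

    Passing a length where an end pointer is expected may still compile
    when the parameter is a macro argument, which is how this kind of
    mistake can go unnoticed.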

commit 2133a50fa6ca651a3a472b73f30a926e6e6c5c7e
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Dec 26 18:09:08 2017 -0700

    XXX don't push, khw customization for bench.pl

-----------------------------------------------------------------------

-- 
Perl5 Master Repository
