In perl.git, the branch smoke-me/khw-variant has been created

<https://perl5.git.perl.org/perl.git/commitdiff/732d8ba00972ddce47595a7978fa0046fc553e43?hp=0000000000000000000000000000000000000000>

        at  732d8ba00972ddce47595a7978fa0046fc553e43 (commit)

- Log -----------------------------------------------------------------
commit 732d8ba00972ddce47595a7978fa0046fc553e43
Author: Karl Williamson <[email protected]>
Date:   Sun Jan 28 10:02:11 2018 -0700

    Don't use variant_byte_number on MSVC6
    
    See [perl #132766]

commit a2e76eafc6b4443147adfa08d5acfe2f58cefd21
Author: Karl Williamson <[email protected]>
Date:   Thu Jan 25 10:37:04 2018 -0700

    inline.h: Clarify comment

commit 040ff732eea2339831496dbcc776cbfde8ebe94f
Author: Karl Williamson <[email protected]>
Date:   Thu Jan 25 10:25:27 2018 -0700

    Don't use C99 ULL constant suffix
    
    The suffix ULL in, e.g., 7ULL, is C99, and since perl supports C89, we
    can't use it.  Change these occurrences so that any constant that would
    exceed 32 bits is wrapped in UINTMAX_C(...) instead.
    
    perl.h has logic to define that macro appropriately if the compiler
    doesn't already know it.
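
As an illustration of the change described above, here is a minimal sketch of writing a wide constant without the ULL suffix. It assumes a compiler whose <stdint.h> already provides UINTMAX_C; the fallback logic in perl.h is not reproduced here, and the function name high_bit_of_each_byte is hypothetical.

```c
#include <assert.h>
#include <stdint.h>   /* assumed to supply uintmax_t and UINTMAX_C */

/* Return a word with the high bit set in each of its low eight bytes.
 * The constant needs more than 32 bits, so it is wrapped in UINTMAX_C()
 * rather than written with the C99-only ULL suffix.
 * (Hypothetical helper, not a function from the perl source.) */
static uintmax_t
high_bit_of_each_byte(void)
{
    /* The constant below only fits if uintmax_t has at least 64 bits. */
    assert(sizeof(uintmax_t) * 8 >= 64);
    return UINTMAX_C(0x8080808080808080);
}
```

Constants that fit in 32 bits need no wrapper, which is why only the wider ones are changed.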

commit 30b5ee6936fa286c8ea9fb56822b5758a75701f5
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 26 18:25:26 2017 -0700

    regexec.c: Replace loop by memchr()
    
    This can be called on a potentially long string.
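
The kind of replacement this commit describes can be sketched as follows. This is not the code from regexec.c; both function names are hypothetical, and the sketch only shows that a byte-at-a-time search loop and a memchr() call return the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-at-a-time search: pointer to the first c in s[0..len), or NULL.
 * (Hypothetical stand-in for the kind of loop being replaced.) */
static const char *
find_byte_loop(const char *s, size_t len, char c)
{
    const char *end = s + len;

    assert(s != NULL || len == 0);
    for (; s < end; s++) {
        if (*s == c)
            return s;
    }
    return NULL;
}

/* Same contract, delegated to the C library, whose memchr() is
 * typically optimized for long inputs. */
static const char *
find_byte_memchr(const char *s, size_t len, char c)
{
    return (const char *) memchr(s, (unsigned char) c, len);
}
```

On a long string the library routine is the one more likely to scan several bytes per step, which is the motivation given in the commit message.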

commit 042197bd8afef1622fc0d1393f0848606ba6627f
Author: Karl Williamson <[email protected]>
Date:   Fri Dec 29 15:45:38 2017 -0700

    Use word-at-a-time to repeat /i single byte pattern
    
    For most of the case folding pairs, like [Aa], it is possible to use a
    mask to match them word-at-a-time in regrepeat(), so that long sequences
    of them are handled with significantly better performance.
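
The mask idea can be illustrated with a small sketch. It assumes an ASCII platform, 64-bit words, and a fold pair whose two bytes differ in exactly one bit (as 'A' 0x41 and 'a' 0x61 do); the function name is hypothetical and this is not the code in regrepeat().

```c
#include <assert.h>
#include <stdint.h>

/* Return non-zero if every byte of w is either c1 or c2.
 * Precondition: c1 and c2 differ in exactly one bit.
 * (Hypothetical helper illustrating the masking trick.) */
static int
word_is_all_fold_pair(uint64_t w, unsigned char c1, unsigned char c2)
{
    const uint64_t ones = UINT64_C(0x0101010101010101);
    const unsigned char diff = (unsigned char) (c1 ^ c2);
    /* Replicate "ignore the differing bit" and the shared bits
     * into every byte of a word. */
    const uint64_t mask = ones * (unsigned char) ~diff;
    const uint64_t want = ones * (unsigned char) (c1 & c2);

    assert(diff != 0 && (diff & (diff - 1)) == 0);
    return (w & mask) == want;
}
```

Because the differing bit is masked out of every byte at once, a single comparison accepts any mixture of the two cases across the whole word.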

commit 98ed5d30cff4c8f99ed22f2bc5bb8c1f5e168e0b
Author: Karl Williamson <[email protected]>
Date:   Fri Dec 29 15:17:41 2017 -0700

    regexec.c: Use word-at-a-time to repeat a single byte pattern
    
    There is special code in the function regrepeat() to handle instances
    where the pattern to repeat is a single byte.  These all can be done
    word-at-a-time to significantly increase the performance of long
    repeats.
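
A minimal sketch of the word-at-a-time approach for a single repeated byte follows. It assumes 64-bit words and uses memcpy() to avoid alignment concerns; the function name count_byte_run is hypothetical, and the real regrepeat() code may be organized quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count how many leading bytes of s[0..len) are equal to c.
 * (Hypothetical helper, not taken from regexec.c.) */
static size_t
count_byte_run(const unsigned char *s, size_t len, unsigned char c)
{
    /* c replicated into every byte of a 64-bit word */
    const uint64_t want = UINT64_C(0x0101010101010101) * c;
    size_t i = 0;

    assert(s != NULL || len == 0);

    /* Compare eight bytes per iteration while whole words match. */
    while (len - i >= sizeof(uint64_t)) {
        uint64_t w;

        memcpy(&w, s + i, sizeof w);
        if (w != want)
            break;
        i += sizeof w;
    }

    /* Finish byte-at-a-time: the tail, or the word that failed. */
    while (i < len && s[i] == c)
        i++;

    return i;
}
```

For example, calling it on twelve 'a' bytes followed by a 'b' returns 12: one full word is accepted by the first loop and the remaining four bytes by the second.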

commit 06b92434c9984a2ad9e4e1fd46196c05cea62d25
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 5 09:30:49 2017 -0700

    maybe drop: avoid ifs

commit 9f66c8231543f01a8f7e71c34cae1fc060dc1198
Author: Karl Williamson <[email protected]>
Date:   Thu Dec 7 22:11:03 2017 -0700

    regcomp.c: Use some of the new typedefs
    
    This just makes sure the concept works.

commit 500790c33a02d90952a8386beb4b284cefde7397
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:32:14 2017 -0700

    toke.c: Use valid_utf8_length()

commit 6b6a02853f74eb19065acd4d52185b75c646fa4f
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:29:36 2017 -0700

    regexec.c: Use valid_utf8_length()

commit 8b5bab7778010f82975a8a08def6a8067f79b073
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:29:22 2017 -0700

    regcomp.c: Use valid_utf8_length()

commit 4f16d8d5a24e001b2884bed5d205072978561250
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:25:37 2017 -0700

    ext/B: Use valid_utf8_length()

commit 4db060d1c5314261003a012c51acfe8a54b34adb
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:24:19 2017 -0700

    mg.c: Use valid_utf8_length()

commit 447066a64e848b2f75781c29a8e1e915c52aafc0
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:23:54 2017 -0700

    mg.h: Use valid_utf8_length()

commit 4b562bc4693e001c14fca5b8944aea580f1ddc1d
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:21:41 2017 -0700

    sv.h: Use valid_utf8_length()

commit 0a719deb3728a341156b2eb33f7b11a4ad331c90
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:20:33 2017 -0700

    pp_sys.c: Use valid_utf8_length()

commit 8ce7bee502c7e6fa4dae54a0d85019bf028fcfb8
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:19:36 2017 -0700

    pp_pack.c: Use valid_utf8_length()

commit 703992913f31d1c9b4b7f9168b55eda7c6e85541
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:19:03 2017 -0700

    sv.c: Use valid_utf8_length()

commit bea4c5c9d5fa5822aadfaae1208c968de4020abd
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 6 22:18:14 2017 -0700

    pp_split() Use valid_utf8_length()

commit e3bd8b5925379a31386109fce7ce293f0ba6e208
Author: Karl Williamson <[email protected]>
Date:   Mon Dec 4 18:27:17 2017 -0700

    XXX need 32-bit results Add core function valid_utf8_length()
    
    XXX maybe don't do on 32 bit machines
    
    This function is like utf8_length() but assumes that the input is valid
    UTF-8 and uses a different algorithm which does counting word-at-a-time,
    very much like variant_under_utf8_count(), leading to significant
    performance improvements, with longer strings getting more relative
    improvement.
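
One way such word-at-a-time counting could look is sketched below. It is an illustration only, not the valid_utf8_length() implementation: it assumes valid UTF-8 input on an ASCII (non-EBCDIC) platform with 64-bit words, and the name utf8_length_wordwise is hypothetical. The length in characters is the byte count minus the number of continuation bytes (those of the form 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the characters in s[0..len), which must be valid UTF-8.
 * (Hypothetical sketch, not the function added by this commit.) */
static size_t
utf8_length_wordwise(const unsigned char *s, size_t len)
{
    const uint64_t ones = UINT64_C(0x0101010101010101);
    size_t continuations = 0;
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (len - i >= sizeof(uint64_t)) {
        uint64_t w, cont;

        memcpy(&w, s + i, sizeof w);
        /* Low bit of each byte becomes 1 where that byte has
         * bit 7 set and bit 6 clear, i.e. is a continuation. */
        cont = (w >> 7) & (~w >> 6) & ones;
        /* Multiplying by 'ones' sums the eight flags into the top byte. */
        continuations += (size_t) ((cont * ones) >> 56);
        i += sizeof w;
    }

    /* Handle the final partial word byte-at-a-time. */
    for (; i < len; i++)
        continuations += ((s[i] & 0xC0) == 0x80);

    return len - continuations;
}
```

Every word is examined regardless of content, which matches the trade-off discussed in the following paragraph: there is no opportunity to skip ahead over long runs of continuation bytes.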
    
    The performance improvement is highly data-dependent, and in fact the
    new method is slower than the current one for very large code points
    that require more bytes to represent than the platform's word length.
    This is because the current algorithm skips all the continuation bytes,
    so it may end up skipping more than a word's length, whereas the new
    algorithm examines each word.  But these code points are not legal
    Unicode, and we should consider only legal Unicode when doing
    optimizations.  And legal Unicode sees significant performance
    improvements.
    XXX
    having longer
    On a 32-bit system, the number of failed branch predictions declines to
    half as many, with everything else staying about equal.
    
    32-bit UV's; string length 24 characters; 2 bytes per character
    
           bytecount wordcount
           --------- ---------
        Ir    100.00    100.72
        Dr    100.00    100.82
        Dw    100.00    101.10
      COND    100.00    100.00
       IND    100.00    100.00
    
    COND_m    100.00    200.00
     IND_m    100.00    100.00
    
     Ir_m1    100.00    100.00
     Dr_m1    100.00    100.00
     Dw_m1    100.00    100.00
    
     Ir_mm    100.00    100.00
     Dr_mm    100.00    100.00
     Dw_mm    100.00    100.00
    
    The results are similar for longer strings, and for code points
    represented by different numbers of bytes.
    
    The results on a 64-bit platform also show failed branch predictions
    halved (COND_m at 200%), but at some short string lengths, the number
    of branches worsens slightly:
    
    64-bit UV's; string length 4 characters; 3 bytes per character
    
            byteutf8_length wordutf8_length
            --------------- ---------------
         Ir          100.00           96.08
         Dr          100.00          100.88
         Dw          100.00          100.00
       COND          100.00           97.24
        IND          100.00          100.00
    
     COND_m          100.00          200.00
      IND_m          100.00          100.00
    
      Ir_m1          100.00          100.00
      Dr_m1          100.00          100.00
      Dw_m1          100.00          100.00
    
      Ir_mm          100.00          100.00
      Dr_mm          100.00          100.00
      Dw_mm          100.00          100.00
    
    For longer strings things improve:
    
    64-bit UV's; string length 24 characters; 2 bytes per character
    
            byteutf8_length wordutf8_length
            --------------- ---------------
         Ir          100.00          103.97
         Dr          100.00          112.35
         Dw          100.00          100.00
       COND          100.00          110.27
        IND          100.00          100.00
    
     COND_m          100.00          300.00
      IND_m          100.00          100.00
    
      Ir_m1          100.00          100.00
      Dr_m1          100.00          100.00
      Dw_m1          100.00          100.00
    
      Ir_mm          100.00          100.00
      Dr_mm          100.00          100.00
      Dw_mm          100.00          100.00
    
    64-bit UV's; string length 24 characters; 3 bytes per character
    
            byteutf8_length wordutf8_length
            --------------- ---------------
         Ir          100.00           99.73
         Dr          100.00          111.37
         Dw          100.00          100.00
       COND          100.00          108.05
        IND          100.00          100.00
    
     COND_m          100.00          150.00
      IND_m          100.00          100.00
    
      Ir_m1          100.00          100.00
      Dr_m1          100.00          100.00
      Dw_m1          100.00          100.00
    
      Ir_mm          100.00          100.00
      Dr_mm          100.00          100.00
      Dw_mm          100.00          100.00
    
    For very long strings:
    
    64-bit UV's; string length 10000000 characters; 2 bytes per character
    
            byteutf8_length wordutf8_length
            --------------- ---------------
         Ir          100.00          160.00
         Dr          100.00          799.91
         Dw          100.00          100.00
       COND          100.00          399.98
        IND          100.00          100.00
    
     COND_m          100.00          150.00
      IND_m          100.00          100.00
    
      Ir_m1          100.00          100.00
      Dr_m1          100.00          100.00
      Dw_m1          100.00          100.00
    
      Ir_mm          100.00          100.00
      Dr_mm          100.00          100.00
      Dw_mm          100.00          100.00
    
    Performance actually worsens on strings whose code points occupy 7 or
    13 bytes each.  These are not in common use, as the largest code point
    that Unicode recognizes occupies 4 bytes.

commit b15a82582ba2357b51486e8b46a9c8cc05638fb3
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 5 13:53:15 2017 -0700

    Debug for EBCDIC

commit e809df6db88019b14478f76fa56633e531a01d72
Author: Karl Williamson <[email protected]>
Date:   Wed Nov 22 23:10:01 2017 -0700

    S_multiconcat() Use faster variant counting

commit a3350bbeef293a69c4477057633b89f411f51ea9
Author: Karl Williamson <[email protected]>
Date:   Wed Nov 22 23:12:37 2017 -0700

    toke.c: lex_stuff_pvn() Use faster UTF-8 variant count

-----------------------------------------------------------------------

-- 
Perl5 Master Repository
