In perl.git, the branch smoke-me/khw-invlist has been created

<http://perl5.git.perl.org/perl.git/commitdiff/872652fefc7911c8a78fbb00327745dc1ba12d3e?hp=0000000000000000000000000000000000000000>

        at  872652fefc7911c8a78fbb00327745dc1ba12d3e (commit)

- Log -----------------------------------------------------------------
commit 872652fefc7911c8a78fbb00327745dc1ba12d3e
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 6 11:46:02 2011 -0700

    more squishing

M       regcomp.c

commit a5dbd8cfd4d7f9d8c5c9a63e728f3271cdf01088
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 6 10:53:16 2011 -0700

    more temporary commit

M       regcomp.c

commit e0d26c18c4bc45a440f67b8d13fbffe6dbccc9e8
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 6 10:44:27 2011 -0700

    different squish white space

M       regcomp.c

commit 2b722e086ed5e76eb99b83a8bd4723646b067d47
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 6 10:42:27 2011 -0700

    squisch white space

M       regcomp.c

commit 81026b2b306f2e57222eb7b44e99f4be21461e87
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 6 10:27:53 2011 -0700

    squish safefree

M       regcomp.c

commit cbbfd4e27d722552ee9c8e4f27c13ad492791105
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 6 09:37:02 2011 -0700

    regcomp.c: see if avoids embed_thr bug

M       regcomp.c

commit eb6fcaf5a6b64d7752d38e24c452bbee884cc414
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 6 09:21:21 2011 -0700

    squish

M       regcomp.c

commit 77465be60452670fa2e9bfc0007a2433b4f78f2f
Author: Karl Williamson <[email protected]>
Date:   Tue Nov 29 17:54:11 2011 -0700

    utf8.c: Silence compiler warnings
    
    These variables aren't used uninitialized in these functions, but the
    logic might change in the future without considering all the
    consequences, so silence the warnings.

M       utf8.c

commit 02031aaf10412a36c281a005a45d6fbecf82aa53
Author: Karl Williamson <[email protected]>
Date:   Tue Nov 29 14:57:02 2011 -0700

    regcomp.c: Don't read beyond input
    
    This code was assuming that there were several more bytes in the input
    stream, when there may not be.  This was discovered by valgrind.

M       regcomp.c

commit 8791b06c6734cdce650298c7561d5438b533e6b1
Author: Karl Williamson <[email protected]>
Date:   Mon Nov 28 12:32:02 2011 -0700

    regcomp.c: Optimize a single Unicode property in a [character class]
    
    All Unicode properties actually turn into bracketed character classes,
    whether explicitly done or not.  A swash is generated for each property
    in the class.  If that is the only thing not in the class's bitmap, it
    specifies completely the non-bitmap behavior of the class, and can be
    passed explicitly to regexec.c.  This avoids having to regenerate the
    swash.  It also means that the same swash is used for multiple instances
    of a property.  And that means the number of duplicated data structures
    is greatly reduced.  This currently doesn't extend to cases where
    multiple Unicode properties are used in the same class
    [\p{greek}\p{latin}] will not share the same swash as another character
    class with the same components.  This is because I don't know of a
    an efficient method to determine if a new class being parsed has the
    same components as one already generated.  I suppose some sort of
    checksum could be generated, but that is for future consideration.

M       regcomp.c
M       t/lib/warnings/utf8
M       t/uni/cache.t

commit f83a1595024426a9bf4fbb7e62afeb7566bade86
Author: Karl Williamson <[email protected]>
Date:   Mon Nov 28 10:26:28 2011 -0700

    Move Unicode property defn processing to compile time
    
    This patch moves the processing of most Unicode property definitions
    from execution (regexec.c) to compilation (regcomp.c).  There is a cost
    to do this.  By deferring it to execution, it may be that the affected
    patch will never be taken, and hence the work won't have to be done;
    whereas, it's always done if it gets done at compilation.
    
    However, doing it at compilation, has many advantages.  We can't
    optimize what we don't know about, so this allows for better
    optimization, as well as feature enhancements, such as set
    manipulations, restricting matches to certain scripts, etc.  A big one,
    about to be committed allows for significantly reducing the number of
    copies of the data structure used for each property.  (Currently, every
    mention in every regular expression of a given property will generate a
    new instance of its hash, and so results of look-ups of code points in
    one instance aren't automatically known to other instances, so the code
    point has to be looked-up again.)
    
    This commit leaves the processing to execution time when the class is to
    be inverted.  This was done purely to make the commit smaller, and will
    be removed in a future commit; hence the redundant test here will be
    removed shortly.
    
    It also has to leave to execution time processing of properties whose
    definition is not known yet.  That can happen when the property is
    user-defined.  We call _core_swash_init(), and if it fails, we assume
    that it's because it's such a property, and if it turns out that it was
    an unknown property, we leave to execution time the raising of a warning
    for it, just as before.

M       regcomp.c

commit 447dae57c9584dd0ce1e1b3f92324652b27c2231
Author: Karl Williamson <[email protected]>
Date:   Mon Nov 28 09:43:54 2011 -0700

    regcomp.c: Pass inversion list directly to regexec.c
    
    Currently, any generated inversion list is stringified and passed in the
    data structure to regexec.c as such.  regexec.c then calls
    _core_swash_init() to convert it into a swash and back into an inversion
    list.  This intermediate step is wasteful, and this commit dispenses
    with it, based on preparatory commits in regexec.c and utf8.c

M       regcomp.c

commit 3c9dfa4f0e6faea4d2830a338f18f5ff9bfb7e8b
Author: Karl Williamson <[email protected]>
Date:   Mon Nov 28 09:25:45 2011 -0700

    regexec.c: Prepare for inversion lists in ANYOF nodes
    
    Future commits will start passing inversion lists to regexec.c from the
    compilation phase.  This commit causes regexec.c to accept them, trace
    them for debug output, and pass them along to utf8.c

M       regexec.c

commit 47cb59a455b68e0cdb741b5e5a37b8adefb9ef4c
Author: Karl Williamson <[email protected]>
Date:   Mon Nov 28 09:20:12 2011 -0700

    regcomp.c: Add _invlist_contents() to compactly dump inversion list
    
    This will be used in future commits for debug traces

M       embed.fnc
M       embed.h
M       proto.h
M       regcomp.c

commit 039fbcad597e91322f929bc6cdaa32358334a045
Author: Karl Williamson <[email protected]>
Date:   Mon Nov 28 09:00:52 2011 -0700

    utf8.c: White-space only
    
    As a result of previous commits adding and removing if() {} blocks,
    indent and outdent and reflow comments and statements to not exceed 80
    columns.

M       utf8.c

commit 7780f365344a744aaca7f78ceb199c7a193c3a7f
Author: Karl Williamson <[email protected]>
Date:   Mon Nov 28 08:36:54 2011 -0700

    utf8.c: Add ability to pass inversion list to _core_swash_init()
    
    Add a new parameter to _core_swash_init() that is an inversion list to
    add to the swash, along with a boolean to indicate if this inversion
    list is derived from a user-defined property.  This capability will prove
    useful in future commits

M       embed.fnc
M       embed.h
M       proto.h
M       utf8.c

commit cf9fbaa62d8ebb7bc689b9a61bd7e4bdedb82756
Author: Karl Williamson <[email protected]>
Date:   Mon Nov 28 08:24:07 2011 -0700

    utf8.c: Add flag to swash_init() to not croak on error
    
    This adds the capability, to be used in future commits, for swash_ini()
    to return NULL instead of croaking if it can't find a property, so that
    the caller can choose how to handle the situation.

M       embed.fnc
M       embed.h
M       proto.h
M       utf8.c

commit f649cf48e78f8391f580f9664f24373cd50b1d75
Author: Karl Williamson <[email protected]>
Date:   Sun Nov 27 20:55:33 2011 -0700

    regcomp.c: Use '*a == b', not 'a == &b'
    
    The latter doesn't always work.  The consequences of this failure were
    memory leaks

M       regcomp.c

commit 7d99acec7d08718ae4c2d29c5cf5a1c2c1377568
Author: Karl Williamson <[email protected]>
Date:   Sun Nov 27 20:53:56 2011 -0700

    regcomp.c: decrement ptr ref cnt before invalidating ptr
    
    Otherwise there coul be memory leaks

M       regcomp.c

commit bc6fb35b4eabf48168509b437b018a24d9cbbb3a
Author: Karl Williamson <[email protected]>
Date:   Sun Nov 27 20:49:59 2011 -0700

    regcomp.c: Add some assertions
    
    Subsequent code assumes that these are true

M       regcomp.c

commit d733ca4e637b862a05ee0436cf7dc966a904b99e
Author: Karl Williamson <[email protected]>
Date:   Sun Nov 27 15:45:22 2011 -0700

    regcomp.c: Don't overallocate space for cloned SV
    
    The length passed to _new_invlist() is in elements and not bytes, so
    this was overallocating space because the number of bytes is multiplied
    by a platform-dependent value.

M       regcomp.c

commit 88832752f4ce21b7b886742e27184ab0061b1e92
Author: Karl Williamson <[email protected]>
Date:   Sun Nov 27 15:41:26 2011 -0700

    regcomp.c: Make sure invlist_clone length set correctly
    
    The cloned inversion list was getting initialized with sufficient space,
    but because a Copy was used, it did not know how much of that space is
    occupied.  There are no tests, because this was found through valgrind,
    and it otherwise depends on whatever was in the uninitialized data at
    the time

M       regcomp.c

commit 0e5fef5eeb2995ecda11ae35e17bad554acc9026
Author: Karl Williamson <[email protected]>
Date:   Sun Nov 27 15:34:52 2011 -0700

    utf8.c: Prevent reading before buffer start
    
    Make sure there is something before the character being read before
    reading it.

M       utf8.c

commit 230528e3e39f08f8b8ee80cc286988cc65d70ac4
Author: Karl Williamson <[email protected]>
Date:   Sat Nov 26 18:24:46 2011 -0700

    Utf8.c: Generate and use inversion lists for binary swashes
    
    Prior to this patch, every time a code point was matched against a swash,
    and the result was not previously known, a linear search through the
    swash was performed.  This patch changes that to generate an inversion
    list whenever a swash for a binary property is created.  A binary search
    is then performed for missing values.
    
    This change does not have much effect on the speed of Perl's regression
    test suite, but the speed-up in worst-case scenarios is huge.  The
    program at the end of this commit is crafted to avoid the caching that
    hides much of the current inefficiencies.  At character classes of 100
    isolated code points, the new method is about an order of magnitude
    faster; two orders of magnitude at 1000 code points.  The program at the
    end of this commit message took 97s to execute on my box using blead,
    and 1.5 seconds using this new scheme.  I was surprised to see that even
    with classes containing fewer than 10 code points, the binary search
    trumped, by a little, the linear search
    
    Even after this patch, under the current scheme, one can easily run out
    of memory due to the permanent storing of results of swash lookups in
    hashes.  The new search mechanism might be fast enough to enable the
    elimination of that memory usage.  Instead, a simple cache in each
    inversion list that stored its previous result could be created, and
    that checked to see if it's still valid before starting the search,
    under the assumption, which the current scheme also makes, that probes
    will tend to be clustered together, as nearby code points are often in
    the same script.
    ===============================================
     # This program creates longer and longer character class lists while
     # testing code points matches against them.  By adding or subtracting
     # 65 from the previous member, caching of results is eliminated (as of
     # this writing), so this essentially tests for how long it takes to
     # search through swashes to see if a code point matches or not.
    
    use Benchmark ':hireswallclock';
    
    my $string = "";
    my $class_cp = 2**30;   # Divide the code space in half, approx.
    my $string_cp = $class_cp;
    my $iterations = 10000;
    for my $j (1..2048) {
    
        # Append the next character to the [class]
        my $hex_class_cp = sprintf("%X", $class_cp);
        $string .= "\\x{$hex_class_cp}";
        $class_cp -= 65;
    
        next if $j % 100 != 0;  # Only test certain ones
    
        print "$j: lowest is [$hex_class_cp]: ";
    
        timethis(1, "no warnings qw(portable non_unicode);my \$i = $string_cp; 
for (0 .. $iterations) { chr(\$i) =~ /[$string]/; \$i+= 65 }");
        $string_cp += ($iterations + 1) * 65;
    }

M       utf8.c

commit 3102c724f371ed03e1bedaa0a157decaedc09006
Author: Karl Williamson <[email protected]>
Date:   Fri Nov 25 12:59:51 2011 -0700

    regcomp.c:  Add _invlist_populate_swatch()
    
    This function will be used in future commits

M       embed.fnc
M       embed.h
M       proto.h
M       regcomp.c

commit 7f381b5d2844559a8786bd05e3da23bf76d715ed
Author: Karl Williamson <[email protected]>
Date:   Fri Nov 25 12:49:02 2011 -0700

    regcomp.c:  Add invlist_search()
    
    This function does a binary search on an inversion list.  It will be
    used in future commits

M       embed.fnc
M       embed.h
M       proto.h
M       regcomp.c

commit 2ba7a6f05e6153ec831787bcf843cee3eafe3d70
Author: Karl Williamson <[email protected]>
Date:   Wed Nov 23 20:11:26 2011 -0700

    utf8.c: Refactor code slightly in prep
    
    Future commits will split up the necessary initialization into two
    components.  This patch prepares for that without adding anything new.

M       utf8.c

commit 30f3f991d7c77604e7b588969917be469894e513
Author: Karl Williamson <[email protected]>
Date:   Tue Nov 22 13:37:04 2011 -0700

    regcomp.c: Change internal #define name
    
    I have struggled to come up with a good name for this concept; and like
    the new one better than the old

M       regcomp.c

commit bac03619a9241602e705ed528d2c460968d4ed22
Author: Karl Williamson <[email protected]>
Date:   Tue Nov 22 12:06:41 2011 -0700

    utf8.c: New function to retrieve non-copy of swash
    
    Currently, swash_init returns a copy of the swash it finds.  The core
    portions of the swash are read-only, and the non-read-only portions are
    derived from them.  When the value for a code point is looked up, the
    results for it and adjacent code points are stored in a new element,
    so that the lookup never has to be performed again.  But since a copy is
    returned, those results are stored only in the copy, and any other uses
    of the same logical stash don't have access to them, so the lookups have
    to be performed for each logical use.
    
    Here's an example.  If you have 2 occurrences of /\p{Upper}/ in your
    program, there are 2 different swashes created, both initialized
    identically.  As you start matching against code points, say "A" =~
    /\p{Upper}/, the swashes diverge, as the results for each match are
    saved in the one applicable to that match.  If you match "A" in each
    swash, it has to be looked up in each swash, and an (identical) element
    will be saved for it in each swash.  This is wasteful of both time and
    memory.
    
    This patch renames the function and returns the original and not a copy,
    thus eliminating the overhead for stashes accessed through the new
    interface.  The old function name is serviced by a new function which
    merely wraps the new name result with a copy, thus preserving the
    interface for existing calls.
    
    Thus, in the example above, there is only one swash, and matching "A"
    against it results in only one new element, and so the second use will
    find that, and not have to go out looking again.  In a program with lots
    of regular expressions, the savings in time and memory can be quite
    large.
    
    The new name is restricted to use only in regcomp.c and utf8.c (unless
    XS code cheats the preprocessor), where we will code so as to not
    destroy the original's data.  Otherwise, a change to that would change
    the definition of a Unicode property everywhere in the program.
    
    Note that there are no current callers of the new interface; these will
    be added in future commits.

M       embed.fnc
M       embed.h
M       proto.h
M       utf8.c

commit 890da47a017e6dc25c4ee91123ff3e14003e1f0a
Author: Karl Williamson <[email protected]>
Date:   Tue Nov 22 09:18:28 2011 -0700

    utf8_heavy.pl: Add inversion status to cache key
    
    Contrary to what the debug statement said, what is being returned is a
    swash, and that swash is different from one that comes from the same
    file but differs in inversion, and so changing the INVERT_IT element
    messes things up for any existing one.  Heretofore it hasn't mattered
    because the swash returned is always a copy, and so it actually hasn't
    created any problems.  But future commits will stop the copying, so this
    would create problems then.
    
    The file will now have to be re-'do'ne to get an inverted list from it.

M       lib/utf8_heavy.pl

commit f741aeebf37688d1ca136b3c0fa4c4079cb53f2d
Author: Karl Williamson <[email protected]>
Date:   Sun Nov 20 19:10:19 2011 -0700

    uni/cache.t: Fix to handle regex compile time Uni props
    
    Future commits are planned to move the resolution of Unicode properties
    from regex execution time to compile time.  By moving the code into a
    BEGIN block, this .t can now handle both types.  Before this patch, it
    wouldn't show any activity at all if things are done at compile time.

M       t/uni/cache.t

commit a5c941a76d99bf2e5d6732290c7a5901b2447d96
Author: Karl Williamson <[email protected]>
Date:   Sun Nov 20 09:03:26 2011 -0700

    embed.fnc: swash_init() return value should not be ignored
    
    Otherwise can have memory leaks

M       embed.fnc
M       proto.h

commit 7f2f6f677be0834c266c5e955881903cc0d5323e
Author: Karl Williamson <[email protected]>
Date:   Sat Nov 19 16:50:33 2011 -0700

    utf8.c: Change name of static function
    
    This function has always confused me, as it doesn't return a swash, but
    a swatch.

M       embed.fnc
M       embed.h
M       lib/utf8_heavy.pl
M       proto.h
M       utf8.c

commit a8560a7ff321b695d72b1a905a6bd39a5b45d6f5
Author: Karl Williamson <[email protected]>
Date:   Sat Nov 19 14:49:20 2011 -0700

    utf8_heavy: Allow to be called from regcomp.c
    
    Future commits will cause regcomp.c to try to compile user-defined
    properties.  The caller stack is different for this, and there may be a
    package name as well that differs from the existing scheme.  This commit
    allows for this.

M       lib/utf8_heavy.pl

commit 5f451ceeb13dbd70ec6e689e72f340849e33ba22
Author: Karl Williamson <[email protected]>
Date:   Sat Nov 19 14:47:33 2011 -0700

    utf8_heavy: Add DEBUG statement
    
    This helps keep track of the recursion used.

M       lib/utf8_heavy.pl

commit f2a3b6a8ae820ed7ab2591d8dfb15ea1f0bbfc83
Author: Karl Williamson <[email protected]>
Date:   Sat Nov 19 14:41:26 2011 -0700

    utf8.c: Move test out of loops
    
    We set the upper limit of the loops before entering them to the min of
    the two possible limits, thus avoiding a test each time through

M       utf8.c

commit 2d8d82cb26024b2f5fe0426c6805ac4f178a325f
Author: Karl Williamson <[email protected]>
Date:   Sat Nov 19 14:37:35 2011 -0700

    mktables: Add a little stress to the tests
    
    This simply reverses the sort order so that the generated tests
    use the highest ranges instead of the lowest, making it less likely that
    tests will pass by chance; and also increasing performance issues in
    finding matches.

M       lib/unicore/mktables

commit 10af155cdec6bf0115d5c625e843d630351cd3eb
Author: Karl Williamson <[email protected]>
Date:   Sat Nov 19 14:22:00 2011 -0700

    utf8_heavy: Skip unnecessary operations
    
    The mktables generated tables are well-formed, already sorted, and with
    no extra items such as "+utf8::foo".  Thus we don't have to do these
    operations on them, but they are required on user-defined properties,
    and should $list be passed in as a parameter.
    
    This patch moves the code that does this to just the user-defined area

M       lib/utf8_heavy.pl

commit 18df1d63ae1ea9ac4abf3ac099d8986334d370be
Author: Karl Williamson <[email protected]>
Date:   Sat Nov 19 11:17:17 2011 -0700

    uni/class.t: Add test
    
    This new test makes sure that a regular expression that forward
    references a user-defined property works.

M       t/uni/class.t

commit 2caff13f4dd75c306d855c6d118db07a22e3e94b
Author: Karl Williamson <[email protected]>
Date:   Fri Nov 18 10:22:56 2011 -0700

    utf8_heavy: remove unused variable

M       lib/utf8_heavy.pl

commit ecbd26fb74467f76bcff4c91ed9176bd913ef970
Author: Karl Williamson <[email protected]>
Date:   Fri Nov 18 08:36:43 2011 -0700

    Comment additions, typos, white-space.

M       lib/unicore/mktables
M       lib/utf8_heavy.pl
M       regcomp.c
M       regcomp.h
M       utf8.c

commit a0c0a17827a358813012807f994844f593691971
Author: Karl Williamson <[email protected]>
Date:   Wed Nov 16 20:03:35 2011 -0700

    regexec.c: Add some comments to regclass_swash()

M       regexec.c

commit 353fd81465e44bfd8164130e439684d68657f7d4
Author: Karl Williamson <[email protected]>
Date:   Wed Nov 16 09:45:07 2011 -0700

    regexec.c: Remove unnecessary intermediate values

M       regexec.c
-----------------------------------------------------------------------

--
Perl5 Master Repository

Reply via email to