In perl.git, the branch smoke-me/khw-optimizer has been created
<http://perl5.git.perl.org/perl.git/commitdiff/b794bcd085f94e06d38f128719fd5fae9194127c?hp=0000000000000000000000000000000000000000>
at b794bcd085f94e06d38f128719fd5fae9194127c (commit)
- Log -----------------------------------------------------------------
commit b794bcd085f94e06d38f128719fd5fae9194127c
Author: Karl Williamson <[email protected]>
Date: Mon Sep 23 10:43:31 2013 -0600
XXX empty string
M regcomp.c
M regcomp.h
commit c21e2680001dfb6c28e5058e0f2eda8afb29adea
Author: Karl Williamson <[email protected]>
Date: Sun Sep 22 23:12:20 2013 -0600
regcomp.c: White-space, comments only
This moves the static functions introduced a few commits ago to more
logical places in the file, wraps some long lines to 79 columns, and
fixes a few nits in comments.
M regcomp.c
commit 5cfad46efd0a4e681171217d32287c2df2d41587
Author: Karl Williamson <[email protected]>
Date: Sun Sep 22 22:56:20 2013 -0600
regcomp.c: Remove unused parameter in static function
This parameter has been unused since a few commits ago in this
series.
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit 20d83bc240bec70cde1b371e05204319df7c9931
Author: Karl Williamson <[email protected]>
Date: Sun Sep 22 22:46:10 2013 -0600
Add some tests for the regex optimizer
We don't have the infrastructure to test the regex optimizer, and I'm
not sure how to do it properly, without tying the tests to particular
optimizations. What I did, however, was to go through the recently
changed optimizer code and write tests to exercise every branch, as far
as I could tell.
M t/re/pat_advanced.t
M t/re/re_tests
commit f82b3f203107f8646d04c25a9e701b1b40e64a62
Author: Karl Williamson <[email protected]>
Date: Sun Sep 22 22:36:57 2013 -0600
regcomp.c: Tighten optimizer for /li matches
The synthetic start class (ssc) generated by the regex optimizer
frequently has case-insensitive matching enabled, even if nowhere in the
pattern is there a /i. This commit ensures that the ssc of any pattern
without /i does not have /i set.
M regcomp.c
commit 868b7730b6f72c1dd2c0f72a3c3f95077596fca0
Author: Karl Williamson <[email protected]>
Date: Sun Sep 22 21:36:29 2013 -0600
XXX better msg: Teach regex optimizer to handle above-Latin1
Until this commit, the regular expression optimizer has essentially
punted on above-Latin1 code points. Under some circumstances, they
would be taken into account, more or less. With the advent of inversion
lists, it becomes feasible to actually handle them fully. This
commit changes the optimizer to use inversion lists. This required
rewriting the base level ...
M embed.fnc
M embed.h
M proto.h
M regcomp.c
M regcomp.h
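
To illustrate the inversion lists that the above-Latin1 commit relies on, here is a minimal sketch of a membership test. It is not code from regcomp.c; the function name invlist_contains and the plain uint32_t array representation are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An inversion list is a sorted array of code points. Element 0 starts a
 * range that IS in the set, element 1 starts a range that is NOT, and so
 * on, alternating. If the last element starts an "in" range, that range
 * runs to infinity. */
static int
invlist_contains(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t lo = 0, hi = len;   /* search window is [lo, hi) */

    assert(list != NULL || len == 0);

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* 'lo' elements are <= cp. An odd count means the nearest range start
     * at or below cp has an even index, so cp is in the set. */
    return (lo % 2) == 1;
}
```

With the list { 0x41, 0x5B, 0x100 }, this reports 'A' through 'Z' and every code point from U+100 upward as members, which is the kind of set a fixed 256-bit bitmap alone cannot describe.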
commit 4ccbdbea46c17ab51f9e53c5ad7e02aa1052a6b8
Author: Karl Williamson <[email protected]>
Date: Sun Sep 22 20:43:02 2013 -0600
regcomp.c: Add some static functions
This commit adds some functions that are currently unused, but will be
used in a future commit. This commit is essentially to make the
differences smaller in that commit, as 'diff' is getting confused and
not outputting the logical differences. The functions are added in a
block at the beginning of the file to avoid the 'diff' issues. A later
white-space only commit will move them to more appropriate positions.
M embed.fnc
M embed.h
M proto.h
M regcomp.c
M regcomp.h
commit 85247e8c6ce5e7c3e1daea68b3067fbb6665055e
Author: Karl Williamson <[email protected]>
Date: Mon Sep 9 20:33:48 2013 -0600
regcomp.c: Use macro accessor uniformly
These instances were using the structure field directly; everywhere else
uses a macro that hides the field's location in the structure. This
converts to use the macro everywhere.
M regcomp.c
commit 3d0a2af081a6a9be954c25a8745e50b53b7533d2
Author: Karl Williamson <[email protected]>
Date: Sat Sep 14 19:03:39 2013 -0600
regcomp.c: Optimize e.g. /[\w\W]/l into dot
It is unlikely for someone to include a Posix class and its complement
in the same bracketed character class, but looking for this and
optimizing it away helps an algorithm, coming in a future commit, that
looks at the synthetic start class.
This commit only does this for /l matching. For all other matching, if
we know at compile time what the posix classes match, this optimization
is already done.
M regcomp.c
commit d67d6aba74068b404d9d4b1ec1fd81b8a0d44af4
Author: Karl Williamson <[email protected]>
Date: Thu Sep 5 22:40:54 2013 -0600
Enlarge dummy regex pass1 compilation node
In pass 1 of compiling regular expressions, the needed size is
calculated. There is space allocated for a scratch node that can be
used for the things that the real one will hold in pass 2. It is valid
only while working on the current node, and gets overwritten in the next
node.
Until this commit, this scratch space was sized only for the smallest
node type, meaning that larger types could not use it for scratch. Now
it is sized to be the largest non-EXACTish node.
We could make it an array of 256 + overhead bytes instead to be able to
hold the EXACTish nodes, but I don't see a need for that now.
M regcomp.c
M regcomp.h
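
One way to size a scratch area for the largest of several node types, as the pass-1 commit above describes, is a union. The sketch below is only an illustration: the node layouts and the names small_node, big_node, scratch_node, scratch_reset and scratch_set_bit are hypothetical, and regcomp.c may well size its scratch node differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node layouts; the real regnode structs differ. */
struct small_node { unsigned char type; unsigned char flags; };
struct big_node   { unsigned char type; unsigned char flags;
                    unsigned char bitmap[32]; };

/* A union is at least as large as its largest member, so one scratch
 * object can stand in for either node type during a sizing pass. */
union scratch_node {
    struct small_node small;
    struct big_node   big;
};

/* The scratch is valid only for the current node; clear it for the next. */
static void
scratch_reset(union scratch_node *scratch)
{
    memset(scratch, 0, sizeof *scratch);
}

/* Use the scratch as a big node: set one bitmap bit and return the byte
 * that holds it. */
static unsigned char
scratch_set_bit(union scratch_node *scratch, unsigned bit)
{
    assert(bit < 8 * sizeof scratch->big.bitmap);
    scratch->big.bitmap[bit / 8] |= (unsigned char) (1u << (bit % 8));
    return scratch->big.bitmap[bit / 8];
}
```

If the scratch were declared as a struct small_node instead, scratch_set_bit would write past its end, which is the limitation the commit message describes for the old sizing.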
commit 9bb000f391b58fd5090a4a3799ff48885053dc2f
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 15:27:08 2013 -0600
regcomp.c: Use STR_WITH_LEN to avoid bookkeeping
By changing the order of the parameters to the static function
S_add_data, we can call it with STR_WITH_LEN and avoid a human having to
count characters.
M embed.fnc
M proto.h
M regcomp.c
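
The parameter reordering in the STR_WITH_LEN commit above matters because the macro expands to two arguments, string first and length second. A minimal sketch follows, assuming a macro defined in the same spirit as Perl's; record_data is a hypothetical stand-in, not the real S_add_data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Expands to TWO arguments: the literal and its length. sizeof counts the
 * trailing NUL, hence the -1. The "" s "" form rejects non-literals. */
#define STR_WITH_LEN(s) ("" s ""), (sizeof(s) - 1)

/* Hypothetical function whose last two parameters are (string, length),
 * the order the macro produces. Returns the length after checking that it
 * agrees with the string. */
static size_t
record_data(int tag, const char *s, size_t n)
{
    (void) tag;
    assert(strlen(s) == n);
    return n;
}
```

A call such as record_data(0, STR_WITH_LEN("abc")) passes "abc" and 3 without anyone counting characters by hand. Had the parameters been in (length, string) order, the macro could not be used.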
commit 366cbc7128f8e1f09b4f350285da58772310a7e2
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 15:07:44 2013 -0600
Rename regex flag bit for clarity
ANYOF_UNICODE_ALL doesn't mean every Unicode code point. It means those
above the Latin1 range. Rename it, while retaining the old one for back
compat.
M regcomp.c
M regcomp.h
M regexec.c
commit 26a26e949e92f435035962ace1377e690785b61a
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 14:55:16 2013 -0600
regcomp.c: Better DEBUGGING builds error detection
The code had a default: catch-all in the switch statement, but the
comments indicated that it was uncertain what all was being caught.
This changes this to panic only in DEBUGGING builds so that we can find
out if there are indeed other possibilities that we haven't handled, and
which could use better handling than the default, match everything.
The two known possibilities are given separate case: statements in
preparation for handling them differently.
M regcomp.c
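
The pattern described in the DEBUGGING commit above, a catch-all that panics only in debugging builds and otherwise keeps the safe default, might look roughly like the sketch below. The enum values, MATCH_EVERYTHING and classify are hypothetical names for illustration; only the DEBUGGING symbol comes from the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

enum node_kind { KIND_A, KIND_B, KIND_OTHER };  /* hypothetical kinds */

#define MATCH_EVERYTHING 1

static int
classify(enum node_kind kind)
{
    switch (kind) {
    case KIND_A:                 /* known possibility, own case label */
        return MATCH_EVERYTHING;
    case KIND_B:                 /* known possibility, own case label */
        return MATCH_EVERYTHING;
    default:
#ifdef DEBUGGING
        /* Debugging builds: fail loudly so the unhandled kind is found. */
        fprintf(stderr, "panic: unexpected node kind %d\n", (int) kind);
        abort();
#else
        /* Other builds: keep the safe fallback of matching everything. */
        return MATCH_EVERYTHING;
#endif
    }
}
```

Giving the known kinds their own case labels changes nothing yet, but each can later be handled differently without touching the catch-all.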
commit 8b80a07cd4895da4d88491ea86b95e74d899d1c7
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 14:49:37 2013 -0600
regcomp.c: Change some static parameters to const
I found I needed const in a planned future commit.
M embed.fnc
M proto.h
M regcomp.c
commit 099022a31579d3730c00f5d42b70ba27b54c1ed1
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 14:27:53 2013 -0600
Retain an inversion list's mortality in its replacement
A couple of inversion list handling functions end up sometimes creating
a new inversion list, replacing the old one instead of modifying it.
This commit causes the replacement list to have the same mortality (or
lack of it) as the old one. That is, mortality is now preserved across
these operations.
M regcomp.c
commit d17c711349542d42d498bc1c87ed0e166b967f87
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 14:04:43 2013 -0600
perl.c: Clean up some SV*s at termination
These were omitted from the cleanup done when PERL_DESTRUCT_LEVEL is
non-zero.
M perl.c
commit efa5f7bfd8a5d1c1ce259afd297d067547bf029b
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 11:19:02 2013 -0600
regcomp.c: Add parameter to static function
This parameter will be used in future commits. This commit is really
only to make the difference listing smaller in those, by committing
separately just the book-keeping parts. This parameter also requires
passing the aTHX_ thread parameter.
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit ee196112fabc480ad38e0a4af9928ab47401cb9e
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 10:59:01 2013 -0600
Remove PL_ASCII; use existing array slots for it
PL_ASCII contains an inversion list to match the ASCII-range code
points. It is unusable outside the core regular expression code because
all the functions that manipulate inversion lists are defined only
within a few core files. Therefore no outside code should be depending
on it.
It turns out that there are arrays of similar inversion lists, and these
all have slots which should have this inversion list in them. This
commit fills them, instead of using PL_ASCII.
M embedvar.h
M intrpvar.h
M regcomp.c
M sv.c
commit ff8f9a634ff67c21ac012ee2e97dee96755c67c4
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 10:51:24 2013 -0600
regcomp.c: Typos in comments; Fix another comment
The non-typo fix is the result of allowing a parameter to the function
to be NULL, and not updating the comments to reflect that.
M regcomp.c
commit 3429f4956387d81a34c4d46aa85a80f88d5486d9
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 10:39:14 2013 -0600
regcomp.c: Fix syntax error in #ifdef'd out code
This line is currently not compiled, but would fail if the #ifdef is
changed.
M regcomp.c
commit d668e9c42669fc86bab20403119576717763ee42
Author: Karl Williamson <[email protected]>
Date: Thu Aug 15 10:36:29 2013 -0600
perl.h: Don't pollute global namespace
These structures are used internally in the regular expression files,
and are declared here only because of #include ordering issues. Wrap
them in an #ifdef so they are visible only to the correct files.
M perl.h
commit 8a1ba0ca0c8c4316df3fa823593513d0b79611f4
Author: Karl Williamson <[email protected]>
Date: Wed Aug 14 21:13:52 2013 -0600
Make typedef fully typedef
The regcomp.c struct RExC_state_t has not been usable fully as a
typedef, requiring the 'struct' at times. This has wasted my time, and I
presume others', when we forget to use 'struct' in the circumstances
that require it, but it has never been a big enough issue for me to
spend tuits on it. But, working on something else,
I finally came to the realization of what the problem is. It is because
proto.h is #included before regcomp.h is, and so functions that are
declared in proto.h that have something that is a RExC_state_t as a
parameter don't know that it is a typedef because that is defined in
regcomp.h. A way around this is already used for other similar
structures, and that is to declare them in perl.h which is always read
in before proto.h, leaving the definitions to regcomp.h. Thus proto.h
knows enough to compile.
The structure was already declared in perl.h; just not typedef'd.
Otherwise proto.h would not know about it at all. This patch moves two
regcomp.c related declarations in perl.h to the same section as the
others, and changes the one for RExC_state_t to be a typedef. All the
'struct' uses are removed.
M embed.fnc
M embed.h
M perl.h
M proto.h
M regcomp.c
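
The header-ordering fix in the typedef commit above can be sketched in a single file. The section comments mark which header each part stands in for, following the commit message. The member depth and the function state_depth are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Role of perl.h: read first. Declares the typedef, with no members yet. */
typedef struct RExC_state_t RExC_state_t;

/* Role of proto.h: prototypes can now use the bare typedef name, because
 * the name is already known even though the struct is still incomplete. */
static int state_depth(const RExC_state_t *state);

/* Role of the later header: the full definition. */
struct RExC_state_t {
    int depth;   /* hypothetical member, for illustration only */
};

static int
state_depth(const RExC_state_t *state)
{
    assert(state != NULL);
    return state->depth;
}
```

Without the early typedef line, the prototype would have to spell out 'struct RExC_state_t', which is the inconvenience the commit removes.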
commit a4486fbc9820f1a50dd066713bf4aca13db45356
Author: Karl Williamson <[email protected]>
Date: Wed Aug 14 11:39:38 2013 -0600
regcomp.h: Create new typedef synonym for clarity
This commit finishes (at least for now) removing some of the overloading
of the term 'class'. A 'regnode_charclass_class' node contains space for
storing the posix classes it matches that are never defined until the
moment of matching because they are subject to the current run-time
locale. This commit creates a typedef 'regnode_charclass_posixl'
synonym that doesn't re-use the term 'class' for two different purposes.
M perl.h
M pod/perlreguts.pod
M regcomp.h
commit acac108054ebdd8027835e66421d9411dc615f8a
Author: Karl Williamson <[email protected]>
Date: Fri Aug 9 12:21:53 2013 -0600
regcomp.h: Parenthesize macro formal parameter
Not doing so can cause problems, so it is standard procedure to
parenthesize all parameters within a macro definition.
M regcomp.h
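
The problem that the macro-parenthesizing commit above guards against can be shown with a minimal sketch. The two macros are hypothetical and are not taken from regcomp.h.

```c
#include <assert.h>

/* Hypothetical macros for illustration. */
#define TIMES_TWO_UNSAFE(x) (x * 2)     /* parameter not parenthesized */
#define TIMES_TWO_SAFE(x)   ((x) * 2)   /* parameter parenthesized */

/* Self-check: aborts if the expansions do not behave as described. */
static void
check_macro_expansions(void)
{
    /* Expands to (1 + 3 * 2): the multiplication binds to the 3 only. */
    assert(TIMES_TWO_UNSAFE(1 + 3) == 7);

    /* Expands to ((1 + 3) * 2): the whole argument is doubled. */
    assert(TIMES_TWO_SAFE(1 + 3) == 8);
}
```

Both macros give the same answer for a simple argument such as 4, so the unparenthesized form can go unnoticed until a caller passes an expression.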
commit 8afc12f2bb6e40f4e3956fdb8811af17e9e5669f
Author: Karl Williamson <[email protected]>
Date: Fri Aug 9 11:51:09 2013 -0600
regcomp.h: Add better named synonyms
This continues the process started two commits ago of removing some of
the overloading of the term 'class'.
In this case, this commit adds some #defines referring to the portions
of the regnode associated with bracketed character classes, the ANYOF
node. Specifically, those portions that deal with the Posix character
classes, like \w and [:punct:] under /l (locale) matching, are renamed,
substituting POSIXL for CLASS. POSIXL is already used for POSIX-related
things under /l. I remember being terribly confused about this when I
started reading this code: one had a class within a class. This
should clarify things somewhat.
The old names are retained in case files outside the core #include and
use them (there are a few such in cpan).
M regcomp.c
M regcomp.h
M regexec.c
commit ec7136883bc496b2795d2041f2177f70d46b1f81
Author: Karl Williamson <[email protected]>
Date: Sat Sep 14 18:57:26 2013 -0600
regcomp.c: Clarify comment
This continues the process of removing some overloading of the word
'class', by changing this comment to use 'bracketed class', and
re-wrapping it.
M regcomp.c
commit f507db14ee54e7e4dae79a556c3c2c02b3e6e392
Author: Karl Williamson <[email protected]>
Date: Tue Aug 6 21:41:53 2013 -0600
regcomp.h: Move #define
This moves it to be adjacent to similar #defines
M regcomp.h
commit 9ecadaf5392557e9465a75a4a1f975156acba661
Author: Karl Williamson <[email protected]>
Date: Wed Aug 14 11:19:18 2013 -0600
regcomp.c: Change names of some static functions
The term 'class' is very overloaded in regex code and documentation.
perlrecharclass.pod calls the dot (matching any char) a class, and
calls the [] form "bracketed character classes". There are other
meanings as well. This is the first commit in a short series that
removes some of those overloadings.
One instance of class is the "synthetic start class", generated by the
regex optimizer to be a list of all the code points a successful match
could possibly start with. This is useful in more quickly finding where
to start looking in matching against a target string. Prior to this
commit, the routines that referred to this began with 'cl_', and the
formal parameters were 'cl', which could mean any class. This commit
changes those instances of 'cl' to 'ssc' to indicate this is the only
type of class that is being handled.
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit b6d5c7d55054188d9cef69a9027b9b67a9eb214a
Author: Karl Williamson <[email protected]>
Date: Wed Aug 14 10:01:53 2013 -0600
regcomp.c: Rework static function call; comments
The previous commit just extracted out code into a function. This
commit renames a parameter for clarity, combines two parameters to make
the interface cleaner, and adds and moves comments around.
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit 2f8b639a4b2e7f1522c9988bf847c754f0a9f9bb
Author: Karl Williamson <[email protected]>
Date: Wed Aug 14 11:09:58 2013 -0600
regcomp.c: Extract code into separate function
A future commit will use this functionality from another place. For
now, just cut and paste, and do the minimal ancillary work to get it to
compile and pass.
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit 3aaf80fcbdd646df31d8e354df60457d7985052a
Author: Karl Williamson <[email protected]>
Date: Fri Aug 2 12:33:07 2013 -0600
regcomp.c: Use PL_sv_undef instead of NULL in an AV
The NULL gets turned into an SVt_NULL anyway. This array is read only
by S_core_regclass_swash() in regexec.c. That uses an SvROK, so it
doesn't have to change.
This commit also beefs up the comments around this operation
M regcomp.c
commit df64dd727f6715b3376f32e458137baa86035890
Author: Karl Williamson <[email protected]>
Date: Thu Aug 1 14:49:29 2013 -0600
Add regnode struct for synthetic start class
As part of extending the regular expression optimizer to properly handle
above Latin1 code points, I need an inversion list to contain which code
points the synthetic start class (ssc) matches.
The ssc currently is the same as a locale-aware ANYOF node, which uses
the struct of a regular ANYOF node, plus some extra fields at the end.
This commit creates a new typedef for ssc use, which is the locale-aware
ANYOF node, plus an extra SV* at the end to hold the inversion list.
M embed.fnc
M embed.h
M perl.h
M proto.h
M regcomp.c
M regcomp.h
commit 050cc0bb8e2aa1e1e733371c075c07e42603cf62
Author: Karl Williamson <[email protected]>
Date: Wed Jul 24 19:56:24 2013 -0600
regcomp.c: Move a #define, add a similar one
Future commits will use this #define (and the new one) earlier in the
file than currently defined.
M regcomp.c
commit 30bd78f89852ec7a49cfdd205207470876404fd1
Author: Karl Williamson <[email protected]>
Date: Tue Jul 23 10:01:29 2013 -0600
Add inversion list for U+80 - U+FF
This is the upper half of the Latin1 range. This simplifies some code
very slightly, but will be of use in future commits.
M charclass_invlists.h
M embedvar.h
M intrpvar.h
M regcomp.c
M regen/mk_invlists.pl
M sv.c
commit d88eddc1cc2dfd90b8aeb1e336eeae33b343c3b1
Author: Karl Williamson <[email protected]>
Date: Sun Jul 21 21:13:38 2013 -0600
regcomp.c: Extract code into separate function
This is in preparation for it to be called from more than one place, in
a future commit.
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit e1c4754b6a4406eb7c1e1c61adb97655b509c889
Author: Karl Williamson <[email protected]>
Date: Sun Jul 21 10:10:56 2013 -0600
regcomp.c: Remove redundant matching possibilities
The flag ANYOF_UNICODE_ALL is for performance. It is set when the
inversion list for the ANYOF node includes every code point above
Latin1, and avoids runtime searching through the list. We don't need
both, as the flag being set short-circuits even looking at the other
list. By removing the code points from the list, we perhaps will get
rid of the list entirely, thus saving some operations, or will shorten
it so that later binary searches run faster.
M regcomp.c
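
The trimming described in the ANYOF_UNICODE_ALL commit above can be sketched on a plain array. This is an illustration under stated assumptions, not the regcomp.c code: the helper name invlist_drop_above_latin1 is hypothetical, the list is a sorted uint32_t array whose even-indexed elements start "in" ranges, and the caller already knows that the list matches every code point from U+100 upward and that a separate flag will stand for them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIRST_ABOVE_LATIN1 0x100u

/* Trims 'list' in place so that it describes only 0x00..0xFF.
 * 'list' must have room for one more element. Returns the new length. */
static size_t
invlist_drop_above_latin1(uint32_t *list, size_t len)
{
    /* Precondition: an odd length means the final range is open-ended,
     * and that range must begin at or below U+100. */
    assert(len % 2 == 1 && list[len - 1] <= FIRST_ABOVE_LATIN1);

    if (list[len - 1] == FIRST_ABOVE_LATIN1)
        return len - 1;               /* final range was entirely above */

    list[len] = FIRST_ABOVE_LATIN1;   /* close the final range at 0xFF */
    return len + 1;
}
```

A list holding only { 0x100 } shrinks to length 0, which corresponds to getting rid of the list entirely. In other cases the list stays about the same size but no longer carries information that the flag already supplies.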
commit 8f7b6077286b6fd0d3453c078ac65f759769ccbe
Author: Karl Williamson <[email protected]>
Date: Sun Jul 21 08:21:34 2013 -0600
regcomp.c: Centralize assignment
It's better to do something in one common place than two. This properly
initializes the regex opcode for the synthetic start class when it is
created, rather than at the end where the code has to be repeated to get
all instances.
M regcomp.c
commit 7aaf241a8df58474fa6de4baddf3ea54c32391e8
Author: Karl Williamson <[email protected]>
Date: Thu Sep 12 19:42:51 2013 -0600
perlreguts: Bring up-to-date
Various changes have been made to regcomp.c that didn't make it into
perlreguts until now.
M pod/perlreguts.pod
commit 9af2b4b1a800380bc554580dc45cb062688a6728
Author: Karl Williamson <[email protected]>
Date: Thu Sep 12 18:03:19 2013 -0600
perlreguts.pod: Nits
M pod/perlreguts.pod
commit c7f276d1e28cfc4b4bcbb047f1e4d15a3eb2ef5a
Author: Karl Williamson <[email protected]>
Date: Sat Sep 14 13:17:21 2013 -0600
regcomp.c: Convert another I32 to SSize_t
This code is normally #ifdef'd out, and so was missed in the earlier
conversions, commit ed56dbcb51c55e631d5f4931f88efe008e5349c4.
M regcomp.c
-----------------------------------------------------------------------
--
Perl5 Master Repository