In perl.git, the branch smoke-me/khw-regex has been created
<http://perl5.git.perl.org/perl.git/commitdiff/ff4eb3ec0b27d3e28b7143b80c1952fa0abbdb6c?hp=0000000000000000000000000000000000000000>
at ff4eb3ec0b27d3e28b7143b80c1952fa0abbdb6c (commit)
- Log -----------------------------------------------------------------
commit ff4eb3ec0b27d3e28b7143b80c1952fa0abbdb6c
Author: Karl Williamson <[email protected]>
Date: Wed Aug 1 18:28:59 2012 -0600
regcomp.c: Remove unnecessary variable
This variable was used because another was declared 'register'. But
that declaration was removed, making the temporary variable redundant.
M regcomp.c
commit 258a252d88eb165a260bad61e908102294248c67
Author: Karl Williamson <[email protected]>
Date: Wed Aug 1 17:04:13 2012 -0600
regcomp.c: inline trivial static function
M embed.fnc
M proto.h
M regcomp.c
commit 926425d4dcedb8d65a283799902030b7e2da33ba
Author: Karl Williamson <[email protected]>
Date: Wed Aug 1 15:12:23 2012 -0600
regcomp.c: Fix \N{} multi-char fold buffer boundary bug
An earlier commit in this topic branch fixed the bug (for non-\N{} cases)
where a multi-character fold could try to span two EXACTFish
nodes, where they are split because the first one would otherwise
contain too long a string.
This commit extends that fix to include characters entered via \N{...}.
It does this by causing \N handling to be split, so that if the \N
resolves to a single code point, it goes through the normal processing,
so that it no longer bypasses the code that was added in the earlier
commit.
M regcomp.c
M t/re/pat_advanced.t
M t/re/re_tests
commit ebec2cd2b98aa519ebde1cdda31d186906f2b257
Author: Karl Williamson <[email protected]>
Date: Wed Aug 1 14:49:39 2012 -0600
regcomp.c: Revise API for static function
This is to allow future changes. The function now returns success or
failure, and the created regnode (if any) is set via a parameter
pointer.
I removed the 'register' declaration to get this to work, because
such declarations are considered bad form these days, e.g.,
http://stackoverflow.com/questions/314994/whats-a-good-example-of-register-variable-usage-in-c
M embed.fnc
M embed.h
M proto.h
M regcomp.c
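The API shape described in the commit above (return success or failure, hand back the created node through a parameter pointer) can be sketched as follows. This is only an illustration: toy_node and toy_parse_digit are hypothetical names, not the static function or the regnode type in regcomp.c, and letting the caller pass NULL for the node pointer is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature node type; this is not Perl's regnode. */
typedef struct toy_node {
    int type;
    int value;
} toy_node;

/* Returns 1 on success and 0 on failure.  The node, if one is created, is
 * stored through node_p rather than being the return value.  In this sketch
 * (an assumption, not taken from regcomp.c) a caller that only wants to
 * validate the input may pass NULL for node_p. */
static int toy_parse_digit(const char *s, toy_node *storage, toy_node **node_p)
{
    assert(s != NULL);

    if (*s < '0' || *s > '9') {
        if (node_p)
            *node_p = NULL;     /* failure: no node was created */
        return 0;
    }

    if (node_p) {
        assert(storage != NULL);
        storage->type = 1;
        storage->value = *s - '0';
        *node_p = storage;      /* success: report the node to the caller */
    }
    return 1;
}
```

Separating the status from the node leaves room for outcomes such as "succeeded, but produced no node", which a single pointer return value cannot express.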
commit 6ae2acd94504fcbaf0bce3c8b229731d75f0b7bf
Author: Karl Williamson <[email protected]>
Date: Mon Jun 18 13:09:38 2012 -0600
regcomp.c: Fix multi-char fold bug
Input text to be matched under /i is placed in EXACTFish nodes. The
current limit on such text is 255 bytes per node. Even if we raised
that limit, it will always be finite. If the input text is longer than
this, it is split across 2 or more nodes. A problem occurs when that
split occurs within a potential multi-character fold. For example, if
the final character that fits in a node is 'f', and the next character
is 'i', it should be matchable by LATIN SMALL LIGATURE FI, but because
Perl isn't structured to find multi-char folds that cross node
boundaries, we will miss it.
The solution presented here isn't optimum. What we do is try to prevent
all EXACTFish nodes from ending in a character that could be at the
beginning or middle of a multi-char fold. That prevents the problem.
But the problem only occurs if the input text really is
a multi-char fold, which happens much less frequently. For example,
we try to not end a full node with an 'f', but the problem doesn't
actually occur unless the adjacent following node begins with an 'i' (or
one of the other characters that 'f' participates in). That is, this
patch splits when it doesn't need to.
At the point of execution for this patch, we only know that the final
character that fits in the node is that 'f'. The next character remains
unparsed, and could be in any number of forms, a literal 'i', or a hex,
octal, or named character constant, or it may need to be decoded (from
'use encoding'). So look-ahead is not really viable.
So finding if a real multi-character fold is involved would have to be
done later in the process, when we have full knowledge of the nodes, at
the places where join_exact() is now called, and would require inserting
a new node(s) in the middle of existing ones.
This solution seems reasonable instead.
It does not yet address named character constants (\N{}) which currently
bypass the code added here.
M embedvar.h
M handy.h
M intrpvar.h
M regcomp.c
M sv.c
M t/re/pat_advanced.t
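A minimal sketch of the splitting rule described in the commit above, assuming single-byte text and a tiny hand-written table. The names node_fill_len and is_nonfinal_fold_char are hypothetical and the table lists only two letters for illustration; the real data is generated by mktables and covers all of Unicode. The fallback for a run made up entirely of non-final characters is also an assumption of this sketch, not something the commit message specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the generated table: returns nonzero if c can
 * appear in a non-final position of some multi-character fold.  Only 'f'
 * (as in "fi") and 's' (as in "ss") are listed here. */
static int is_nonfinal_fold_char(unsigned char c)
{
    return c == 'f' || c == 'F' || c == 's' || c == 'S';
}

/* Returns how many bytes of text[0..len) should go into the next node,
 * whose capacity is max_len.  If the text does not all fit, back up so that
 * the full node does not end in a character that could begin or continue a
 * multi-char fold. */
static size_t node_fill_len(const char *text, size_t len, size_t max_len)
{
    size_t n;

    assert(text != NULL && max_len > 0);

    if (len <= max_len)
        return len;             /* everything fits; no split is needed */

    n = max_len;
    while (n > 0 && is_nonfinal_fold_char((unsigned char) text[n - 1]))
        n--;

    /* Assumption of this sketch: if every candidate is a non-final
     * character, use a full node rather than emit an empty one. */
    return n > 0 ? n : max_len;
}
```

With a capacity of 4, the text "abcfi" is split as "abc" + "fi", so the "fi" pair stays within one node. The text "abcfx" is split the same way even though no fold is involved, which is the unnecessary splitting the commit message acknowledges.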
commit 2bc2d2fe9b12d381cbada518ba3ce28355b22cd0
Author: Karl Williamson <[email protected]>
Date: Mon Jun 18 12:55:42 2012 -0600
mktables: Generate tables for chars that aren't in final fold pos
This starts with the existing table that mktables generates that lists
all the characters in Unicode that occur in multi-character folds, and
aren't in the final positions of any such fold.
It generates data structures with this information to make it quickly
available to code that wants to use it. Future commits will use these
tables.
M charclass_invlists.h
M handy.h
M l1_char_class_tab.h
M regen/mk_PL_charclass.pl
M regen/mk_invlists.pl
commit 8577e0dac3485c96e2919d41fa86e48092316fc2
Author: Karl Williamson <[email protected]>
Date: Mon Jun 18 12:44:55 2012 -0600
regen/mk_invlists: Add mode to generate above-Latin1 only
This change adds the ability to specify that an output inversion list is
to contain only those code points that are above Latin-1. Typically,
the Latin-1 ones will be accessed from some other means.
M regen/mk_invlists.pl
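For readers unfamiliar with inversion lists, the idea of keeping only the part above Latin-1 can be sketched as below. This shows the general data structure only: the toy_ names are hypothetical, the array is a bare list of range boundaries without whatever header Perl's own inversion lists carry, and the clipping logic is one plausible way to do it rather than what mk_invlists.pl does.

```c
#include <assert.h>
#include <stddef.h>

/* An inversion list is a sorted array of code points.  Element 0 starts a
 * range that is in the set, element 1 starts the following range that is
 * not in the set, and so on, alternating. */

/* Returns nonzero if cp is in the set described by list[0..n). */
static int toy_invlist_contains(const unsigned long *list, size_t n,
                                unsigned long cp)
{
    size_t lo = 0, hi = n;

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo % 2) == 1;       /* an odd count means cp is inside a range */
}

/* Copies into out the part of list that lies above Latin-1 (code points
 * >= 0x100) and returns the new element count.  out must have room for n
 * elements. */
static size_t toy_invlist_above_latin1(const unsigned long *list, size_t n,
                                       unsigned long *out)
{
    size_t i = 0, m = 0;

    assert(out != NULL);

    while (i < n && list[i] < 0x100)
        i++;                    /* skip boundaries that lie within Latin-1 */

    if (i % 2 == 1) {           /* a range was still open at 0xFF */
        if (i < n && list[i] == 0x100)
            i++;                /* it ended exactly at 0xFF: drop its end */
        else
            out[m++] = 0x100;   /* it straddles the boundary: clip it */
    }
    while (i < n)
        out[m++] = list[i++];
    return m;
}
```

For example, the list {0x41, 0x5B, 0xE0, 0x17F} (the ranges 0x41..0x5A and 0xE0..0x17E) becomes {0x100, 0x17F}: the first range is dropped and the second is clipped to start at 0x100.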
commit 913ff4783f932cf0502bbefcec21ab95cb882aa3
Author: Karl Williamson <[email protected]>
Date: Mon Jun 18 12:38:41 2012 -0600
Unicode::UCD::prop_invlist() Allow to return internal property
This creates an optional undocumented parameter to this function to
allow it to return the inversion list of an internal-only Perl property.
This will be used by other functions in Perl, but should not be
documented, as we don't want to encourage the use of internal-only
properties, which are subject to change or removal without notice.
M lib/Unicode/UCD.pm
commit aac234eaaf5a264d27a4a1811f49a201ea393464
Author: Karl Williamson <[email protected]>
Date: Mon Jun 18 12:37:52 2012 -0600
mktables: Add comment to gen'd data file
M lib/unicore/mktables
commit 7566eca19b016c381b1e162b3f1aef64bb395a20
Author: Karl Williamson <[email protected]>
Date: Mon Jun 18 12:22:41 2012 -0600
mktables: grammar in comments
M lib/unicore/mktables
commit 627e4d0d05b595e152987754698fae41192c4321
Author: Karl Williamson <[email protected]>
Date: Mon Jun 18 12:20:42 2012 -0600
regen/mk_PL_charclass.pl: Remove obsolete code
Octals are no longer checked via this mechanism.
M regen/mk_PL_charclass.pl
commit f6d909b7f5efab877ff6837f9ca0abe5645b8d9b
Author: Karl Williamson <[email protected]>
Date: Mon Jun 18 11:51:43 2012 -0600
regcomp.c: Make invlist_search() usable from re_comp.c
This was a static function which I couldn't get to be callable from the
debugging version of regcomp.c. This makes it public, but known only
in the regcomp.c source file. It changes the name to begin with an
underscore so that if someone cheats by adding preprocessor #defines,
they still have to call it with the name that convention indicates is a
private function.
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit c032738a2075d08b1a49cc7a1a753013cf2ae4e6
Author: Karl Williamson <[email protected]>
Date: Mon Jun 18 11:41:18 2012 -0600
perlop: clarify wording
M pod/perlop.pod
commit 7a47bd3f8663ea72d56d91e5bf497ec44d0fffb3
Author: Karl Williamson <[email protected]>
Date: Sat Jun 16 20:02:07 2012 -0600
regcomp.c: Rename static fcn to better reflect its purpose
This function handles \N of any ilk, not just named sequences.
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit e30cd0208e00434d7a2f987e050bc18ee0e31d1a
Author: Karl Williamson <[email protected]>
Date: Sat Jun 16 19:55:15 2012 -0600
regcomp.c: Make comment more accurate
M regcomp.c
commit f2d96eb391b4b2b6de8e8def742b266365c62f96
Author: Karl Williamson <[email protected]>
Date: Sat Jun 16 19:52:12 2012 -0600
regcomp.c: Can now do /u instead of forcing to utf8
Now that there is a /u modifier, a regex doesn't have to be in UTF-8 in
order to force Unicode semantics. Change this relic from the past.
M regcomp.c
commit 7c349068b656634c22086db96c361253046107bb
Author: Karl Williamson <[email protected]>
Date: Wed Jun 6 15:02:43 2012 -0600
regcomp.c: Comments update
This adds some comments and white-space lines, and updates other
comments to account for the fact that trie handling has changed since
they were written.
M regcomp.c
commit c38e58a6a0f111f6935c51fee1bf00063c7558c4
Author: Karl Williamson <[email protected]>
Date: Mon May 28 10:49:37 2012 -0600
regcomp.c: Remove variable whose value needed just once
Previous commits have removed all but one instance of using this
variable, so just use the expression it equates to.
M regcomp.c
commit 25790fa516b96a1d5585d46e8fbd2bb58278968c
Author: Karl Williamson <[email protected]>
Date: Mon May 28 10:42:03 2012 -0600
regcomp.c: White-space only
This indents and outdents to compensate for newly formed and orphan
blocks, respectively; and reflows comments to fit in 80 columns.
M regcomp.c
commit fb9c004ac36c5b4d2ef6ecf4b7e0e8eb77599c3d
Author: Karl Williamson <[email protected]>
Date: Sun May 27 01:08:46 2012 -0600
regcomp.c: Trade stack space for time
Pass 1 of regular expression compilation merely calculates the size it
will need. (Note that Yves and I both think this is very suboptimal
behavior.) Nothing is written out during this pass, but sizes are
just incremented. The code in regcomp.c all knows this, and skips
writing things in pass 1. However, when folding, code in other files is
called which doesn't have this size-only mode, and always writes its
results out. Currently, regcomp handles this by passing to that code a
temporary buffer allocated for the purpose. In pass1, the result is
simply ignored; in pass2, the results are copied to the correct final
destination.
We can avoid that copy by making the temporary buffer large enough to
hold the whole node, and in pass1, use it instead of the node. The
non-regcomp code writes to the same relative spot in the buffer that it
will use for the real node. In pass2 the real destination is used, and
the fold gets written directly to the correct spot.
Note that this increases the size pushed onto the stack, but code is
ripped out as well.
However, the main reason I'm doing this is not this speed-up; it is
because it is needed by future commits to fix a bug.
M regcomp.c
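The scratch-buffer arrangement described in the commit above can be sketched roughly as follows. Everything here is hypothetical: toy_fold stands in for the folding code outside regcomp.c that always writes its result, toy_build_node for the node-building loop, and a NULL destination is used to signal the sizing pass. Simple ASCII lowercasing replaces real case folding.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TOY_NODE_MAX 255        /* hypothetical per-node payload limit */

/* Stand-in for folding code that has no size-only mode: it always writes
 * the folded byte to dest and returns the number of bytes written. */
static size_t toy_fold(unsigned char c, char *dest)
{
    *dest = (char) tolower(c);
    return 1;
}

/* Builds one node's payload from text[0..len) and returns its length.
 * In the sizing pass (real_node == NULL) the folds land in a stack scratch
 * buffer as large as a whole node and are simply discarded.  In the writing
 * pass they land directly in real_node.  Either way each fold is written at
 * the same relative offset, so no copy is needed afterwards. */
static size_t toy_build_node(const char *text, size_t len, char *real_node)
{
    char scratch[TOY_NODE_MAX];
    char *dest = real_node ? real_node : scratch;
    size_t used = 0;
    size_t i;

    assert(text != NULL && len <= TOY_NODE_MAX);

    for (i = 0; i < len; i++)
        used += toy_fold((unsigned char) text[i], dest + used);
    return used;
}
```

A caller would run it once with a NULL destination to learn the size, allocate, and then run it again with the real destination. The price is a node-sized buffer on the stack for each call, which is the trade named in the commit subject.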
commit 43312570a1135c05fdc9153dddd1bd9826d6c242
Author: Karl Williamson <[email protected]>
Date: Sun May 27 01:04:39 2012 -0600
regcomp.c: Use mnemonic not numeric constant
Future commits will add other uses of this number.
M regcomp.c
commit 04b71f26db00eb7309b9efd417c722de571f0e2c
Author: Karl Williamson <[email protected]>
Date: Sat May 26 22:19:22 2012 -0600
regcomp.c: Resolve EBCDIC inconsistency towards simpler
This code has assumed that to_uni_fold() returns its folds in Unicode
(i.e. Latin1) rather than native EBCDIC. Other code in the core
assumes the opposite. One has to change. I'm changing this one, as the
issues should be dealt with at the lowest level possible, which is in
to_uni_fold(). Since we don't currently have an EBCDIC platform to test
on, making sure that it all hangs together will have to be deferred
until such time as we do.
By doing this we make this code simpler and faster. The fold has
already been calculated; we just need to copy it to the final place
(done in pass2).
M regcomp.c
commit 60edcac7b15f283ef98d01f7f5f0ffc2efdbbf7f
Author: Karl Williamson <[email protected]>
Date: Sat May 26 21:39:32 2012 -0600
regcomp.c: Use function instead of repeating its code
A new flag to to_uni_fold() causes it to do the same work that this code
does, so just call it.
M regcomp.c
commit a07ddcd0de968b471beb3bc096b8a7bb79bbfc2c
Author: Karl Williamson <[email protected]>
Date: Sat May 26 14:19:18 2012 -0600
regcomp.c: Remove (almost) duplicate code
A previous commit opened the way to refactor this so that the two
fairly lengthy code blocks that are identical (except for changing the
variable <len>) can have one of them removed.
M regcomp.c
commit 4755d0e71f789f07e22459fc3e014e56c3aff9e2
Author: Karl Williamson <[email protected]>
Date: Thu May 24 22:14:04 2012 -0600
regcomp.c: Refactor so can remove duplicate code
This commit prepares the way for a later commit to remove a chunk of
essentially duplicate code. It does this at the cost of an extra
test of a boolean each time through the loop. But it avoids calculating
the fold, a potentially expensive operation, unless necessary. Previously,
when the next input was a quantifier, the calculated fold was discarded,
unused. This commit skips the calculation in that case.
M regcomp.c
commit b7fc285b28ae6eef2a1cb403711f53878e66fccc
Author: Karl Williamson <[email protected]>
Date: Thu May 24 21:39:58 2012 -0600
Revert "regcomp.c: Move duplicated code to inline function"
This reverts commit 1ceb3049131abe6184db5a55104a620ffea6958d.
M regcomp.c
commit 631ee5bced178aaccdace78a8a32113abae49d35
Author: Karl Williamson <[email protected]>
Date: Sun May 6 08:10:33 2012 -0600
regcomp.c: Move duplicated code to inline function
This simply extracts the code to one function with only required
ancillary changes. Later commits will clean things up.
M regcomp.c
-----------------------------------------------------------------------
--
Perl5 Master Repository