In perl.git, the branch smoke-me/khw-clump has been created
<http://perl5.git.perl.org/perl.git/commitdiff/a95df5d20ee28f5958282291355902c727b08064?hp=0000000000000000000000000000000000000000>
at a95df5d20ee28f5958282291355902c727b08064 (commit)
- Log -----------------------------------------------------------------
commit a95df5d20ee28f5958282291355902c727b08064
Author: Karl Williamson <[email protected]>
Date: Wed Sep 5 20:56:09 2012 -0600
utf8.h: Use machine generated IS_UTF8_CHAR()
This takes the output of regen/regcharclass.pl for all the 1-4 byte
UTF8-representations of Unicode code points, and replaces the current
hand-rolled definition there. It does this only for ASCII platforms,
leaving EBCDIC to be machine generated when run on such a platform.
I would rather have both versions regenerated each time they are
needed, to avoid an EBCDIC dependency, but it takes more than 10 minutes
on my computer to process the 2 billion code points that have to be
checked on ASCII platforms, and currently t/porting/regen.t runs
this program every time; that slowdown would be unacceptable. If
this is ever run under EBCDIC, the macro should be machine computed
(very slowly). So, even though there is an EBCDIC dependency, it has
essentially been solved.
M regen/regcharclass.pl
M utf8.h
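For readers unfamiliar with what such a macro has to decide, here is a
hand-written sketch of a 1-4 byte well-formedness test. It is not the
generated IS_UTF8_CHAR(); the function name utf8_char_len is hypothetical,
and the sketch assumes the strict Unicode rules (no overlong forms, no
surrogates, nothing above U+10FFFF), which may differ from the set of
sequences the generated macro accepts.

```c
#include <stddef.h>

/* A continuation byte has the bit pattern 10xxxxxx (0x80 .. 0xBF). */
static int is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper, not Perl's IS_UTF8_CHAR(): returns the length in
 * bytes (1..4) of the well-formed UTF-8 character starting at s, or 0 if
 * the bytes in [s, e) do not begin with one. */
static size_t utf8_char_len(const unsigned char *s, const unsigned char *e)
{
    size_t avail = (size_t)(e - s);

    if (avail == 0)
        return 0;

    if (s[0] < 0x80)                        /* 1 byte: ASCII */
        return 1;

    if (s[0] >= 0xC2 && s[0] <= 0xDF)       /* 2 bytes; 0xC0, 0xC1 are overlong */
        return (avail >= 2 && is_cont(s[1])) ? 2 : 0;

    if (s[0] >= 0xE0 && s[0] <= 0xEF) {     /* 3 bytes */
        if (avail < 3 || !is_cont(s[1]) || !is_cont(s[2]))
            return 0;
        if (s[0] == 0xE0 && s[1] < 0xA0)    /* overlong form */
            return 0;
        if (s[0] == 0xED && s[1] > 0x9F)    /* surrogate range */
            return 0;
        return 3;
    }

    if (s[0] >= 0xF0 && s[0] <= 0xF4) {     /* 4 bytes */
        if (avail < 4 || !is_cont(s[1]) || !is_cont(s[2]) || !is_cont(s[3]))
            return 0;
        if (s[0] == 0xF0 && s[1] < 0x90)    /* overlong form */
            return 0;
        if (s[0] == 0xF4 && s[1] > 0x8F)    /* above U+10FFFF */
            return 0;
        return 4;
    }

    return 0;                               /* stray continuation or invalid lead byte */
}
```

A caller would pass a pointer to the first byte and a pointer one past the
end of the buffer, and treat a return of 0 as "malformed or truncated".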
commit 0fe2099a56866e2d01f852e23c94b5bebe22b536
Author: Karl Williamson <[email protected]>
Date: Wed Sep 5 20:48:15 2012 -0600
regen/regcharclass.pl: Add ability to restrict platforms
This adds the capability to skip definitions if they are for other than
a desired platform.
M regen/regcharclass.pl
commit fd07688fb0f9a5568799fb908f6578264c1c9118
Author: Karl Williamson <[email protected]>
Date: Wed Sep 5 20:32:29 2012 -0600
utf8.h: Remove some EBCDIC dependencies
regen/regcharclass.pl has been enhanced in previous commits so that it
generates as good code as these hand-defined macro definitions for
various UTF-8 constructs. And, it should be able to generate EBCDIC
ones as well. By using its definitions, we can remove the EBCDIC
dependencies for them. It is quite possible that the EBCDIC versions
were wrong, since they have never been tested. Even if
regcharclass.pl has bugs under EBCDIC, it is easier to find and fix
those in one place than in all the sundry definitions.
M regcharclass.h
M regen/regcharclass.pl
M utf8.h
commit d5d6537b407a7bfd9e47203410bd8cf5a6e5c71c
Author: Karl Williamson <[email protected]>
Date: Wed Sep 5 15:18:09 2012 -0600
regen/regcharclass.pl: Add optimization
On UTF-8 input known to be valid, continuation bytes must be in the
range 0x80 .. 0xBF. Therefore, any tests for being within those bounds
will always be true, and may be omitted.
M regcharclass.h
M regen/regcharclass.pl
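As an illustration of the idea, consider a made-up class consisting of the
code points U+0080 .. U+00BF, whose UTF-8 forms on an ASCII platform are
0xC2 followed by any continuation byte (0x80 .. 0xBF). The two function
names below are hypothetical and are not output of regcharclass.pl; the
sketch only shows why the second-byte bounds test is redundant once the
input is known to be valid UTF-8.

```c
/* Test for arbitrary bytes: the second byte must be checked, because it
 * might not be a continuation byte at all. */
static int in_class_checked(const unsigned char *s)
{
    return s[0] == 0xC2 && s[1] >= 0x80 && s[1] <= 0xBF;
}

/* Test for input already known to be valid UTF-8: every byte following a
 * 0xC2 lead byte is necessarily in 0x80 .. 0xBF, so that part of the
 * condition is always true and is dropped. */
static int in_class_valid_input(const unsigned char *s)
{
    return s[0] == 0xC2;
}
```

On every valid two-byte sequence the two functions agree; they differ only
on malformed input, which the second one is not meant to be given.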
commit 277bb7bc224e2745df60a1c993938f181e16af1c
Author: Karl Williamson <[email protected]>
Date: Wed Sep 5 15:14:59 2012 -0600
regen/regcharclass.pl: White-space only
Indent a newly-formed block
M regen/regcharclass.pl
commit 8567691dd1c94c4acb7e0d6f426d0d0c6ab8ad3f
Author: Karl Williamson <[email protected]>
Date: Wed Sep 5 15:00:52 2012 -0600
regen/regcharclass.pl: Extend previously added optimization
A previous commit added an optimization to save a branch in the
generated code at the expense of an extra mask when the input class has
certain characteristics. This extends that to the case where
sub-portions of the class have similar characteristics. The first
optimization for the entire class is moved to right before the new loop
that checks each range in it.
M regcharclass.h
M regen/regcharclass.pl
commit 242b6b634241288353c94c868e2a1f813f55684c
Author: Karl Williamson <[email protected]>
Date: Wed Sep 5 09:30:34 2012 -0600
regen/regcharclass.pl: Rmv always true components from gen'd macro
This adds a test and returns 1 from a subroutine if the condition will
always match; and in the caller it adds a check for that, and omits the
condition from the generated macro.
M regen/regcharclass.pl
commit 38877c8ac730d93fb5554da543c322d28d6822ae
Author: Karl Williamson <[email protected]>
Date: Tue Sep 4 14:54:26 2012 -0600
regen/regcharclass.pl: Add an optimization
Branches can be eliminated from the macros that are generated here
by using a mask in cases where applicable. This adds checking to see if
this optimization is possible, and applies it if so.
M regcharclass.h
M regen/regcharclass.pl
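One situation where a mask can replace a branch is when the two values
being tested differ in exactly one bit. The sketch below assumes that
case; the function names are hypothetical, and the commit itself may
detect other patterns as well.

```c
/* Two comparisons joined by ||, which typically compiles to a branch. */
static int is_a_or_A_branchy(unsigned char c)
{
    return c == 0x41 || c == 0x61;      /* 'A' or 'a' in ASCII */
}

/* 0x41 and 0x61 differ only in bit 0x20. Clearing that bit first lets a
 * single comparison match both values and nothing else. */
static int is_a_or_A_masked(unsigned char c)
{
    return (c & 0xDF) == 0x41;
}
```

The masked form costs one extra AND but needs only one comparison, which
is the trade-off the commit message describes.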
commit cda7fef31fe67f3a27e8be9e3b4aef548ba40c6e
Author: Karl Williamson <[email protected]>
Date: Wed Sep 5 10:26:22 2012 -0600
regen/regcharclass.pl: Rename a variable
I find it confusing that the array element name is the same as the
name of the full array.
M regen/regcharclass.pl
commit e87b0ad4364f86bf05ae633406a640d50f03e448
Author: Karl Williamson <[email protected]>
Date: Tue Sep 4 14:12:13 2012 -0600
regen/regcharclass.pl: Pass options deeper into call stack
This is to prepare for future commits which will act differently at the
deep level depending on some of the options.
M regen/regcharclass.pl
commit 4ce32977781f8bff1db2b7d0b2dae7f77b781a36
Author: Karl Williamson <[email protected]>
Date: Mon Sep 3 16:59:09 2012 -0600
XXX Benchmark: pp.c: Use macro not swash for utf8 quotemeta
The rules for matching whether an above-Latin1 code point is to be
quoted are now saved in a macro generated from a trie by
regen/regcharclass.pl, and these are now used by pp.c to test these
cases. This allows removal of a wrapper subroutine, and also there is
no need for dynamic loading at run-time into a swash.
This macro is about as big as I'm comfortable compiling in, but the
savings of a hash and the removed subroutine and interpreter variables
make it a wash I suspect, without checking.
M embed.fnc
M embed.h
M embedvar.h
M intrpvar.h
M pp.c
M proto.h
M regcharclass.h
M regen/regcharclass.pl
M sv.c
M utf8.c
commit 589296ec5a664e51b5d1fd0e2dbc039bb5b300fe
Author: Karl Williamson <[email protected]>
Date: Mon Sep 3 16:54:56 2012 -0600
regen/regcharclass.pl: Add new output macro type
The new type 'high' is used only on above-Latin1 code points. It is
designed for code that already knows the tested code point is not
Latin1, and avoids unnecessary tests.
M regen/regcharclass.pl
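A rough sketch of the difference, using a made-up class {U+00A0, U+1680,
U+2028} and hypothetical macro names. For simplicity it compares code
point values rather than encoded bytes, so it is only an analogy for what
the generated macros do.

```c
/* General form: must also test the Latin1 member of the class. */
#define is_EXAMPLE_cp(cp) \
    ( (cp) == 0xA0 || (cp) == 0x1680 || (cp) == 0x2028 )

/* 'high' form: the caller guarantees cp > 0xFF, so the test for 0xA0
 * can never succeed and is left out. */
#define is_EXAMPLE_cp_high(cp) \
    ( (cp) == 0x1680 || (cp) == 0x2028 )
```

For any code point above 0xFF the two macros give the same answer; the
'high' form is simply wrong to call on a Latin1 value.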
commit 888fa0d7c5a147d0dbd1e47bacba789b808bb298
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 18:29:42 2012 -0600
regen/regcharclass.pl: Add documentation
M regen/regcharclass.pl
commit 3ddd3ed4b7b836ca88beec802575874c862fe355
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 18:28:19 2012 -0600
regen/regcharclass.pl: Error check input better
This makes sure that the modifiers specified in the input are known to
the program.
M regen/regcharclass.pl
commit ae0eedeb35ba617b8e99646315aa7fef378df479
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 16:48:14 2012 -0600
regen/regcharclass.pl: Allow comments in input
Lines whose first non-blank character is a '#' are now considered to be
comments, and ignored. This allows the moving of some lines that have
been commented out back to after the __DATA__ where they really belong.
M regen/regcharclass.pl
commit 36c10281e3a3c33dc7ac13129938b1298ba468c3
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 15:58:41 2012 -0600
regen/unicode_constants.pl: Add name parameter
A future commit will want to use the first surrogate code point's UTF-8
value. Add this to the generated macros, and give it a name, since
there is no official one. The program has to be modified to cope with
this.
M regen/unicode_constants.pl
M unicode_constants.h
commit 60644b22a741cb42269b33a068369401b5bef1b2
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 15:29:32 2012 -0600
Move 2 functions from utf8.c to regexec.c
One of these functions is currently commented out. The other is called
only in regexec.c in one place, and was recently revised to no longer
require the static function in utf8.c that it formerly called. They can
be made static inline.
M embed.fnc
M embed.h
M proto.h
M regexec.c
M utf8.c
commit e967b2c98113fb54d4529a6dc355026c0fe5fcc1
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 14:46:38 2012 -0600
XXX benchmarks regexec.c: Use new macros instead of swashes
A previous commit has caused macros to be generated that will match
Unicode code points of interest to the \X algorithm. XXX
M embed.fnc
M embed.h
M embedvar.h
M intrpvar.h
M proto.h
M regen/unicode_constants.pl
M regexec.c
M sv.c
M unicode_constants.h
M utf8.c
commit c84b2290e63ac2cfaac978e1a18e9a3faf0b8c08
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 14:31:59 2012 -0600
regen/regcharclass.pl: Generate macros for \X processing
\X is implemented in regexec.c as a complicated series of property
look-ups. It turns out that many of those are for just a few code
points, and so can be more efficiently implemented with a macro than a
swash. This generates those.
M regcharclass.h
M regen/regcharclass.pl
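To show why a macro can beat a table look-up for a tiny class, here is a
sketch for a class of just two code points, U+200C and U+200D, whose UTF-8
forms are E2 80 8C and E2 80 8D. The class and the macro name are chosen
for illustration only and are not necessarily among those this commit
generates.

```c
/* Hypothetical macro: true if the three bytes at s encode U+200C or
 * U+200D. The first two bytes are shared, so only the last one varies.
 * The && chain short-circuits, so later bytes are read only when the
 * earlier ones match. */
#define is_ZW_utf8_sketch(s)                                   \
    (    ((const unsigned char *)(s))[0] == 0xE2               \
      && ((const unsigned char *)(s))[1] == 0x80               \
      && (   ((const unsigned char *)(s))[2] == 0x8C           \
          || ((const unsigned char *)(s))[2] == 0x8D) )
```

A handful of byte comparisons like this involves no hash or table access
at run time.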
commit b503148c4fb55216e628afc42b14deef05b721ab
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 14:26:20 2012 -0600
regen/regcharclass.pl: Change to work on an empty class
Future commits will add Unicode properties for this to generate macros,
and some of them may be empty in some Unicode releases. This just
causes such a generated macro to evaluate to 0.
M regen/regcharclass.pl
commit 25377a052ced77332eaa6ffd5e8c9f910d6628c0
Author: Karl Williamson <[email protected]>
Date: Fri Aug 31 17:04:30 2012 -0600
regen/regcharclass.pl: Fix bug for character '0'
The character '0' could be omitted from some generated macros because
the code tested the value of a hash entry (getting 0, which is false)
instead of testing whether the entry exists.
M regen/regcharclass.pl
commit 878d1ee42f154b318b11d689991571f3266191ce
Author: Karl Williamson <[email protected]>
Date: Fri Aug 31 17:00:27 2012 -0600
regen/regcharclass.pl: Work on EBCDIC platforms
This will now automatically generate macros for non-ASCII platforms,
by mapping the Unicode input to native output.
Doing this will allow several cases of EBCDIC dependencies in other code
to be removed, and fixes the bug that this previously had with non-ASCII
platforms.
M regen/regcharclass.pl
commit 8d930aead7dfa95ab62704011bf704bfae3095b2
Author: Karl Williamson <[email protected]>
Date: Mon Sep 3 16:22:32 2012 -0600
regen/regcharclass.pl: Remove Encode:: dependency
Newer options to unpack alleviate the need for Encode, and run faster.
M regen/regcharclass.pl
commit 8a6d228b0b8fca1947c70c60691f5f95ae89dd4a
Author: Karl Williamson <[email protected]>
Date: Fri Aug 31 16:39:31 2012 -0600
regen/regcharclass.pl: Handle ranges, \p{}
Instead of having to list all code points in a class, you can now use
\p{} or a range.
This changes some classes to use the \p{}, so that any changes Unicode
makes to the definitions don't have to manually be done here as well.
M regcharclass.h
M regen/regcharclass.pl
commit ce47d3dd885f6ed91eca62b8d62cbbfff95268c6
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 13:09:48 2012 -0600
utf8.h: Save a branch in a macro
By adding a mask, we can save a branch. The two expressions match the
exact same code points.
M utf8.h
commit 017bf4ad71fcdd2b3cb06f9010bba23e83e07ddc
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 13:08:21 2012 -0600
utf8.h: White-space only
This reflows some lines to fit into 80 columns
M utf8.h
commit 58f8427dc8cf4200faf8c1d3dfa99fd6b950f52d
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 13:01:50 2012 -0600
utf8.h: Correct improper EBCDIC conversion
These macros were incorrect for EBCDIC. The relationships are based on
I8, the intermediate-utf8 defined for UTF-EBCDIC, not the final encoding.
I was the culprit who did this originally; I was confused by the names of
the conversion macros. I'm adding names that are clearer to me, which
have already been defined in utfebcdic.h, but weren't defined for
non-EBCDIC platforms.
M utf8.h
commit 0c56998dfe202cd3ab120cab7174cd7a07669511
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 10:37:26 2012 -0600
ext/B/B.xs: Remove EBCDIC dependency
These are unnecessary EBCDIC dependencies: the code uses isPRINT() on
EBCDIC and an expression on ASCII, but isPRINT() is defined to be
precisely that expression on ASCII platforms.
M ext/B/B.xs
commit a6d19a8d4ca6e4ef9e7d28564d6f44d4d50160a7
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 10:30:32 2012 -0600
Remove some EBCDIC dependencies
A new regen'd header file has been created that contains the native
values for certain characters. By using those macros, we can eliminate
EBCDIC dependencies.
M perl.h
M utf8.h
M utfebcdic.h
M x2p/a2py.c
commit 4c4fc299e55b9bbf9a4bba6f4b7ed40926a04175
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 09:58:43 2012 -0600
Rename regen'd hdr to reflect expanded capabilities
The recently added utf8_strings.h has been expanded to include more than
just strings. I'm renaming it to avoid confusion.
M MANIFEST
M regcomp.c
A regen/unicode_constants.pl
D regen/utf8_strings.pl
M regexec.c
A unicode_constants.h
D utf8_strings.h
commit 6f7dac32304a0dbb7a416c13b03638941acb9625
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 09:44:22 2012 -0600
regen/utf8_strings.pl: Add ability to get native charset
This adds a new capability to this program: to input a Unicode code point
and create a macro that expands to the platform's native value for it.
This will allow removal of a bunch of EBCDIC dependencies in the core.
M regen/utf8_strings.pl
M utf8_strings.h
commit 3efcf089ad9b591b6e4e48609dad192004177a38
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 09:28:55 2012 -0600
regen/utf8_strings.pl: Allow explicit default on input
An input line without a command is considered to be a request for the
UTF-8 encoded string of the code point. This allows an explicit
'string' to be used.
M regen/utf8_strings.pl
commit e04f6576857c1bf441098c45e476adddf52dc866
Author: Karl Williamson <[email protected]>
Date: Sun Sep 2 09:22:16 2012 -0600
regen/utf8_strings.pl: Copy empty input lines to output
This allows the generated .h to look better.
M regen/utf8_strings.pl
M utf8_strings.h
commit c70734392ab78654ad68fd357069ee7ef5fd4dff
Author: Karl Williamson <[email protected]>
Date: Fri Aug 31 17:41:14 2012 -0600
/regcharclass.pl, utf8_strings.pl: Add guard to .h
Future commits will have other headers #include the headers generated by
these programs. It is best to guard against the preprocessor trying to
process these twice.
M regcharclass.h
M regen/regcharclass.pl
M regen/utf8_strings.pl
M utf8_strings.h
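For reference, the usual form of such a guard looks like the sketch below.
The guard name H_EXAMPLE_GENERATED and the function are hypothetical, not
the names used in the generated headers. The guarded text is written out
twice in one file to simulate the header being #included twice.

```c
/* --- first inclusion: the guard macro is not yet defined --- */
#ifndef H_EXAMPLE_GENERATED
#define H_EXAMPLE_GENERATED 1

static int example_is_digit(int c)
{
    return c >= '0' && c <= '9';
}

#endif /* H_EXAMPLE_GENERATED */

/* --- second inclusion: the guard is defined, so this copy is skipped.
 * Without the guard, the compiler would reject the second definition of
 * example_is_digit(). --- */
#ifndef H_EXAMPLE_GENERATED
#define H_EXAMPLE_GENERATED 1

static int example_is_digit(int c)
{
    return c >= '0' && c <= '9';
}

#endif /* H_EXAMPLE_GENERATED */
```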
commit 1a4dac99bc47c484cf139b1b77e1ab3904a81c4f
Author: Karl Williamson <[email protected]>
Date: Fri Aug 31 17:39:04 2012 -0600
Unicode/UCD.pm: Clarify pod
M lib/Unicode/UCD.pm
commit 281acc57cc9deed1351e4044c9aeffa680abf79a
Author: Karl Williamson <[email protected]>
Date: Tue Aug 28 17:41:41 2012 -0600
Fix \X handling for Unicode 5.1 - 6.0
Commit 27d4fc33343f0dd4287f0e7b9e6b4ff67c5d8399 neglected to include a
change required for a few Unicode releases where the \X prepend property
is not empty. This does that, and suppresses a mktables warning for
Unicode releases prior to 6.2.
M lib/unicore/mktables
M regexec.c
-----------------------------------------------------------------------
--
Perl5 Master Repository