In perl.git, the branch smoke-me/khw-invlist has been created
<http://perl5.git.perl.org/perl.git/commitdiff/872652fefc7911c8a78fbb00327745dc1ba12d3e?hp=0000000000000000000000000000000000000000>
at 872652fefc7911c8a78fbb00327745dc1ba12d3e (commit)
- Log -----------------------------------------------------------------
commit 872652fefc7911c8a78fbb00327745dc1ba12d3e
Author: Karl Williamson <[email protected]>
Date: Tue Dec 6 11:46:02 2011 -0700
more squishing
M regcomp.c
commit a5dbd8cfd4d7f9d8c5c9a63e728f3271cdf01088
Author: Karl Williamson <[email protected]>
Date: Tue Dec 6 10:53:16 2011 -0700
more temporary commit
M regcomp.c
commit e0d26c18c4bc45a440f67b8d13fbffe6dbccc9e8
Author: Karl Williamson <[email protected]>
Date: Tue Dec 6 10:44:27 2011 -0700
different squish white space
M regcomp.c
commit 2b722e086ed5e76eb99b83a8bd4723646b067d47
Author: Karl Williamson <[email protected]>
Date: Tue Dec 6 10:42:27 2011 -0700
squisch white space
M regcomp.c
commit 81026b2b306f2e57222eb7b44e99f4be21461e87
Author: Karl Williamson <[email protected]>
Date: Tue Dec 6 10:27:53 2011 -0700
squish safefree
M regcomp.c
commit cbbfd4e27d722552ee9c8e4f27c13ad492791105
Author: Karl Williamson <[email protected]>
Date: Tue Dec 6 09:37:02 2011 -0700
regcomp.c: see if avoids embed_thr bug
M regcomp.c
commit eb6fcaf5a6b64d7752d38e24c452bbee884cc414
Author: Karl Williamson <[email protected]>
Date: Tue Dec 6 09:21:21 2011 -0700
squish
M regcomp.c
commit 77465be60452670fa2e9bfc0007a2433b4f78f2f
Author: Karl Williamson <[email protected]>
Date: Tue Nov 29 17:54:11 2011 -0700
utf8.c: Silence compiler warnings
These variables aren't used uninitialized in these functions, but the
logic might change in the future without considering all the
consequences, so silence the warnings.
M utf8.c
commit 02031aaf10412a36c281a005a45d6fbecf82aa53
Author: Karl Williamson <[email protected]>
Date: Tue Nov 29 14:57:02 2011 -0700
regcomp.c: Don't read beyond input
This code was assuming that there were several more bytes in the input
stream, when there may not be. This was discovered by valgrind.
M regcomp.c
commit 8791b06c6734cdce650298c7561d5438b533e6b1
Author: Karl Williamson <[email protected]>
Date: Mon Nov 28 12:32:02 2011 -0700
regcomp.c: Optimize a single Unicode property in a [character class]
All Unicode properties actually turn into bracketed character classes,
whether explicitly done or not. A swash is generated for each property
in the class. If that is the only thing not in the class's bitmap, it
specifies completely the non-bitmap behavior of the class, and can be
passed explicitly to regexec.c. This avoids having to regenerate the
swash. It also means that the same swash is used for multiple instances
of a property. And that means the number of duplicated data structures
is greatly reduced. This currently doesn't extend to cases where
multiple Unicode properties are used in the same class
[\p{greek}\p{latin}] will not share the same swash as another character
class with the same components. This is because I don't know of a
an efficient method to determine if a new class being parsed has the
same components as one already generated. I suppose some sort of
checksum could be generated, but that is for future consideration.
M regcomp.c
M t/lib/warnings/utf8
M t/uni/cache.t
commit f83a1595024426a9bf4fbb7e62afeb7566bade86
Author: Karl Williamson <[email protected]>
Date: Mon Nov 28 10:26:28 2011 -0700
Move Unicode property defn processing to compile time
This patch moves the processing of most Unicode property definitions
from execution (regexec.c) to compilation (regcomp.c). There is a cost
to do this. By deferring it to execution, it may be that the affected
patch will never be taken, and hence the work won't have to be done;
whereas, it's always done if it gets done at compilation.
However, doing it at compilation, has many advantages. We can't
optimize what we don't know about, so this allows for better
optimization, as well as feature enhancements, such as set
manipulations, restricting matches to certain scripts, etc. A big one,
about to be committed allows for significantly reducing the number of
copies of the data structure used for each property. (Currently, every
mention in every regular expression of a given property will generate a
new instance of its hash, and so results of look-ups of code points in
one instance aren't automatically known to other instances, so the code
point has to be looked-up again.)
This commit leaves the processing to execution time when the class is to
be inverted. This was done purely to make the commit smaller, and will
be removed in a future commit; hence the redundant test here will be
removed shortly.
It also has to leave to execution time processing of properties whose
definition is not known yet. That can happen when the property is
user-defined. We call _core_swash_init(), and if it fails, we assume
that it's because it's such a property, and if it turns out that it was
an unknown property, we leave to execution time the raising of a warning
for it, just as before.
M regcomp.c
commit 447dae57c9584dd0ce1e1b3f92324652b27c2231
Author: Karl Williamson <[email protected]>
Date: Mon Nov 28 09:43:54 2011 -0700
regcomp.c: Pass inversion list directly to regexec.c
Currently, any generated inversion list is stringified and passed in the
data structure to regexec.c as such. regexec.c then calls
_core_swash_init() to convert it into a swash and back into an inversion
list. This intermediate step is wasteful, and this commit dispenses
with it, based on preparatory commits in regexec.c and utf8.c
M regcomp.c
commit 3c9dfa4f0e6faea4d2830a338f18f5ff9bfb7e8b
Author: Karl Williamson <[email protected]>
Date: Mon Nov 28 09:25:45 2011 -0700
regexec.c: Prepare for inversion lists in ANYOF nodes
Future commits will start passing inversion lists to regexec.c from the
compilation phase. This commit causes regexec.c to accept them, trace
them for debug output, and pass them along to utf8.c
M regexec.c
commit 47cb59a455b68e0cdb741b5e5a37b8adefb9ef4c
Author: Karl Williamson <[email protected]>
Date: Mon Nov 28 09:20:12 2011 -0700
regcomp.c: Add _invlist_contents() to compactly dump inversion list
This will be used in future commits for debug traces
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit 039fbcad597e91322f929bc6cdaa32358334a045
Author: Karl Williamson <[email protected]>
Date: Mon Nov 28 09:00:52 2011 -0700
utf8.c: White-space only
As a result of previous commits adding and removing if() {} blocks,
indent and outdent and reflow comments and statements to not exceed 80
columns.
M utf8.c
commit 7780f365344a744aaca7f78ceb199c7a193c3a7f
Author: Karl Williamson <[email protected]>
Date: Mon Nov 28 08:36:54 2011 -0700
utf8.c: Add ability to pass inversion list to _core_swash_init()
Add a new parameter to _core_swash_init() that is an inversion list to
add to the swash, along with a boolean to indicate if this inversion
list is derived from a user-defined property. This capability will prove
useful in future commits
M embed.fnc
M embed.h
M proto.h
M utf8.c
commit cf9fbaa62d8ebb7bc689b9a61bd7e4bdedb82756
Author: Karl Williamson <[email protected]>
Date: Mon Nov 28 08:24:07 2011 -0700
utf8.c: Add flag to swash_init() to not croak on error
This adds the capability, to be used in future commits, for swash_ini()
to return NULL instead of croaking if it can't find a property, so that
the caller can choose how to handle the situation.
M embed.fnc
M embed.h
M proto.h
M utf8.c
commit f649cf48e78f8391f580f9664f24373cd50b1d75
Author: Karl Williamson <[email protected]>
Date: Sun Nov 27 20:55:33 2011 -0700
regcomp.c: Use '*a == b', not 'a == &b'
The latter doesn't always work. The consequences of this failure were
memory leaks
M regcomp.c
commit 7d99acec7d08718ae4c2d29c5cf5a1c2c1377568
Author: Karl Williamson <[email protected]>
Date: Sun Nov 27 20:53:56 2011 -0700
regcomp.c: decrement ptr ref cnt before invalidating ptr
Otherwise there coul be memory leaks
M regcomp.c
commit bc6fb35b4eabf48168509b437b018a24d9cbbb3a
Author: Karl Williamson <[email protected]>
Date: Sun Nov 27 20:49:59 2011 -0700
regcomp.c: Add some assertions
Subsequent code assumes that these are true
M regcomp.c
commit d733ca4e637b862a05ee0436cf7dc966a904b99e
Author: Karl Williamson <[email protected]>
Date: Sun Nov 27 15:45:22 2011 -0700
regcomp.c: Don't overallocate space for cloned SV
The length passed to _new_invlist() is in elements and not bytes, so
this was overallocating space because the number of bytes is multiplied
by a platform-dependent value.
M regcomp.c
commit 88832752f4ce21b7b886742e27184ab0061b1e92
Author: Karl Williamson <[email protected]>
Date: Sun Nov 27 15:41:26 2011 -0700
regcomp.c: Make sure invlist_clone length set correctly
The cloned inversion list was getting initialized with sufficient space,
but because a Copy was used, it did not know how much of that space is
occupied. There are no tests, because this was found through valgrind,
and it otherwise depends on whatever was in the uninitialized data at
the time
M regcomp.c
commit 0e5fef5eeb2995ecda11ae35e17bad554acc9026
Author: Karl Williamson <[email protected]>
Date: Sun Nov 27 15:34:52 2011 -0700
utf8.c: Prevent reading before buffer start
Make sure there is something before the character being read before
reading it.
M utf8.c
commit 230528e3e39f08f8b8ee80cc286988cc65d70ac4
Author: Karl Williamson <[email protected]>
Date: Sat Nov 26 18:24:46 2011 -0700
Utf8.c: Generate and use inversion lists for binary swashes
Prior to this patch, every time a code point was matched against a swash,
and the result was not previously known, a linear search through the
swash was performed. This patch changes that to generate an inversion
list whenever a swash for a binary property is created. A binary search
is then performed for missing values.
This change does not have much effect on the speed of Perl's regression
test suite, but the speed-up in worst-case scenarios is huge. The
program at the end of this commit is crafted to avoid the caching that
hides much of the current inefficiencies. At character classes of 100
isolated code points, the new method is about an order of magnitude
faster; two orders of magnitude at 1000 code points. The program at the
end of this commit message took 97s to execute on my box using blead,
and 1.5 seconds using this new scheme. I was surprised to see that even
with classes containing fewer than 10 code points, the binary search
trumped, by a little, the linear search
Even after this patch, under the current scheme, one can easily run out
of memory due to the permanent storing of results of swash lookups in
hashes. The new search mechanism might be fast enough to enable the
elimination of that memory usage. Instead, a simple cache in each
inversion list that stored its previous result could be created, and
that checked to see if it's still valid before starting the search,
under the assumption, which the current scheme also makes, that probes
will tend to be clustered together, as nearby code points are often in
the same script.
===============================================
# This program creates longer and longer character class lists while
# testing code points matches against them. By adding or subtracting
# 65 from the previous member, caching of results is eliminated (as of
# this writing), so this essentially tests for how long it takes to
# search through swashes to see if a code point matches or not.
use Benchmark ':hireswallclock';
my $string = "";
my $class_cp = 2**30; # Divide the code space in half, approx.
my $string_cp = $class_cp;
my $iterations = 10000;
for my $j (1..2048) {
# Append the next character to the [class]
my $hex_class_cp = sprintf("%X", $class_cp);
$string .= "\\x{$hex_class_cp}";
$class_cp -= 65;
next if $j % 100 != 0; # Only test certain ones
print "$j: lowest is [$hex_class_cp]: ";
timethis(1, "no warnings qw(portable non_unicode);my \$i = $string_cp;
for (0 .. $iterations) { chr(\$i) =~ /[$string]/; \$i+= 65 }");
$string_cp += ($iterations + 1) * 65;
}
M utf8.c
commit 3102c724f371ed03e1bedaa0a157decaedc09006
Author: Karl Williamson <[email protected]>
Date: Fri Nov 25 12:59:51 2011 -0700
regcomp.c: Add _invlist_populate_swatch()
This function will be used in future commits
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit 7f381b5d2844559a8786bd05e3da23bf76d715ed
Author: Karl Williamson <[email protected]>
Date: Fri Nov 25 12:49:02 2011 -0700
regcomp.c: Add invlist_search()
This function does a binary search on an inversion list. It will be
used in future commits
M embed.fnc
M embed.h
M proto.h
M regcomp.c
commit 2ba7a6f05e6153ec831787bcf843cee3eafe3d70
Author: Karl Williamson <[email protected]>
Date: Wed Nov 23 20:11:26 2011 -0700
utf8.c: Refactor code slightly in prep
Future commits will split up the necessary initialization into two
components. This patch prepares for that without adding anything new.
M utf8.c
commit 30f3f991d7c77604e7b588969917be469894e513
Author: Karl Williamson <[email protected]>
Date: Tue Nov 22 13:37:04 2011 -0700
regcomp.c: Change internal #define name
I have struggled to come up with a good name for this concept; and like
the new one better than the old
M regcomp.c
commit bac03619a9241602e705ed528d2c460968d4ed22
Author: Karl Williamson <[email protected]>
Date: Tue Nov 22 12:06:41 2011 -0700
utf8.c: New function to retrieve non-copy of swash
Currently, swash_init returns a copy of the swash it finds. The core
portions of the swash are read-only, and the non-read-only portions are
derived from them. When the value for a code point is looked up, the
results for it and adjacent code points are stored in a new element,
so that the lookup never has to be performed again. But since a copy is
returned, those results are stored only in the copy, and any other uses
of the same logical stash don't have access to them, so the lookups have
to be performed for each logical use.
Here's an example. If you have 2 occurrences of /\p{Upper}/ in your
program, there are 2 different swashes created, both initialized
identically. As you start matching against code points, say "A" =~
/\p{Upper}/, the swashes diverge, as the results for each match are
saved in the one applicable to that match. If you match "A" in each
swash, it has to be looked up in each swash, and an (identical) element
will be saved for it in each swash. This is wasteful of both time and
memory.
This patch renames the function and returns the original and not a copy,
thus eliminating the overhead for stashes accessed through the new
interface. The old function name is serviced by a new function which
merely wraps the new name result with a copy, thus preserving the
interface for existing calls.
Thus, in the example above, there is only one swash, and matching "A"
against it results in only one new element, and so the second use will
find that, and not have to go out looking again. In a program with lots
of regular expressions, the savings in time and memory can be quite
large.
The new name is restricted to use only in regcomp.c and utf8.c (unless
XS code cheats the preprocessor), where we will code so as to not
destroy the original's data. Otherwise, a change to that would change
the definition of a Unicode property everywhere in the program.
Note that there are no current callers of the new interface; these will
be added in future commits.
M embed.fnc
M embed.h
M proto.h
M utf8.c
commit 890da47a017e6dc25c4ee91123ff3e14003e1f0a
Author: Karl Williamson <[email protected]>
Date: Tue Nov 22 09:18:28 2011 -0700
utf8_heavy.pl: Add inversion status to cache key
Contrary to what the debug statement said, what is being returned is a
swash, and that swash is different from one that comes from the same
file but differs in inversion, and so changing the INVERT_IT element
messes things up for any existing one. Heretofore it hasn't mattered
because the swash returned is always a copy, and so it actually hasn't
created any problems. But future commits will stop the copying, so this
would create problems then.
The file will now have to be re-'do'ne to get an inverted list from it.
M lib/utf8_heavy.pl
commit f741aeebf37688d1ca136b3c0fa4c4079cb53f2d
Author: Karl Williamson <[email protected]>
Date: Sun Nov 20 19:10:19 2011 -0700
uni/cache.t: Fix to handle regex compile time Uni props
Future commits are planned to move the resolution of Unicode properties
from regex execution time to compile time. By moving the code into a
BEGIN block, this .t can now handle both types. Before this patch, it
wouldn't show any activity at all if things are done at compile time.
M t/uni/cache.t
commit a5c941a76d99bf2e5d6732290c7a5901b2447d96
Author: Karl Williamson <[email protected]>
Date: Sun Nov 20 09:03:26 2011 -0700
embed.fnc: swash_init() return value should not be ignored
Otherwise can have memory leaks
M embed.fnc
M proto.h
commit 7f2f6f677be0834c266c5e955881903cc0d5323e
Author: Karl Williamson <[email protected]>
Date: Sat Nov 19 16:50:33 2011 -0700
utf8.c: Change name of static function
This function has always confused me, as it doesn't return a swash, but
a swatch.
M embed.fnc
M embed.h
M lib/utf8_heavy.pl
M proto.h
M utf8.c
commit a8560a7ff321b695d72b1a905a6bd39a5b45d6f5
Author: Karl Williamson <[email protected]>
Date: Sat Nov 19 14:49:20 2011 -0700
utf8_heavy: Allow to be called from regcomp.c
Future commits will cause regcomp.c to try to compile user-defined
properties. The caller stack is different for this, and there may be a
package name as well that differs from the existing scheme. This commit
allows for this.
M lib/utf8_heavy.pl
commit 5f451ceeb13dbd70ec6e689e72f340849e33ba22
Author: Karl Williamson <[email protected]>
Date: Sat Nov 19 14:47:33 2011 -0700
utf8_heavy: Add DEBUG statement
This helps keep track of the recursion used.
M lib/utf8_heavy.pl
commit f2a3b6a8ae820ed7ab2591d8dfb15ea1f0bbfc83
Author: Karl Williamson <[email protected]>
Date: Sat Nov 19 14:41:26 2011 -0700
utf8.c: Move test out of loops
We set the upper limit of the loops before entering them to the min of
the two possible limits, thus avoiding a test each time through
M utf8.c
commit 2d8d82cb26024b2f5fe0426c6805ac4f178a325f
Author: Karl Williamson <[email protected]>
Date: Sat Nov 19 14:37:35 2011 -0700
mktables: Add a little stress to the tests
This simply reverses the sort order so that the generated tests
use the highest ranges instead of the lowest, making it less likely that
tests will pass by chance; and also increasing performance issues in
finding matches.
M lib/unicore/mktables
commit 10af155cdec6bf0115d5c625e843d630351cd3eb
Author: Karl Williamson <[email protected]>
Date: Sat Nov 19 14:22:00 2011 -0700
utf8_heavy: Skip unnecessary operations
The mktables generated tables are well-formed, already sorted, and with
no extra items such as "+utf8::foo". Thus we don't have to do these
operations on them, but they are required on user-defined properties,
and should $list be passed in as a parameter.
This patch moves the code that does this to just the user-defined area
M lib/utf8_heavy.pl
commit 18df1d63ae1ea9ac4abf3ac099d8986334d370be
Author: Karl Williamson <[email protected]>
Date: Sat Nov 19 11:17:17 2011 -0700
uni/class.t: Add test
This new test makes sure that a regular expression that forward
references a user-defined property works.
M t/uni/class.t
commit 2caff13f4dd75c306d855c6d118db07a22e3e94b
Author: Karl Williamson <[email protected]>
Date: Fri Nov 18 10:22:56 2011 -0700
utf8_heavy: remove unused variable
M lib/utf8_heavy.pl
commit ecbd26fb74467f76bcff4c91ed9176bd913ef970
Author: Karl Williamson <[email protected]>
Date: Fri Nov 18 08:36:43 2011 -0700
Comment additions, typos, white-space.
M lib/unicore/mktables
M lib/utf8_heavy.pl
M regcomp.c
M regcomp.h
M utf8.c
commit a0c0a17827a358813012807f994844f593691971
Author: Karl Williamson <[email protected]>
Date: Wed Nov 16 20:03:35 2011 -0700
regexec.c: Add some comments to regclass_swash()
M regexec.c
commit 353fd81465e44bfd8164130e439684d68657f7d4
Author: Karl Williamson <[email protected]>
Date: Wed Nov 16 09:45:07 2011 -0700
regexec.c: Remove unnecessary intermediate values
M regexec.c
-----------------------------------------------------------------------
--
Perl5 Master Repository