In perl.git, the branch smoke-me/khw-anyofr has been created
<https://perl5.git.perl.org/perl.git/commitdiff/3f671efa445b7c17c8b545f96d2ad6e011eac273?hp=0000000000000000000000000000000000000000>
at 3f671efa445b7c17c8b545f96d2ad6e011eac273 (commit)
- Log -----------------------------------------------------------------
commit 3f671efa445b7c17c8b545f96d2ad6e011eac273
Author: Karl Williamson <[email protected]>
Date: Thu Sep 19 16:22:19 2019 -0600
Add ANYOFRb regnode
This is like the ANYOFR regnode added in the previous commit, but all
code points in the range it matches are known to have the same first
UTF-8 start byte. That means it can't match UTF-8 invariant characters,
like ASCII, because the "start" byte is different on each one, so it
could only match a range of 1, and the compiler wouldn't generate this
node for that; it would use an EXACT instead.
Pattern matching can rule out most code points by looking at the first
character of their UTF-8 representation, before having to convert from
UTF-8.
On ASCII platforms, this simple comparison rules out all but 64 of the
2-byte UTF-8 characters. For 3-byte characters it leaves up to 4096
candidates, and for 4-byte ones, 2**18, so the test is less effective
for higher code points.
I believe that most UTF-8 patterns that otherwise would compile to
ANYOFR will instead compile to this, as I can't envision real life
applications wanting to match large single ranges. Even the 2048
surrogates all have the same first byte.
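The counts above follow from each UTF-8 continuation byte carrying 6 bits. A minimal sketch of that arithmetic and of the "same start byte" condition, assuming standard UTF-8 on an ASCII platform; the function names are hypothetical and are not taken from regcomp.c:

```c
#include <stdbool.h>
#include <stdint.h>

/* First byte of the standard UTF-8 encoding of cp (cp <= 0x10FFFF). */
static uint8_t utf8_start_byte(uint32_t cp)
{
    if (cp < 0x80)    return (uint8_t)cp;
    if (cp < 0x800)   return (uint8_t)(0xC0 | (cp >> 6));
    if (cp < 0x10000) return (uint8_t)(0xE0 | (cp >> 12));
    return (uint8_t)(0xF0 | (cp >> 18));
}

/* Hypothetical check: could [lowest, highest] use a same-start-byte
 * node?  UTF-8 start bytes never decrease as code points increase, so
 * comparing the two ends covers everything in between.  Invariant
 * (ASCII) characters are excluded, as the commit message describes. */
static bool range_shares_start_byte(uint32_t lowest, uint32_t highest)
{
    return lowest >= 0x80
        && lowest <= highest
        && utf8_start_byte(lowest) == utf8_start_byte(highest);
}

/* Code points sharing one start byte, for a sequence length of 2..4:
 * each continuation byte contributes 6 bits. */
static uint32_t code_points_per_start_byte(unsigned len)
{
    return UINT32_C(1) << (6 * (len - 1));
}
```

Under these assumptions the surrogates U+D800..U+DFFF all begin with 0xED, so range_shares_start_byte(0xD800, 0xDFFF) is true, while U+0100..U+0140 crosses from start byte 0xC4 to 0xC5 and is not.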
commit d23c7575b36dd35bec17947b835aeb878dd8e36b
Author: Karl Williamson <[email protected]>
Date: Thu Sep 19 16:05:06 2019 -0600
Add ANYOFR regnode
This matches a single range of code points. It is both faster and
smaller than other ANYOF-type nodes, requiring, after set-up, a single
subtraction and conditional branch.
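The "single subtraction and conditional branch" is presumably the usual unsigned range-check idiom. A minimal sketch, assuming the node's range is held as a lowest code point plus a delta; in_single_range is a hypothetical name, not the code in regexec.c:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if cp lies in [lowest, lowest + delta].  With unsigned
 * arithmetic, any cp below 'lowest' wraps around to a very large
 * difference, so one comparison checks both ends of the range. */
static bool in_single_range(uint32_t cp, uint32_t lowest, uint32_t delta)
{
    return (uint32_t)(cp - lowest) <= delta;
}
```

For [ij], lowest would be 'i' and delta 1, so 'h' and 'k' are both rejected by the same comparison.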
The vast majority of Unicode properties match a single range, though
most of these are not likely to be used in real world applications. But
things like [ij] are a single range, and those are quite commonly
encountered. This matches them more efficiently than a bitmap would,
and doesn't require the space for one either.
The flags field is used to store the minimum matchable start byte for
UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH
nodes which have the same mechanism, allows for quick weeding out of
many possible matches without having to convert the UTF-8 to its
corresponding code point.
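The start-byte filter works because UTF-8 preserves code point order byte by byte: no code point at or above the range's minimum can begin with a byte smaller than the first byte of that minimum. A minimal sketch, assuming standard UTF-8; may_match_utf8 is an illustrative name, not a function in regexec.c:

```c
#include <stdbool.h>
#include <stdint.h>

/* Quick reject: 's' points at the first byte of a UTF-8 character and
 * 'min_start_byte' is the first byte of the encoding of the lowest
 * code point the node matches (what the flags field is said to hold).
 * A false result means no match is possible; a true result means the
 * character still has to be decoded and range-checked. */
static bool may_match_utf8(const uint8_t *s, uint8_t min_start_byte)
{
    return s[0] >= min_start_byte;
}
```

For a range starting at U+0100, whose encoding begins with 0xC4, the character U+00E9 (bytes 0xC3 0xA9) is ruled out from its first byte alone.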
This regnode packs the 32-bit argument with 20 bits for the minimum code
point the node matches, and 12 bits for the size of the range above
that minimum. Ranges that don't fit in those fields simply won't compile
to this regnode, instead going to one of the ANYOFH flavors. This is
sufficient to match all of Unicode except for the final (private use)
65K plane.
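A sketch of how such a 20/12 split could be packed and unpacked. Which field occupies the low bits is an assumption made here for illustration, and all names are hypothetical rather than the macros in regcomp.h:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BASE_BITS 20u
#define BASE_MAX  ((UINT32_C(1) << BASE_BITS) - 1)  /* 0xFFFFF */
#define DELTA_MAX ((UINT32_C(1) << 12) - 1)         /* 0xFFF   */

/* Can [lowest, highest] be represented in the 20 + 12 bit layout? */
static bool range_fits(uint32_t lowest, uint32_t highest)
{
    return lowest <= highest
        && lowest <= BASE_MAX
        && highest - lowest <= DELTA_MAX;
}

/* Assumed layout: lowest code point in the low 20 bits, delta in the
 * high 12 bits. */
static uint32_t pack_range(uint32_t lowest, uint32_t highest)
{
    assert(range_fits(lowest, highest));
    return ((highest - lowest) << BASE_BITS) | lowest;
}

static uint32_t packed_lowest(uint32_t arg) { return arg & BASE_MAX; }
static uint32_t packed_delta(uint32_t arg)  { return arg >> BASE_BITS; }
```

With 20 bits the minimum can be at most 0xFFFFF, which is why a range starting in plane 16 (0x100000 and up) does not fit; the 2048 surrogates fit with a delta of 0x7FF.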
commit 06d19438047bdc7019b9dfc6f7c85382c9c81961
Author: Karl Williamson <[email protected]>
Date: Thu Sep 19 16:04:03 2019 -0600
regexec.c: Rmv some unnecessary casts
The called macro does the cast, and this makes it more legible.
commit 1ba382328f2b84fb0b6e3dec534e4b94a0914a28
Author: Karl Williamson <[email protected]>
Date: Thu Sep 19 16:03:04 2019 -0600
l
commit e307470474b0314590c60ed21896d783b001b75a
Author: Karl Williamson <[email protected]>
Date: Thu Sep 19 15:47:51 2019 -0600
regcomp.c: Use variables initialized to macro results
instead of the macros. This is in preparation for the next commit.
commit c7d40f5b9c1aa5db8324262ebdce428a277b6a0d
Author: Karl Williamson <[email protected]>
Date: Thu Sep 19 14:20:59 2019 -0600
regcomp.c: Add parameter to static function
This further decouples this function from knowing details of the calling
structure, by passing this detail in.
commit 55623ead428e6527bb23531a451909eea8c4249f
Author: Karl Williamson <[email protected]>
Date: Wed Sep 18 13:20:42 2019 -0600
t/re/anyof.t: Add a test
This makes sure a non-folding above-Latin1 character is tested.
commit 3a9c470fcccd0bfdeb09cca96dcab6103741603e
Author: Karl Williamson <[email protected]>
Date: Thu Sep 19 14:38:39 2019 -0600
regcomp.c: Comments/white-space
Included is outdenting code whose enclosing block was removed in the
previous commit.
commit 8f4c71a4d30e80ae80c46877c913cd791ffef5f9
Author: Karl Williamson <[email protected]>
Date: Wed Sep 18 13:12:51 2019 -0600
XXX warning tests,Prefer EXACTish regnodes to ANYOFH nodes
ANYOFH nodes (that match code points above 255) are smaller than regular
ANYOF nodes because they don't have a 256-bit bitmap. But their
disadvantage compared with EXACT nodes is that the characters encountered
must first be converted from UTF-8 to code point. The difference is
less clear-cut with /i, because currently the UTF-8 typically must also
be converted to code point in order to be folded. But the EXACTFish
node doesn't have an inversion list to do lookup in, and occupies
less space, because it doesn't have inversion list data attached to it.
Also there is a bug in using ANYOFH under /l, as wide character warnings
should be emitted if the locale isn't a UTF-8 one.
commit 465d5bf992430128d7977f603e3d215251a2001d
Author: Karl Williamson <[email protected]>
Date: Wed Sep 18 12:45:55 2019 -0600
t/re/anyof.t: Fix highest range tests
Previously we had infinity minus 1, but infinity should be beyond the
range, and the highest isn't infinity - 1, but the highest legal code
point.
commit 744bc2e985b22560a7a742569d2128c31000d21b
Author: Karl Williamson <[email protected]>
Date: Wed Sep 18 12:41:41 2019 -0600
t/re/anyof.t: Remove duplicate test
This is covered by the single code point tests.
commit 12a6b4d307ce737b4515b0714fd0f7d10368f42f
Author: Karl Williamson <[email protected]>
Date: Wed Sep 18 12:34:23 2019 -0600
t/re/anyof.t: Remove invalid test
One shouldn't be able to specify an infinite code point. The tests have
the conceit that one can specify a range's upper limit as infinity, but
that is just shorthand for the range being unbounded.
commit 349aa3cbb3fe93e171b5b00a98365235299a7599
Author: Karl Williamson <[email protected]>
Date: Wed Sep 18 12:31:11 2019 -0600
re/anyof.t: Clarify failing message
When a test fails, an extra test is run to output debugging info; this
will cause the planned number of tests to be wrong, which will output an
extra, confusing message. This adds an explanation that the number is
expected to be wrong, hence not to worry.
commit d3f35546fc92fa86225a23b02d2636977b709c32
Author: Karl Williamson <[email protected]>
Date: Thu Sep 12 20:19:07 2019 -0600
Allow some optimizations of qr/(?[...])/
Prior to this commit, this construct always returned an ANYOF node, even
if it could be optimized into something else.
commit 72cc33a64b846054aa82071093a9d6c5512c1685
Author: Karl Williamson <[email protected]>
Date: Thu May 30 20:57:27 2019 -0600
regcomp.c: Add invlist_lowest()
This function hides the invlist implementation from the calling code,
and will be called in more than one place in the future.
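For readers unfamiliar with inversion lists, a toy illustration of why an accessor like this is cheap and why hiding the layout is useful. This is not Perl's invlist representation; the struct, the names, and the empty-list convention are all assumptions made for the sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy inversion list: a sorted array in which even-indexed elements
 * start a run of code points that are in the set and odd-indexed
 * elements start a run that is not.  {0x69, 0x6B} therefore means
 * "0x69 and 0x6A", i.e. [ij]. */
typedef struct {
    const uint32_t *bounds;
    size_t          len;
} toy_invlist;

/* Lowest code point in the set, or UINT32_MAX if the set is empty
 * (a sentinel chosen for this sketch only).  Callers don't need to
 * know that the answer is simply the first array element. */
static uint32_t toy_invlist_lowest(const toy_invlist *list)
{
    return list->len == 0 ? UINT32_MAX : list->bounds[0];
}
```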
commit fe96c9a9be08ca89d660da6c926a3fc899e7f27a
Author: Karl Williamson <[email protected]>
Date: Thu Sep 12 21:06:45 2019 -0600
regcomp.c: Code for qr/(?[...]) handle restart
There is an existing mechanism for code to realize it needs to restart
parsing from the beginning, say because it needs to upgrade to UTF-8.
The code for /(?[...])/ did not participate in this. Currently I don't
know of any case where it needs to, except perhaps some very
hard-to-reproduce case in which branch instructions start needing to
handle more than 16 bits, but I kind of doubt it. Anyway, the next few
commits introduce the possibility.
commit 381ccc56fa4f1c2af7d023c697dbc458876bc52c
Author: Karl Williamson <[email protected]>
Date: Wed Jun 26 13:02:35 2019 -0600
XXX Configure backtrace
commit 67eebd462a83809f2128e75b714b7ab6292d3770
Author: Karl Williamson <[email protected]>
Date: Sun Sep 15 16:08:13 2019 -0600
regcomp.sym: Fix comment
The length of an EXACTish node occupies the same bits as the FLAGS field in
other nodes; it doesn't "precede the length", as previously claimed.
-----------------------------------------------------------------------
--
Perl5 Master Repository