In perl.git, the branch smoke-me/khw-encode has been created
<http://perl5.git.perl.org/perl.git/commitdiff/583134622b558a2aaa0ccd194c14bf1ae78e1a78?hp=0000000000000000000000000000000000000000>
at 583134622b558a2aaa0ccd194c14bf1ae78e1a78 (commit)
- Log -----------------------------------------------------------------
commit 583134622b558a2aaa0ccd194c14bf1ae78e1a78
Author: Karl Williamson <[email protected]>
Date: Thu Sep 15 09:09:07 2016 -0600
XXX incomplete: Add sv_utf8_decode_flags
M embed.fnc
M embed.h
M proto.h
M sv.c
M sv.h
commit 341d064f4929671f78f53f392d40e8e29f4f4c9a
Author: Karl Williamson <[email protected]>
Date: Wed Sep 14 22:40:23 2016 -0600
customized
M t/porting/customized.dat
commit 4068078f7af608a99f148c2052e1a6e9f25e1b05
Author: Karl Williamson <[email protected]>
Date: Thu Sep 1 12:20:52 2016 -0600
Use core REPLACEMENT CHARACTER definition
This allows the code to now work on EBCDIC as well.
M cpan/Encode/Encode/encode.h
commit ee9b0e6e602dfdd8cfe133ba251dcb964b5b7e59
Author: Karl Williamson <[email protected]>
Date: Thu Sep 1 12:16:00 2016 -0600
XXX commit msg: Encode.xs: Rmv unused function
M cpan/Encode/Encode.xs
commit 9be566b07dbc37351360acba68a28bdeb68b28fd
Author: Karl Williamson <[email protected]>
Date: Thu Sep 1 12:12:39 2016 -0600
Encode.xs: white-space only
M cpan/Encode/Encode.xs
commit ab7c6894c26b478a7c48f775a2a6f517f6355e20
Author: Karl Williamson <[email protected]>
Date: Thu Sep 1 12:12:06 2016 -0600
XXX maybe more in commit msg: Speed up Encode UTF-8 validation checking
This replaces the current scheme for checking UTF-8 validity by one
in which normal processing doesn't require having to decode the UTF-8
into code points. The copying of characters individually from the input
to the output is changed to be a single operation for each entire span
of valid input at once.
Thus in the normal case, what ends up happening is a tight loop to
check the validity, and then a memmove of the entire input to the
output, then return.
If an error is found, it copies all the valid input before the error,
then handles the character in error, then positions to the next input
position, and repeats the whole process starting from there.
It uses the functionality available from the Perl 5 core to look at
just the bytes that comprise the UTF-8 to make the determination,
converting to code points only those that are defective somehow, in
order to display them in warnings and error messages.
Thus, this does not need to know about the intricacies of UTF-8
malformations, relying on the core to handle this.
This cannot be pushed to CPAN until Devel::PPPort has been updated to
implement all the functions now needed.
M cpan/Encode/Encode.pm
M cpan/Encode/Encode.xs
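The span-at-a-time scheme described in the commit message above can be sketched roughly as follows. This is an illustration only, not the Encode.xs code: valid_char_len() and copy_valid_spans() are hypothetical names, the validity rules are those of strict Unicode UTF-8 rather than whatever the core functions accept, and error handling is simplified to emitting U+FFFD and skipping a single byte.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length (1..4) of the well-formed UTF-8 character
 * starting at s, or 0 if the bytes there are malformed.  Only the bytes
 * are inspected; no code point is computed. */
static size_t valid_char_len(const unsigned char *s, const unsigned char *end)
{
    size_t avail = (size_t)(end - s);
    unsigned char b0, lo, hi;

    if (avail == 0) return 0;
    b0 = s[0];
    if (b0 < 0x80) return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        return (avail >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        if (avail < 3) return 0;
        lo = (b0 == 0xE0) ? 0xA0 : 0x80;   /* excludes overlongs */
        hi = (b0 == 0xED) ? 0x9F : 0xBF;   /* excludes surrogates */
        if (s[1] < lo || s[1] > hi) return 0;
        return ((s[2] & 0xC0) == 0x80) ? 3 : 0;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {
        if (avail < 4) return 0;
        lo = (b0 == 0xF0) ? 0x90 : 0x80;   /* excludes overlongs */
        hi = (b0 == 0xF4) ? 0x8F : 0xBF;   /* excludes above U+10FFFF */
        if (s[1] < lo || s[1] > hi) return 0;
        return ((s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) ? 4 : 0;
    }
    return 0;   /* continuation byte, 0xC0, 0xC1, or 0xF5..0xFF */
}

/* Copy src to dst.  Each maximal run of valid input is copied with one
 * memmove; each offending byte is replaced by U+FFFD (assumed policy).
 * dst must have room for 3 * len bytes.  Returns the bytes written. */
size_t copy_valid_spans(const unsigned char *src, size_t len,
                        unsigned char *dst)
{
    static const unsigned char repl[3] = { 0xEF, 0xBF, 0xBD };
    const unsigned char *s = src;
    const unsigned char *end = src + len;
    unsigned char *d = dst;

    while (s < end) {
        const unsigned char *span = s;
        size_t n;

        /* Tight loop: validity checking only, no decoding. */
        while (s < end && (n = valid_char_len(s, end)) != 0)
            s += n;

        /* One bulk copy for the entire valid span. */
        memmove(d, span, (size_t)(s - span));
        d += s - span;

        if (s < end) {          /* stopped on an error */
            memcpy(d, repl, sizeof repl);
            d += sizeof repl;
            s++;                /* reposition and start over from here */
        }
    }
    return (size_t)(d - dst);
}
```

For fully valid input the outer loop runs once: one validation pass, one memmove, then return.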
commit 8fdf9723488670ca1775f9ff3faaac63f8ef62b0
Author: Karl Williamson <[email protected]>
Date: Mon Oct 10 21:18:37 2016 -0600
XXX pod, delta: Add utf8n_to_uvchr_error
This new function behaves like utf8n_to_uvchr(), but takes an extra
parameter that points to a U32 which will be set to 0 if no errors are
found; otherwise each error found will set a bit in it. This can be
used by the caller to figure out precisely what the error(s) is/are.
Previously, one would have to capture and parse the warning/error
messages raised. This can be used, for example, to customize the
messages to the expected end-user's knowledge level.
M embed.fnc
M embed.h
M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t
M proto.h
M utf8.c
M utf8.h
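The idea of reporting errors through a bit mask rather than through message text can be sketched as below. This is not the signature or behavior of utf8n_to_uvchr_error(); decode_one() and the ERR_* names and values are hypothetical, the decoder handles only sequences of up to four bytes, and it reports an overlong only for a complete sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error bits; not the names or values used in utf8.h. */
#define ERR_EMPTY         (1u << 0)  /* zero-length input */
#define ERR_CONTINUATION  (1u << 1)  /* input starts with a continuation byte */
#define ERR_BAD_START     (1u << 2)  /* start byte not handled by this sketch */
#define ERR_NON_CONT      (1u << 3)  /* expected continuation byte is missing */
#define ERR_SHORT         (1u << 4)  /* input ends in mid-character */
#define ERR_LONG          (1u << 5)  /* overlong encoding */

/* Decode one character from s[0..len-1].  *errors is set to 0 if nothing
 * is wrong; otherwise one bit is set per problem found and 0 is returned. */
uint32_t decode_one(const unsigned char *s, size_t len, uint32_t *errors)
{
    size_t expect, i;
    uint32_t cp, min;
    unsigned char b0;

    *errors = 0;
    if (len == 0) { *errors |= ERR_EMPTY; return 0; }

    b0 = s[0];
    if (b0 < 0x80) return b0;
    if ((b0 & 0xC0) == 0x80) { *errors |= ERR_CONTINUATION; return 0; }

    if      ((b0 & 0xE0) == 0xC0) { expect = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { expect = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { expect = 4; cp = b0 & 0x07; min = 0x10000; }
    else { *errors |= ERR_BAD_START; return 0; }

    for (i = 1; i < expect; i++) {
        if (i >= len)               { *errors |= ERR_SHORT;    break; }
        if ((s[i] & 0xC0) != 0x80)  { *errors |= ERR_NON_CONT; break; }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* A complete sequence that encodes too small a value is overlong. */
    if (i == expect && cp < min)
        *errors |= ERR_LONG;

    return *errors ? 0 : cp;
}
```

A caller can then test individual bits (for example `errors & ERR_SHORT`) and compose its own wording for the end user, with no need to parse warning text.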
commit 78d01b0f31637f805415baf35079fcc810f5749b
Author: Karl Williamson <[email protected]>
Date: Sat Oct 8 21:19:18 2016 -0600
utf8n_to_uvchr(): Make a parameter const
M embed.fnc
M proto.h
M utf8.c
commit bf79738e76567fe04e6c89f05e1ebe03dccc0a17
Author: Karl Williamson <[email protected]>
Date: Wed Oct 5 19:09:02 2016 -0600
utf8n_to_uvchr(): Note multiple malformations
Some UTF-8 sequences can have multiple malformations. For example, a
sequence can be the start of an overlong representation of a code point,
and still be incomplete. Until this commit what was generally done was
to stop looking when the first malformation was found. This was not
correct behavior, as that malformation may be allowed, while another
unallowed one went unnoticed. This commit refactors the error handling
of this function to set a flag and keep going if a malformation is found
that doesn't preclude others. Then each is handled in a loop at the end,
warning if warranted. The result is that there is a warning for each
malformation for which warnings should be generated, and an error return
is made if any one is disallowed.
In the case of overflow, the sequence is automatically also for a
non-Unicode code point and for one above 31 bits; these are not
independent malformations, so only one warning is output--the most dire.
This will speed up the normal case slightly, as the test for overflow is
pulled out of the loop, allowing the UV to overflow. Then a single test
is done after the loop to see whether there was overflow.
M ext/XS-APItest/t/utf8.t
M pod/perldiag.pod
M t/op/utf8decode.t
M utf8.c
M utf8.h
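The "set a flag, keep going, then handle each one in a loop at the end" structure described in the commit message above can be sketched as follows. The FOUND_* bits and resolve_malformations() are hypothetical names, and the policy shown (one warning per disallowed malformation) is a simplifying assumption, not the rule utf8.c applies.

```c
#include <stdint.h>

/* Hypothetical malformation bits, for illustration only. */
#define FOUND_SHORT     (1u << 0)   /* sequence is incomplete */
#define FOUND_LONG      (1u << 1)   /* sequence is overlong */
#define FOUND_NON_CONT  (1u << 2)   /* unexpected non-continuation byte */

/* 'found' holds every malformation noted while scanning one sequence;
 * 'allowed' holds the ones the caller is willing to accept.
 * Returns 1 if the sequence is acceptable, 0 if any found malformation
 * is disallowed.  *n_warnings receives the number of warnings that
 * would be raised (assumed here: one per disallowed malformation). */
int resolve_malformations(uint32_t found, uint32_t allowed,
                          unsigned *n_warnings)
{
    int ok = 1;

    *n_warnings = 0;
    while (found) {
        uint32_t bit = found & (~found + 1u);   /* lowest set bit */

        found &= ~bit;
        if (!(bit & allowed)) {
            ok = 0;             /* error return if any one is disallowed */
            (*n_warnings)++;
        }
    }
    return ok;
}
```

With found = FOUND_LONG | FOUND_SHORT and only FOUND_LONG allowed, stopping at the first malformation could accept the sequence; looping over every bit catches the disallowed FOUND_SHORT as well.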
commit 884bdbe3cc3353d353dee19293ac1b494aa516ba
Author: Karl Williamson <[email protected]>
Date: Sat Oct 8 20:53:31 2016 -0600
APItest/t/utf8.t: Fix improper tests
These two tests contain overlong malformations in addition to the ones
they purport to test. Make them not overlong, so that each tests just
one thing.
M ext/XS-APItest/t/utf8.t
commit 4e4ac410df520b4be0c903b1cbb782e96742e963
Author: Karl Williamson <[email protected]>
Date: Fri Oct 7 15:07:57 2016 -0600
APItest/t/utf8.t: Indent a bunch of code
And reflow to fit in 80 columns. This is in preparation for the next
commit which will enclose this new code with two more for loops.
Several lines that were missing semi-colons have these added (they were
at the end of nested blocks, so it wasn't an error).
M ext/XS-APItest/t/utf8.t
commit 9edd73835be25396c6c21796184af257e3db8a4a
Author: Karl Williamson <[email protected]>
Date: Wed Oct 5 18:34:15 2016 -0600
APItest/t/utf8.t: Add missing test
Under some circumstances we weren't validating that the generated
warnings are correct. This required reordering some 'if' tests, and
revised special casing of the overflow test.
M ext/XS-APItest/t/utf8.t
commit 70de9a99a4b79d5527d8e3f3a1aa5a4d6b03d006
Author: Karl Williamson <[email protected]>
Date: Wed Oct 5 18:32:55 2016 -0600
APItest/t/utf8.t: Rename test for clarity
M ext/XS-APItest/t/utf8.t
commit 702eed37a7232b9b1ffe0619e533e98aa436129c
Author: Karl Williamson <[email protected]>
Date: Sun Oct 2 21:50:10 2016 -0600
utf8.c: Extract some code into 2 functions
This is in preparation for each piece of functionality to be used in a
new place in a future commit.
M embed.fnc
M embed.h
M proto.h
M utf8.c
commit d9132190f6fbcb3794d867e57c6fb4b0f13b07ae
Author: Karl Williamson <[email protected]>
Date: Sun Oct 2 21:31:52 2016 -0600
utf8.c: Rename a couple of macros for clarity
These were recently added in 2b47960981adadbe81b9635d4ca7861c45ccdced.
This also removes the #undefs of these in preparation for them to be
used later in the file.
M utf8.c
commit f1f8aa4d9f9546f67ecc8d2a9cd8bee0b0499aef
Author: Karl Williamson <[email protected]>
Date: Sun Oct 2 21:09:27 2016 -0600
utf8.h: Change some flag definition constants
These #defines give flag bits in a U32. This commit opens a gap that
will be filled in a future commit. A test file has to change to
correspond, as it duplicates the defines.
M ext/XS-APItest/t/utf8.t
M utf8.h
commit 59fb8148d8e008ad496a84101922c7041c673835
Author: Karl Williamson <[email protected]>
Date: Sun Oct 2 21:05:15 2016 -0600
APItest/t/utf8.t: Extract code to common function
There are many instances of this simple code to dump an array of trapped
warning messages. The problem is that they display better when joined
by "" rather than by a comma. Rather than change each instance to do
that, I changed each instance to a sub call and changed it there.
M ext/XS-APItest/t/utf8.t
commit 6d533da66a4afd8b6358db96647e221aef87f0b0
Author: Karl Williamson <[email protected]>
Date: Fri Sep 30 12:42:45 2016 -0600
utf8.c: Add some UNLIKELY()s
for branch prediction
M utf8.c
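For readers unfamiliar with branch-prediction hints, a macro of this kind is commonly built on the GCC/Clang __builtin_expect() builtin. The sketch below uses a hypothetical MY_UNLIKELY() rather than the core's own UNLIKELY() definition, and first_non_ascii() is likewise made up for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for a branch-prediction hint macro. */
#if defined(__GNUC__) || defined(__clang__)
#  define MY_UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
#  define MY_UNLIKELY(cond) (!!(cond))
#endif

/* Return the offset of the first byte with the high bit set, or len if
 * there is none.  The hint tells the compiler that the common case is
 * plain ASCII, so the early return is the cold path. */
size_t first_non_ascii(const unsigned char *s, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (MY_UNLIKELY(s[i] >= 0x80))
            return i;
    }
    return len;
}
```

The hint changes only code layout and prediction; the result is the same with or without it.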
commit a756f28e2ce0a6142b1eb16a13855ffd6cdaa34f
Author: Karl Williamson <[email protected]>
Date: Wed Sep 28 15:05:17 2016 -0600
Add details to UTF-8 malformation error messages
I've long been unsatisfied with the information contained in the
error/warning messages raised when some input is malformed UTF-8, but
have been reluctant to change the text in case someone is relying on
it. One reason that someone might be parsing the messages is that there
has been no convenient way to otherwise pin down what the exact
malformation might be. A couple of commits from now will add a facility
to get the type of malformation unambiguously. This will be a better
mechanism to use for those rare modules that need to know what's the
exact malformation.
So, I will fix and issue pull requests for any module broken by this
commit.
The messages are changed by now dumping (in \xXY format) the bytes that
make up the malformed character, and extra details are added in most
cases.
Messages about overlongs now display the code point they evaluate to and
what the shortest UTF-8 sequence for generating that code point is.
Messages about overflowing now just display that it overflows, since the
entire byte sequence is now dumped. The previous message displayed just
the byte which was being processed where overflow was detected, but that
is not helpful at all.
M embed.fnc
M embed.h
M ext/XS-APItest/t/utf8.t
M lib/utf8.t
M proto.h
M t/io/utf8.t
M t/lib/warnings/utf8
M t/op/pack.t
M t/op/utf8decode.t
M utf8.c
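Dumping the offending bytes in \xXY format, as the commit message above describes, amounts to something like the following. dump_bytes() is a hypothetical helper, not the function added to utf8.c, and the lowercase hex digits are an assumption about the output style.

```c
#include <stddef.h>
#include <stdio.h>

/* Format s[0..len-1] into buf as "\xXY\xXY...", always NUL-terminated.
 * Output is truncated to whole "\xXY" groups if buf is too small.
 * Returns buf. */
char *dump_bytes(const unsigned char *s, size_t len,
                 char *buf, size_t bufsize)
{
    size_t pos = 0;
    size_t i;

    if (bufsize == 0)
        return buf;
    buf[0] = '\0';

    /* Each byte needs 4 characters, plus room for the trailing NUL. */
    for (i = 0; i < len && pos + 4 < bufsize; i++) {
        snprintf(buf + pos, bufsize - pos, "\\x%02x", (unsigned)s[i]);
        pos += 4;
    }
    return buf;
}
```

A message can then embed the result, so that the reader sees the whole malformed sequence rather than only the byte at which the problem was detected.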
commit 91ad2c4fb2e9b2a98b547994d70466bce78625c6
Author: Karl Williamson <[email protected]>
Date: Wed Sep 28 10:19:03 2016 -0600
utf8.c: Consolidate duplicate error msg text
This text is generated in 2 places; consolidate into one place.
M embed.fnc
M embed.h
M proto.h
M utf8.c
-----------------------------------------------------------------------
--
Perl5 Master Repository