In perl.git, the branch smoke-me/khw-encode has been created <http://perl5.git.perl.org/perl.git/commitdiff/c2c4578f00d535f111698c2aadbb4aa2cee2caf3?hp=0000000000000000000000000000000000000000>
        at  c2c4578f00d535f111698c2aadbb4aa2cee2caf3 (commit)

- Log -----------------------------------------------------------------
commit c2c4578f00d535f111698c2aadbb4aa2cee2caf3
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 15 09:09:07 2016 -0600

    XXX incomplete: Add sv_utf8_decode_flags

M embed.fnc
M embed.h
M proto.h
M sv.c
M sv.h

commit cb16414c43485413e3146d3d560dcb40b088abce
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 15 09:06:39 2016 -0600

    perlapi: Minor clarifications to sv_utf8_decode

M sv.c

commit 85759ad705520295924451f81a52050a70de23c3
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 22:40:23 2016 -0600

    customized

M t/porting/customized.dat

commit 467d40072504c288dc2ffad3dc50aeecf6448526
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 1 12:20:52 2016 -0600

    Use core REPLACEMENT CHARACTER definition

    This allows the code to now work on EBCDIC as well.

M cpan/Encode/Encode/encode.h

commit 5f03026600264f8f446fd8a06d49d3b42a83b03d
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 1 12:16:00 2016 -0600

    XXX commit msg: Encode.xs: Rmv unused function

M cpan/Encode/Encode.xs

commit 17cc6f7ed3774e3f472f61e103d1a1fda982a3b1
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 1 12:12:39 2016 -0600

    Encode.xs: white-space only

M cpan/Encode/Encode.xs

commit 1962be345e86b9fa3c90f5a6b041895b62b4149a
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 1 12:12:06 2016 -0600

    XXX maybe more in commit msg: Speed up Encode UTF-8 validation checking

    This replaces the current scheme for checking UTF-8 validity by one
    in which normal processing doesn't require having to decode the
    UTF-8 into code points. The copying of characters individually from
    the input to the output is changed to be a single operation for each
    entire span of valid input at once. Thus in the normal case, what
    ends up happening is a tight loop to check the validity, and then a
    memmove of the entire input to the output, then return.
    If an error is found, it copies all the valid input before the
    error, then handles the character in error, then positions to the
    next input position, and repeats the whole process starting from
    there.

    It uses the functionality available from the Perl 5 core to look at
    just the bytes that comprise the UTF-8 to make the determination,
    converting to code points only those that are defective somehow in
    order to display them in warnings and error messages. Thus, this
    does not need to know about the intricacies of UTF-8 malformations,
    relying on the core to handle this.

    This cannot be pushed to CPAN until Devel::PPPort has been updated
    to implement all the functions now needed.

M cpan/Encode/Encode.pm
M cpan/Encode/Encode.xs

commit 95b7397c9f5b0bc6b6f59ea73dd254453ea66803
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 20:15:56 2016 -0600

    XXX tests: Add is_utf8_buf_flags() and use it

    This encodes a simple pattern that may not be immediately obvious to
    someone needing it. If you have a fixed-size buffer that is full of
    purportedly UTF-8 bytes, is it valid or not? It's easy to do, as
    shown in this commit. The file test operators -T and -B can be
    simplified by using this function.

M embed.fnc
M embed.h
M inline.h
M pp_sys.c
M proto.h

commit 313692bc0c0c41eda9d5b85ace7621fa9dff4a07
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 20:03:16 2016 -0600

    XXX Flesh out, tests: Add is_utf8_foo()

M embed.fnc
M embed.h
M inline.h
M proto.h

commit 5015592ed0469e0292dde6a5ca5692b772b1510f
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 19:57:46 2016 -0600

    Move #define to different header

    Instead of having a comment in one header pointing to the #define in
    the other, remove the indirection and just have the #define itself
    where it is needed.

M inline.h
M utf8.h

commit 95a1ac9157055b547b0731096a6b5fb8325264c0
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 19:49:52 2016 -0600

    perlapi: Clarify docs for some is_utf8_foo functions

M inline.h

commit 03f8936e099dc9853a80cdc5544478e3ae94048e
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 18:54:23 2016 -0600

    Add isUTF8_CHAR_flags() macro

    This is like the previous 2 commits, but the macro takes a flags
    parameter so any combination of the disallowed flags may be used.
    The others, along with the original isUTF8_CHAR(), are the most
    commonly desired strictures, and use an implementation of a,
    hopefully, inlined trie for speed. This is for generality and the
    major portion of its implementation isn't inlined.

M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t
M utf8.h

commit b34eff6f793201d3a6c30e679cbd328fd6de49e3
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Sep 12 16:52:41 2016 -0600

    Add macro for Unicode Corregindum #9 strict

    This macro follows Unicode Corrigendum #9 to allow non-character
    code points. These are still discouraged but not completely
    forbidden. It's best for code that isn't intended to operate on
    arbitrary other code text to use the original definition, but code
    that does things, such as source code control, should change to use
    this definition if it wants to be Unicode-strict.

    Perl can't adopt C9 wholesale, as it might create security holes in
    existing applications that rely on Perl keeping non-chars out.

M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t
M regcharclass.h
M regen/regcharclass.pl
M utf8.h
M utfebcdic.h

commit f2ee67210ff845671b84be61117b77b4653ba396
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Sep 12 13:38:22 2016 -0600

    Add macro for determining if UTF-8 is Unicode-strict

M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t
M regcharclass.h
M regen/regcharclass.pl
M utf8.h
M utfebcdic.h

commit bd3bc7853f81dddc4a9b4d4c7e90c579b6daa23f
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Sep 12 14:30:15 2016 -0600

    perlapi: Clarify isUTF8_CHAR()

M utf8.h

commit 1cf20e5c1daac8241495d7ab3b7395ffd1beb574
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 17:09:51 2016 -0600

    inline.h: Add 'const's; avoid hiding outer variable

    This changes some formal parameters to be const, and avoids reusing
    the same variable name within an inner block, to avoid confusion

M embed.fnc
M inline.h
M proto.h

commit 1bd1a5e91eb89c68ca877437520e7e6b29e5e530
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 8 11:34:15 2016 -0600

    Add tests for is_valid_partial_utf8_char_flags()

M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t

commit 6823d9254c49c8c32592dec7d4c993f22ab5850d
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Sep 11 22:18:57 2016 -0600

    Add is_utf8_valid_partial_char_flags()

    This is a generalization of is_utf8_valid_partial_char to allow the
    caller to automatically exclude things such as surrogates.

M embed.fnc
M embed.h
M inline.h
M proto.h

commit 6df9c77fd6294ecd27190557d4b199ec003d4008
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Sep 11 09:40:37 2016 -0600

    perlapi: Reword description of is_utf8_valid_partial_char

M inline.h

commit c48e530e5e830cb857a0e600429ea398e1afeb18
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:27:37 2016 -0600

    Fix off-by-one error in is_utf8_valid_partial_char()

M inline.h

commit 1bd73796eae740f0f363fce9fe65d1f7a4db350d
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:24:48 2016 -0600

    handy.h: Comment memEQs and memNEs

M handy.h

commit 900e9387f57443f5bdbb9393f5d699ff12d1982a
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:18:59 2016 -0600

    utf8.c: Add some UNLIKELYs

M utf8.c

commit 1959a7939b6acdec6cecc2a674518af85ca11398
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:18:16 2016 -0600

    utf8.h: Add comment, white-space changes

M utf8.h

commit d9f80678aa773efd3c04c0bdc97ef65b00a3c381
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:09:44 2016 -0600

    Enhance and rename is_utf8_char_slow()

    This changes the name of this helper function and adds a parameter
    and functionality to allow it to exclude problematic classes of code
    points, the same ones excludeable by utf8n_to_uvchar(), like
    surrogates or non-character code points.

M embed.fnc
M embed.h
M inline.h
M proto.h
M utf8.c
M utf8.h

commit 188a2b00ada8185fa536473a43bab20aa7605840
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 7 22:22:01 2016 -0600

    APItest/t/utf8.t: Add tests

    These fill in gaps in current testing. In particular all the
    overlong UTF-8 possible edge cases are now tested.

M ext/XS-APItest/t/utf8.t

commit b6f7cebb4d97b8666b24766d43e369f7fe77fea4
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 7 22:14:38 2016 -0600

    APItest/utf8.t: Some clean up

    This adds some information to test names, does some white-space
    alignments, changes one test to stress things slightly more, and
    adds a 'use bytes' because in some cases the desired byte-oriented
    output was not showing up.

M ext/XS-APItest/t/utf8.t

commit 0ff82715eeaee878beafe899a0dca8c6f670cec0
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Sep 4 21:32:08 2016 -0600

    Test isUTF8_CHAR()

M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t

commit 59c60a40e62af5eabbbc6fe073120d5d2daac783
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:19:42 2016 -0600

    lib/warnings/utf8: Reinstate warning test

    I removed this in 35f8c9bd0ff4f298f8bc09ae9848a14a9667a95a, thinking
    the warning was no longer being raised. But in fact, it was showing
    a bug, now fixed by the previous commit.

M t/lib/warnings/utf8

commit 0094884088c3d72085333f53c123e60e5ab04bd4
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 21:15:04 2016 -0600

    Revamp overlong handling in is_utf8_char_slow, fixing a bug

    This combines EBCDIC and ASCII branches as much as possible, and
    fixes a bug that showed up only on EBCDIC platforms, and 64-bit
    ASCII ones for the highest overlong, where it could erroneously
    conclude that a sequence was an overlong.

M utf8.c

commit a4f913a9ff912ba9d59d1ea42a91fb0e407efe0b
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 21:43:42 2016 -0600

    Forbid UTF-8 start bytes 0x FF on 32-bit ASCII

    These all are for code points that won't fit into a 32 bit word.

M utf8.h

commit 62d802bd3241f7cd03f406344de7516a1cbc2ba8
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 21:06:39 2016 -0600

    utf8.c: Fix typo in comment, add some comments

M utf8.c

commit f04b0b8d2d2525eed3f5cbfd17ad04f7d6433c10
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 09:00:03 2016 -0600

    utf8.c: Extract duplicate code to common fcn

    Actually the code isn't quite duplicate, but should be because one
    instance is wrong. This failure would only show up on 64-bit EBCDIC
    platforms.

M embed.fnc
M embed.h
M ext/XS-APItest/t/utf8.t
M proto.h
M utf8.c

commit 4adf9e30152c6b04a2be2384ff2f08eba17d7ab3
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 08:54:36 2016 -0600

    handy.h: Add memLT, memLE, memGT, memGE

    These correspond to strLT, etc. I am deferring documenting them in
    case this turns out to be a bad idea for some reason.

M handy.h

commit 2ceab79252696bcdcd1a85aa33f7894890124f8f
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 08:46:18 2016 -0600

    XXX unconditionally do memcmp if not sane

M perl.h

commit 45c86a51c68c42f7b5dccb4685a9e1edf5e4868f
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 3 14:12:27 2016 -0600

    isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCII

    This changes the macro isUTF8_CHAR to have the same number of code
    points built-in for EBCDIC as ASCII. This obsoletes the
    IS_UTF8_CHAR_FAST macro, which is removed.

    Previously, the code generated by regen/regcharclass.pl for ASCII
    platforms was hand copied into utf8.h, and LIKELY's manually added,
    then the generating code was commented out. Now this has been done
    with EBCDIC platforms as well. This makes regenerating
    regcharclass.h faster.

    The copied macro in utf8.h is moved by this commit to within the
    main code section for non-EBCDIC compiles, cutting the number of
    #ifdef's down, and the comments about it changed somewhat.

M regcharclass.h
M regen/regcharclass.pl
M utf8.h
M utfebcdic.h

commit 8d9b3365a77be3b6ad6cfbfe520b458da2e08f7e
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 3 12:15:29 2016 -0600

    regen/regcharclass.pl: surrogates are code points

    They are not "characters"

M regcharclass.h
M regen/regcharclass.pl

commit 0899da10c082013172c212fd291d9e558c849339
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 3 16:13:15 2016 -0600

    Add IS_UTF8_INVARIANT and IS_UVCHR_INVARIANT to API

M utf8.h

commit a2432ca2e2ba85e8c8c00ef4febf80c842fb5d44
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 7 22:03:21 2016 -0600

    utfebcdic.h: Fix typo in comment

M utfebcdic.h

commit 26c211867588c59e51aae4b9132dba1a35dcb364
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 16:05:35 2016 -0600

    Add #defines for XS code for Unicode Corregindum 9

    These are convenience macros.

M utf8.c
M utf8.h

commit 056961ce93cc98dc2f60658fc864f7393ab98942
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 16:02:50 2016 -0600

    perlapi: Clarify utf8n_to_uvchr entry

M utf8.c

commit e3fbbd1878d66b0d7d180ed8526964c7124e32d9
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 15:57:34 2016 -0600

    perlunicode: Fix typo

M pod/perlunicode.pod

commit 5f4c87effa7a251db8fbc5d04dbb05b59cd98291
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Sep 13 16:40:44 2016 -0600

    append_utf8_from_native_byte: Add parens for clarity

    I can never remember the precedence of dereference and ++.

M inline.h
-----------------------------------------------------------------------

--
Perl5 Master Repository
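
The span-copy scheme described in the "Speed up Encode UTF-8 validation checking" commit above can be sketched in standalone C roughly as follows. This is not the Encode.xs code. The names valid_char_len() and copy_valid_utf8() are hypothetical, the validator applies strict Unicode rules by hand instead of calling into the Perl core, and a malformation is handled by emitting U+FFFD and skipping one byte, which is an assumption made to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a Perl API): returns the byte length of the
 * well-formed UTF-8 character starting at s, given avail bytes, or 0 if
 * the sequence is malformed or truncated.  Assumes strict Unicode rules:
 * no overlongs, no surrogates, nothing above U+10FFFF. */
size_t valid_char_len(const unsigned char *s, size_t avail)
{
    size_t len, i;
    unsigned char lo = 0x80, hi = 0xBF;     /* allowed range of 2nd byte */

    if (avail == 0) return 0;
    if (s[0] < 0x80) return 1;              /* ASCII */
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;        /* exclude overlongs */
        if (s[0] == 0xED) hi = 0x9F;        /* exclude surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;        /* exclude overlongs */
        if (s[0] == 0xF4) hi = 0x8F;        /* exclude > U+10FFFF */
    } else {
        return 0;                           /* stray continuation or bad start */
    }

    if (avail < len) return 0;              /* truncated character */
    if (s[1] < lo || s[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}

/* Copies in[0..len) to out, replacing each offending byte with U+FFFD.
 * out must have room for 3 * len bytes.  Returns the bytes written. */
size_t copy_valid_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    static const unsigned char repl[3] = { 0xEF, 0xBF, 0xBD };
    const unsigned char *s = in;
    const unsigned char *end = in + len;
    unsigned char *d = out;

    while (s < end) {
        const unsigned char *span = s;
        size_t n;

        /* Tight loop: step over well-formed characters without decoding. */
        while (s < end && (n = valid_char_len(s, (size_t)(end - s))) != 0)
            s += n;

        /* One bulk copy for the entire valid span. */
        memmove(d, span, (size_t)(s - span));
        d += s - span;

        if (s < end) {                      /* stopped at a malformation */
            memmove(d, repl, sizeof repl);
            d += sizeof repl;
            s++;                            /* resume after the bad byte */
        }
    }
    return (size_t)(d - out);
}
```

When the input is entirely valid, the outer loop runs once: one validation pass, then one memmove. The commit message suggests the real code leaves the per-character validity decision to core functions, so it would not carry a hand-written table like the one in valid_char_len().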