On Thu, 14 Sep 2017 09:44:54 +0200, p...@cpan.org wrote:

> > BYTE/BLOB/TEXT tests require three types of data
> >
> >  • Pure ASCII
> >  • Correct UTF-8 (with complex combinations)
>
> subtest: Correct UTF-8 TEXT with only code points in range U+00 .. U+7F
> (ASCII subset)
> subtest: Correct UTF-8 TEXT with only code points in range U+00 .. U+FF
> (Latin1 subset)
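
As a rough illustration of the two subtests quoted above, here is a minimal sketch of how test strings could be sorted by their highest code point. It is written in Python purely for illustration. The function name `subset_of` and the sample strings are made up and not part of any existing test suite.

```python
def subset_of(text: str) -> str:
    """Return the smallest of the quoted subsets that covers every code point."""
    highest = max(map(ord, text), default=0)
    if highest <= 0x7F:
        return "ascii"    # U+00 .. U+7F
    if highest <= 0xFF:
        return "latin1"   # U+00 .. U+FF
    return "other"        # anything beyond the two quoted subtests


# Hypothetical sample data, one string per category.
samples = ["plain text", "na\u00efve caf\u00e9", "\u1edd"]
for s in samples:
    # Every sample is correct UTF-8 once encoded; only the code point range differs.
    print(subset_of(s), len(s), len(s.encode("utf-8")))
```

Note that a Latin1-subset string is still multi-byte in UTF-8, so character length and byte length differ even in that subtest.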
ASCII:         U+000000 .. U+00007F
iso-8859-*:  + U+000080 .. U+0000FF   (includes cp1252)
iso-10646:   + U+000100 .. U+0007FF
             + U+000800 .. U+00D7FF
             + U+00E000 .. U+00FFFF
utf-8 1):    + U+010000 .. U+10FFFF
             + surrogates
             + bidirectionality
             + normalization
             + collation (order by)

1) Some iso-10646 implementations already support supplementary
   codepoints. Depends on the version of the standard.

With 100% Unicode, data may go bust if stored in UTF-8 fields.

Unicode defines a "correct" order of combined characters. I don't know
exactly what the order is, but if a letter has more than one combining
character in it, like

  ờ  U01edd  \N{LATIN SMALL LETTER O WITH HORN AND GRAVE}
  ȭ  U0022d  \N{LATIN SMALL LETTER O WITH TILDE AND MACRON}

inserting

  "LATIN SMALL LETTER O" "WITH GRAVE" "WITH HORN"

is allowed to return as

  "LATIN SMALL LETTER O" "WITH HORN" "WITH GRAVE"
or as
  "LATIN SMALL LETTER O WITH GRAVE" "WITH HORN"
or
  "LATIN SMALL LETTER O WITH HORN" "WITH GRAVE"
or
  "LATIN SMALL LETTER O WITH HORN AND GRAVE"

They all represent the same grapheme. From a user perspective when
dealing with Unicode, that is fine. For testing purposes it is not :(

So, *if* you test with combining characters (that are not represented
by a single codepoint), make sure the data matches the Unicode defined
order.

FYI, this is why I still don't support *real* binary in perl6's Text::CSV

-- 
H.Merijn Brand  http://tux.nl   Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.27   porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/        http://www.test-smoke.org/
http://qa.perl.org   http://www.goldmark.org/jeff/stupid-disclaimers/
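
The combining-order point in the message above can be checked outside any database. What follows is a minimal sketch in Python, chosen only for illustration; nothing here is DBI or Text::CSV API. It assumes a test would compare values after normalizing both sides to the same Unicode normalization form, and the helper name `same_text` is made up.

```python
import unicodedata

HORN = "\u031b"   # COMBINING HORN
GRAVE = "\u0300"  # COMBINING GRAVE ACCENT

# The spellings listed in the message, all meant to be the same grapheme.
spellings = [
    "o" + GRAVE + HORN,   # o, grave, horn (the order as inserted)
    "o" + HORN + GRAVE,   # o, horn, grave
    "\u00f2" + HORN,      # LATIN SMALL LETTER O WITH GRAVE, then horn
    "\u01a1" + GRAVE,     # LATIN SMALL LETTER O WITH HORN, then grave
    "\u1edd",             # LATIN SMALL LETTER O WITH HORN AND GRAVE
]


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Canonical ordering sorts combining marks by combining class, lowest first.
print(unicodedata.combining(HORN), unicodedata.combining(GRAVE))

# Collapse every spelling to its decomposed (NFD) and composed (NFC) form.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
print(len(nfd_forms), len(nfc_forms))

# Raw comparison fails across spellings; normalized comparison does not.
print(spellings[0] == spellings[4], same_text(spellings[0], spellings[4]))
```

A test that stores one spelling and compares the fetched value byte for byte only passes if the database returns exactly what was stored. Feeding it data that is already in a normalization form, or normalizing both sides before comparing, avoids depending on that.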