On Thu, 14 Sep 2017 09:44:54 +0200, [email protected] wrote:

> > BYTE/BLOB/TEXT tests require three types of data
> >
> > • Pure ASCII
> > • Correct UTF-8 (with complex combinations)
>
> subtest: Correct UTF-8 TEXT with only code points in range U+00 .. U+7F
> (ASCII subset)
> subtest: Correct UTF-8 TEXT with only code points in range U+00 .. U+FF
> (Latin1 subset)

  ASCII:           U+000000 .. U+00007F
  iso-8859-*:    + U+000080 .. U+0000FF  (includes cp1252)
  iso-10646:     + U+000100 .. U+0007FF
                 + U+000800 .. U+00D7FF
                 + U+00E000 .. U+00FFFF
  utf-8 1):      + U+010000 .. U+10FFFF
                 + surrogates
                 + bidirectionality
                 + normalization
                 + collation (order by)

  1) some iso-10646 implementations already support supplementary
     codepoints. Depends on the version of the standard

With 100% Unicode, data may go bust if stored in UTF-8 fields
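
To make the ranges above concrete, here is a minimal sketch in Python
(not taken from any existing test suite) that builds one boundary string
per range and reports how many bytes each end takes in UTF-8. All names
(RANGES, utf8_len, is_encodable, boundary_strings) are made up for the
example; only the range limits come from the table.

```python
# Hypothetical helper names; the range limits are copied from the table.
RANGES = [
    ("ascii",         0x000000, 0x00007F),
    ("latin1",        0x000080, 0x0000FF),
    ("two-byte",      0x000100, 0x0007FF),
    ("bmp-low",       0x000800, 0x00D7FF),
    ("bmp-high",      0x00E000, 0x00FFFF),
    ("supplementary", 0x010000, 0x10FFFF),
]


def utf8_len(cp: int) -> int:
    """Number of bytes the code point occupies when encoded as UTF-8."""
    return len(chr(cp).encode("utf-8"))


def is_encodable(cp: int) -> bool:
    """False for lone surrogates (U+D800 .. U+DFFF), which strict UTF-8 rejects."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def boundary_strings() -> dict:
    """One two-character test string per range: its first and last code point."""
    return {name: chr(lo) + chr(hi) for name, lo, hi in RANGES}


if __name__ == "__main__":
    for name, lo, hi in RANGES:
        print(f"{name:14} U+{lo:06X} .. U+{hi:06X}  "
              f"{utf8_len(lo)}..{utf8_len(hi)} bytes")
```

The gap between U+00D7FF and U+00E000 is the surrogate block, which is
why it needs a test of its own: is_encodable(0xD800) is False here.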
Unicode defines a "correct" order of combined characters. I don't know
exactly what the order is, but if a letter has more than one combining
character attached to it, like
  ờ  U+1EDD  \N{LATIN SMALL LETTER O WITH HORN AND GRAVE}
  ȭ  U+022D  \N{LATIN SMALL LETTER O WITH TILDE AND MACRON}
inserting "LATIN SMALL LETTER O" "WITH GRAVE" "WITH HORN"
is allowed to return as "LATIN SMALL LETTER O" "WITH HORN" "WITH GRAVE"
or as "LATIN SMALL LETTER O WITH GRAVE" "WITH HORN" or
"LATIN SMALL LETTER O WITH HORN" "WITH GRAVE" or
"LATIN SMALL LETTER O WITH HORN AND GRAVE"
They all represent the same grapheme. From a user perspective when
dealing with Unicode, that is fine. For testing purposes it is
not :(
So, *if* you test with combining characters (that cannot be represented
as a single codepoint), make sure the data matches the Unicode defined order
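
As an illustration only, a small Python sketch using the standard
unicodedata module: the five spellings of "o with horn and grave" listed
above all normalize to the same string, so a test can compare normalized
values instead of raw ones. The names variants and same_text are made up
for the example.

```python
import unicodedata

GRAVE = "\u0300"  # COMBINING GRAVE ACCENT
HORN = "\u031b"   # COMBINING HORN

# The five spellings from the text above.
variants = [
    "o" + GRAVE + HORN,   # o, WITH GRAVE, WITH HORN
    "o" + HORN + GRAVE,   # o, WITH HORN, WITH GRAVE
    "\u00f2" + HORN,      # O WITH GRAVE, WITH HORN
    "\u01a1" + GRAVE,     # O WITH HORN, WITH GRAVE
    "\u1edd",             # O WITH HORN AND GRAVE
]


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Every variant collapses to one composed and one decomposed form.
nfc_forms = {unicodedata.normalize("NFC", v) for v in variants}
nfd_forms = {unicodedata.normalize("NFD", v) for v in variants}

if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for form in nfd_forms for c in form])
```

In the decomposed form the horn comes before the grave, so test data
written as "o" + HORN + GRAVE survives an NFD round trip byte for byte.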
FYI This is why I still don't support *real* binary in perl6's Text::CSV
--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
