On 2017-09-14 3:01 AM, H.Merijn Brand wrote:
On Thu, 14 Sep 2017 09:44:54 +0200, p...@cpan.org wrote:
BYTE/BLOB/TEXT tests require three types of data
• Pure ASCII
• Correct UTF-8 (with complex combinations)
subtest: Correct UTF-8 TEXT with only code points in range U+00 .. U+7F (ASCII
subset)
subtest: Correct UTF-8 TEXT with only code points in range U+00 .. U+FF (Latin1
subset)
ASCII: U+000000 .. U+00007F
iso-8859-*: + U+000080 .. U+0000FF (includes cp1252)
iso-10646: + U+000100 .. U+0007FF
+ U+000800 .. U+00D7FF
+ U+00E000 .. U+00FFFF
utf-8 1): + U+010000 .. U+10FFFF
+ surrogates
+ bidirectionality
+ normalization
+ collation (order by)
1) some iso-10646 implementations already support supplementary
codepoints. Depends on the version of the standard
Regarding the Unicode subtests, I was going to respond to Pali's comment to say
that there are more important ranges. H.Merijn addressed the main points I was
going to raise; however, I propose a simpler set of tests as the main ones that
matter for verifying that data is handled without corruption.
Note, these comments apply to ALL DBI drivers, not just DBD::mysql.
There are 8 main Unicode integer codepoint ranges of interest, representable by
a signed 32-bit integer:
- negative integers                                      rejected (invalid data)
- 0       .. 0x7F     - ASCII subset                     accepted
- 0x80    .. 0xFF     - non-ASCII 8-bit subset           accepted
- 0x100   .. 0xD7FF   - middle Basic Multilingual Plane  accepted
- 0xD800  .. 0xDFFF   - UTF-16 surrogates                rejected (invalid data)
- 0xE000  .. 0xFFFF   - upper Basic Multilingual Plane   accepted
- 0x10000 .. 0x10FFFF - the 16 supplementary planes      accepted
- 0x110000 and above                                     rejected (invalid data)
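
To make the ranges above concrete, here is a minimal sketch in Python (chosen only for illustration; a real DBI driver would do this in Perl or C). The function name classify_codepoint and the label strings are hypothetical, and the upper bound is taken to be 0x10FFFF, the highest codepoint Unicode defines.

```python
def classify_codepoint(cp: int) -> str:
    """Return a label for the range that the integer codepoint falls in.

    Hypothetical helper for illustration; labels starting with "rejected"
    mark data that a driver should treat as invalid.
    """
    if cp < 0:
        return "rejected: negative integer"
    if cp <= 0x7F:
        return "accepted: ASCII subset"
    if cp <= 0xFF:
        return "accepted: non-ASCII 8-bit subset"
    if cp <= 0xD7FF:
        return "accepted: middle Basic Multilingual Plane"
    if cp <= 0xDFFF:
        return "rejected: UTF-16 surrogate"
    if cp <= 0xFFFF:
        return "accepted: upper Basic Multilingual Plane"
    if cp <= 0x10FFFF:
        return "accepted: supplementary planes"
    return "rejected: above 0x10FFFF"


def is_valid_codepoint(cp: int) -> bool:
    """True when the codepoint is in one of the accepted ranges."""
    return classify_codepoint(cp).startswith("accepted")
```

A test suite could then pick at least the first and last value of each range (for example 0xD7FF, 0xD800, 0xDFFF and 0xE000) to check the boundaries.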
I would argue strongly that a transit middleware like a DBI driver should
strictly concern itself with the Unicode codepoint level: it shuttles data back
and forth, preserving the exact valid codepoints given, and rejects invalid
codepoints on both input and output, either by raising an error or by
substituting the Unicode replacement character U+FFFD.
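
As a rough sketch of that policy, again in Python purely for illustration, the function below passes valid codepoints through unchanged and either raises an error or substitutes U+FFFD for invalid ones. The names transit and on_invalid are hypothetical and do not correspond to any DBI or DBD::mysql API.

```python
REPLACEMENT_CHARACTER = 0xFFFD  # U+FFFD


def is_valid_codepoint(cp: int) -> bool:
    """Valid means 0..0x10FFFF, excluding the UTF-16 surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def transit(codepoints, on_invalid="error"):
    """Pass codepoints through unchanged, handling invalid ones per policy.

    on_invalid="error"   raises ValueError at the first invalid codepoint.
    on_invalid="replace" substitutes U+FFFD for each invalid codepoint.
    No normalization, collation or grapheme handling is attempted.
    """
    if on_invalid not in ("error", "replace"):
        raise ValueError(f"unknown policy: {on_invalid!r}")
    result = []
    for index, cp in enumerate(codepoints):
        if is_valid_codepoint(cp):
            result.append(cp)
        elif on_invalid == "replace":
            result.append(REPLACEMENT_CHARACTER)
        else:
            raise ValueError(f"invalid codepoint {cp:#x} at index {index}")
    return result
```

The same check would apply in both directions, to data sent to the database and to data fetched from it.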
While Perl itself or MySQL itself can concern themselves with other matters such
as graphemes, normalization, collation, etc., a DBI driver should NOT.
Besides being logically correct, this means that DBI drivers need no code for
the most complicated aspects of Unicode; they can avoid 99% of the complexity.
-- Darren Duncan