On 2017-09-14 3:01 AM, H.Merijn Brand wrote:
On Thu, 14 Sep 2017 09:44:54 +0200, p...@cpan.org wrote:

BYTE/BLOB/TEXT tests require three types of data

• Pure ASCII
• Correct UTF-8 (with complex combinations)

subtest: Correct UTF-8 TEXT with only code points in range U+00 .. U+7F (ASCII subset)
subtest: Correct UTF-8 TEXT with only code points in range U+00 .. U+FF (Latin1 subset)

ASCII:            U+000000 .. U+00007F
iso-8859-*:     + U+000080 .. U+0000FF (includes cp1252)
iso-10646:      + U+000100 .. U+0007FF
                + U+000800 .. U+00D7FF
                + U+00E000 .. U+00FFFF
utf-8 1):       + U+010000 .. U+10FFFF
                + surrogates
                + bidirectionality
                + normalization
                + collation (order by)

1) some iso-10646 implementations already support supplementary
   codepoints. Depends on the version of the standard

Regarding the Unicode subtests: I was going to respond to Pali's comment to say that there are more important ranges. H.Merijn addressed the main points I was going to raise. However, I propose a simpler set of tests as the main ones that matter for verifying that data is handled without corruption.

Note, these comments apply to ALL DBI drivers, not just DBD::mysql.

There are 8 main ranges of integer code point values of interest, all representable by a signed 32-bit integer:

- negative integers   - rejected, invalid data
- 0       ..0x7F      - ASCII subset, accepted
- 0x80    ..0xFF      - non-ASCII 8-bit subset, accepted
- 0x100   ..0xD7FF    - middle Basic Multilingual Plane, accepted
- 0xD800  ..0xDFFF    - UTF-16 surrogates, rejected, invalid data
- 0xE000  ..0xFFFF    - upper Basic Multilingual Plane, accepted
- 0x10000 ..0x10FFFF  - the 16 supplementary planes, accepted
- 0x110000 and above  - rejected, invalid data
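
To make the ranges above concrete, here is a minimal sketch of the classification, written in Python only for brevity; the same comparisons apply in Perl or C. The function name `classify_codepoint`, the range labels, and the `BOUNDARIES` list are my own illustrative choices, not part of DBI or any driver. It takes 0x10FFFF as the last valid code point.

```python
def classify_codepoint(cp: int) -> tuple:
    """Return (range label, accepted?) for an integer code point value.

    Hypothetical helper for illustration; not part of any DBI driver.
    """
    if cp < 0:
        return ("negative", False)              # invalid data
    if cp <= 0x7F:
        return ("ASCII", True)
    if cp <= 0xFF:
        return ("non-ASCII 8-bit", True)
    if cp <= 0xD7FF:
        return ("middle BMP", True)
    if cp <= 0xDFFF:
        return ("UTF-16 surrogate", False)      # invalid data
    if cp <= 0xFFFF:
        return ("upper BMP", True)
    if cp <= 0x10FFFF:
        return ("supplementary planes", True)
    return ("beyond Unicode", False)            # invalid data


# First and last value of each range: natural test inputs for a driver.
BOUNDARIES = [-1, 0x0, 0x7F, 0x80, 0xFF, 0x100, 0xD7FF, 0xD800,
              0xDFFF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF, 0x110000]

if __name__ == "__main__":
    for cp in BOUNDARIES:
        label, ok = classify_codepoint(cp)
        print(f"{cp:#x}: {label}: {'accept' if ok else 'reject'}")
```

A test suite built this way only needs the boundary values of each range, plus perhaps one value from the middle of each, to check that data passes through uncorrupted.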

I would argue strongly that transit middleware like a DBI driver should concern itself strictly with the Unicode code point level. It should shuttle data back and forth, preserving the exact valid code points given, and reject invalid code points on both input and output, either by raising an error or by substituting the Unicode replacement character U+FFFD.
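
As a rough sketch of the two rejection policies just described (error or substitution), assuming the data is handled as a sequence of integer code point values: the names `transit`, `is_valid_codepoint`, and the `on_invalid` parameter are hypothetical and written in Python purely for illustration. No real driver exposes this interface.

```python
REPLACEMENT = 0xFFFD  # Unicode replacement character


def is_valid_codepoint(cp: int) -> bool:
    """True for 0..0x10FFFF excluding the UTF-16 surrogates 0xD800..0xDFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def transit(codepoints, on_invalid="error"):
    """Pass valid code points through unchanged.

    on_invalid="error"   raises ValueError on the first invalid value.
    on_invalid="replace" substitutes U+FFFD for each invalid value.
    Hypothetical helper for illustration only.
    """
    out = []
    for i, cp in enumerate(codepoints):
        if is_valid_codepoint(cp):
            out.append(cp)              # preserved exactly, no normalization
        elif on_invalid == "replace":
            out.append(REPLACEMENT)
        else:
            raise ValueError(f"invalid code point {cp:#x} at index {i}")
    return out
```

Note that nothing here looks at graphemes, normalization forms, or collation; valid input comes out identical to what went in.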

Perl itself or MySQL itself can concern themselves with other matters such as graphemes, normalization, collation, etc.; a DBI driver should NOT.

Besides being logically correct, this means that DBI drivers need no code for the most complicated aspects of Unicode and can avoid 99% of the complexity.

-- Darren Duncan
