Re: DBD::mysql path forward

Darren Duncan Thu, 14 Sep 2017 11:16:52 -0700

On 2017-09-14 3:01 AM, H.Merijn Brand wrote:

On Thu, 14 Sep 2017 09:44:54 +0200, p...@cpan.org wrote:

BYTE/BLOB/TEXT tests require three types of data

• Pure ASCII
• Correct UTF-8 (with complex combinations)


subtest: Correct UTF-8 TEXT with only code points in range U+00 .. U+7F (ASCII subset)
subtest: Correct UTF-8 TEXT with only code points in range U+00 .. U+FF (Latin1 subset)


ASCII:            U+000000 .. U+00007F
iso-8859-*:     + U+000080 .. U+0000FF (includes cp1252)
iso-10646:      + U+000100 .. U+0007FF
                + U+000800 .. U+00D7FF
                + U+00E000 .. U+00FFFF
utf-8 1):       + U+010000 .. U+10FFFF
                + surrogates
                + bidirectionality
                + normalization
                + collation (order by)

1) some iso-10646 implementations already support supplementary
   codepoints. Depends on the version of the standard

Regarding Unicode subtests, I was going to respond to Pali's comment to say that there are more important ranges. H.Merijn addressed the main points I was going to raise; however, I propose a simpler set of tests as the main ones of importance for data being handled without corruption.


Note, these comments apply to ALL DBI drivers, not just DBD::mysql.

There are 8 main Unicode integer codepoint ranges of interest, representable by a signed 32-bit integer:


- negative integers  - rejected (invalid data)
- 0      ..0x7F      - ASCII subset: accepted
- 0x80   ..0xFF      - non-ASCII 8-bit subset: accepted
- 0x100  ..0xD7FF    - middle Basic Multilingual Plane: accepted
- 0xD800 ..0xDFFF    - UTF-16 surrogates: rejected (invalid data)
- 0xE000 ..0xFFFF    - upper Basic Multilingual Plane: accepted
- 0x10000..0x10FFFF  - the 16 supplementary planes: accepted
- 0x110000 and above - rejected (invalid data)
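
As a rough illustration of the ranges listed above, here is a minimal sketch of a classifier. It is written in Python purely for brevity and is not taken from any DBI driver; the function names and range labels are made up for this example, and it assumes 0x10FFFF is the highest valid codepoint.

```python
def classify_codepoint(cp: int) -> str:
    """Return a label for the range an integer codepoint falls in.

    The labels are hypothetical names chosen for this sketch.
    """
    if cp < 0:
        return "invalid: negative"
    if cp <= 0x7F:
        return "ASCII"
    if cp <= 0xFF:
        return "non-ASCII 8-bit"
    if cp <= 0xD7FF:
        return "middle BMP"
    if cp <= 0xDFFF:
        return "invalid: UTF-16 surrogate"
    if cp <= 0xFFFF:
        return "upper BMP"
    if cp <= 0x10FFFF:
        return "supplementary planes"
    return "invalid: above 0x10FFFF"


def is_valid_codepoint(cp: int) -> bool:
    """True when the codepoint is in one of the accepted ranges."""
    return not classify_codepoint(cp).startswith("invalid")
```

A test suite along these lines would want at least one sample from each range, plus the boundary values on either side of every cutoff.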

I would argue strongly that transit middleware like a DBI driver should concern itself strictly with the Unicode codepoint level: it should shuttle data back and forth, preserving the exact valid codepoints given, while rejecting invalid codepoints on both input and output, either with an error or by substituting the Unicode replacement character 0xFFFD.
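
To make the two rejection strategies concrete, here is a minimal sketch, again in Python for brevity rather than as real driver code. The names `pass_through` and `on_invalid` are hypothetical, and the sketch assumes the data has already been decoded to a sequence of integer codepoints.

```python
REPLACEMENT = 0xFFFD  # Unicode replacement character


def is_valid(cp: int) -> bool:
    """Valid means 0..0x10FFFF, excluding the UTF-16 surrogate range."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def pass_through(codepoints, on_invalid="error"):
    """Return the codepoints unchanged, handling invalid ones.

    on_invalid="error"   raises ValueError on the first invalid codepoint
    on_invalid="replace" substitutes 0xFFFD for each invalid codepoint
    """
    out = []
    for i, cp in enumerate(codepoints):
        if is_valid(cp):
            out.append(cp)  # valid codepoints are preserved exactly
        elif on_invalid == "replace":
            out.append(REPLACEMENT)
        else:
            raise ValueError(f"invalid codepoint {cp:#x} at index {i}")
    return out
```

Note that nothing here normalizes, reorders, or otherwise interprets the valid codepoints; they come out exactly as they went in.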

While Perl itself or MySQL itself can concern themselves with other matters such as graphemes/normalization/collation/etc, a DBI driver should NOT.

Besides being logically correct, this means that DBI drivers can avoid needing code for the most complicated aspects of Unicode; they can avoid 99% of the complexity.


-- Darren Duncan
