Hi,

(if you would rather read a website than plain text, this is also on my
site: 
https://huntingbears.nl/2016/12/14/dbdmysql-all-your-utf-8-bugs-are-belong-to-us/)

After a couple of years of more or less “maintenance mode” on
DBD::mysql (we had a handful of people contributing occasional
fixes and a whole slew of drive-by contributors), we now have a
prolific contributor again: Pali Rohár.

It’s great to see some more long-standing issues taken care of!

This time around, in the new development release 4.041_01 that is on
CPAN now (https://metacpan.org/release/MICHIELB/DBD-mysql-4.041_01),
there are some important fixes for some Unicode-related issues that I
would like to point out. I have distilled the sections below from
Pali’s descriptions.


Automatically converting to UTF-8 for bind parameters
------------------------------------------------------------
Before this release, Perl scalars (statements or bind parameters)
without the UTF8 status flag were not encoded to UTF-8, even if
mysql_enable_utf8 was enabled. As a result, Perl scalars with an
internal Latin1 encoding were sent to the MySQL server as Latin1.

Now all statements and bind parameters which are not a DBI binary type
(SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY or SQL_LONGVARBINARY)
are automatically encoded to UTF-8 when mysql_enable_utf8 is enabled.
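
To make the difference concrete at the byte level, here is a small
sketch. It is written in Python only because byte strings are easy to
inspect there; it is not DBD::mysql code, and the helper name
bytes_on_the_wire is hypothetical. It assumes a string whose
characters all fit in Latin1.

```python
# Illustration only: what "sent as Latin1" versus "encoded to UTF-8"
# means for the bytes that reach the server. Not DBD::mysql code.

def bytes_on_the_wire(text: str, encode_utf8: bool) -> bytes:
    """Hypothetical helper for a string that fits in Latin1."""
    if encode_utf8:
        # New behaviour with mysql_enable_utf8: always UTF-8.
        return text.encode("utf-8")
    # Old behaviour for unflagged scalars: raw Latin1 bytes.
    return text.encode("latin-1")

old = bytes_on_the_wire("caf\u00e9", encode_utf8=False)
new = bytes_on_the_wire("caf\u00e9", encode_utf8=True)
print(old)  # b'caf\xe9'
print(new)  # b'caf\xc3\xa9'
```

The single byte 0xE9 is what a UTF-8 connection would previously have
received for "é"; the two-byte sequence 0xC3 0xA9 is its UTF-8 form.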

If mysql_enable_utf8 is not enabled and your statement or bind
parameter contains a wide Unicode character, DBD::mysql shows a
warning. It also shows a warning if a binary parameter contains a wide
Unicode character, much like print does when used without a :utf8
PerlIO layer (“Wide character in…”).
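
As a rough sketch of what “wide” means here, assuming it has the same
meaning as in Perl’s own “Wide character” warning (a code point above
0xFF, which cannot be stored in one Latin1 byte). The function name is
hypothetical and this is Python for illustration, not the check that
DBD::mysql performs internally.

```python
# Illustration only: a character is "wide" when its code point does
# not fit in a single byte (greater than 0xFF).

def has_wide_character(text: str) -> bool:
    """Hypothetical helper: True if any character is above U+00FF."""
    return any(ord(ch) > 0xFF for ch in text)

latin1_only = has_wide_character("caf\u00e9")  # é is U+00E9
with_euro = has_wide_character("\u20ac 5")     # € is U+20AC
print(latin1_only)  # False
print(with_euro)    # True
```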

Internally, Perl’s SvPV() returns a char* for a Perl scalar, and a
following SvUTF8() call on that scalar tells whether SvPV() returned
the data in UTF-8 (true) or in Latin1 (false).


Decoding of UTF-8 fields when mysql_enable_utf8 is enabled
------------------------------------------------------------
For each fetched field, the MySQL server tells us its charset id.
Before this release, when mysql_enable_utf8 was enabled, DBD::mysql
UTF-8 decoded all fields with a charset id other than 63 (which means
binary).

Now DBD::mysql UTF-8 decodes only those fields whose charset is utf8
or utf8mb4. By default the MySQL server sends data in the encoding
specified by the SET NAMES command, which defaults to Latin1, so any
received Latin1 data is no longer UTF-8 decoded.
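
Why the field’s charset has to decide this can be shown with two
bytes. The sketch below is plain Python for illustration, not
DBD::mysql code: the same byte sequence is valid in both encodings
but means different text.

```python
# Illustration only: identical bytes, two different interpretations.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character: U+00E9 (é)
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3 U+00A9 (Ã©)

print(len(as_utf8))    # 1
print(len(as_latin1))  # 2
```

Decoding a Latin1 field as UTF-8 would therefore silently change the
text, which is what restricting the decoding to utf8 and utf8mb4
fields avoids.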

The MySQL server sends a charset id, not a charset name. Each pair of
charset name and collation has its own charset id. A new function,
charsetnr_is_utf8(), hardcodes all utf8 and utf8mb4 charset ids taken
from the MySQL (up to 8.0.0) and MariaDB (up to 10.2.2) source code.
So far it looks like those ids have not changed since MySQL 5.0; only
new ones have been added.
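
The old and new rules can be compared in a minimal sketch. This is
Python for illustration and not the actual C function; the names are
hypothetical, and the id set is a small subset that I believe matches
what INFORMATION_SCHEMA.COLLATIONS reports, so treat the numbers
(other than 63, mentioned above) as assumptions and check them against
your own server.

```python
# Illustration only: decide per field whether to UTF-8 decode it.
BINARY_CHARSET_ID = 63  # "binary"

# Assumed, deliberately incomplete subset of utf8/utf8mb4 ids.
UTF8_CHARSET_IDS = {
    33,  # utf8_general_ci
    83,  # utf8_bin
    45,  # utf8mb4_general_ci
    46,  # utf8mb4_bin
}

LATIN1_SWEDISH_CI = 8  # assumed id of the default Latin1 collation

def old_rule_decodes(charset_id: int) -> bool:
    """Before: decode everything that is not binary."""
    return charset_id != BINARY_CHARSET_ID

def new_rule_decodes(charset_id: int) -> bool:
    """Now: decode only known utf8/utf8mb4 charset ids."""
    return charset_id in UTF8_CHARSET_IDS

for cid in (LATIN1_SWEDISH_CI, 45, BINARY_CHARSET_ID):
    print(cid, old_rule_decodes(cid), new_rule_decodes(cid))
```

The only row where the two rules disagree is the Latin1 one: it was
decoded before and is left alone now.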

Conclusion
---------------
We hope these changes make DBD::mysql a lot more consistent for you.
Since the changes are rather big, we’d urge you to test the
development release 4.041_01 which is on CPAN and give feedback NOW;
this allows us to make changes if needed before we create an actual
stable release with these features.

And of course, if you test it with your software and all is good, we’d
like to hear that as well!

You can leave your feedback via the DBI-users mailing list
(http://lists.perl.org/list/dbi-users.html), or using our GitHub page
(https://github.com/perl5-dbi/DBD-mysql/).

Kindest regards,

Michiel
