The corruption is caused by the fact that the data type of a BLOB can not
be recognized and is therefore UTF8 encoded like a string before being sent
to MySQL.
I'd like to propose the introduction of an attribute to ensure correct UTF8
encoding while maintaining backwards compatibility.
Manual page text:


mysql_encode_utf8_input
Prior 4.044 versions did not UTF8 encode the SQL statement and its input
values before being sent to MySQL but did decode the returned output. This
caused inconsistency in certain occasions.
Setting this attribute will correct this but doing so makes it necessary to
specify the data type of BLOB parameters to prevent these from being
encoded as well.
The default value is 0 to ensure backwards compatibility. Setting of this
flag is recommended for correct UTF8 handling and for easier migrations to
other database backends (DBD's).

Example with mysql_encode_utf8_input set to 0:
$dbh = DBI->connect($dsn,$user,$password, {'RaiseError' => 1,
mysql_enable_utf8mb4 => 1});
$dbh->do("INSERT INTO authors (id,blob_column,first_name) VALUES (?,?,?)",
undef, 2,$content,"Jochen");

Example with mysql_encode_utf8_input set to 1 (recommended):
$dbh = DBI->connect($dsn,$user,$password, {'RaiseError' => 1,
mysql_enable_utf8mb4 => 1 ,mysql_encode_utf8_input => 1});
$sth = $dbh->prepare("INSERT INTO authors (id,blob_column,first_name)
VALUES (?,?,?)");
$sth->bind_param(1,123,SQL_BIGINT);
$sth->bind_param(2,$content,SQL_BLOB);
$sth->bind_param(3,$first_name,SQL_VARCHAR);
$sth->execute();


On Tue, Sep 26, 2017 at 11:55 AM, demerphq <demer...@gmail.com> wrote:

> On 19 September 2017 at 14:46, Night Light <nightligh...@gmail.com> wrote:
> > Dear Perl gurus,
> >
> > This is my first post. I'm using Perl with great joy, and I'd like to
> > express my gratitude for all you are doing to keep Perl stable and fun to
> > use.
> >
> > I'd like to ask to object to re-releasing this version and discuss on
> how to
> > make 4.043 backwards compatible instead.
> > This change will with 100% certainty corrupt all BLOB data written to the
> > database when the developer did not read the release notes before
> applying
> > the latest version of DBD::mysql (and changed its code consequently).
> > Knowing that sysadmins have the habit of not always reading the release
> > notes of each updated package the likelihood that this will happen will
> > therefore high.
> > I myself wasn't even shown the release notes as it was a dependency of an
> > updated package that I applied.
> > The exposure of this change is big as DBD::mysql affects multiple
> > applications and many user bases.
> > I believe deliberately introducing industry wide database corruption is
> > something that will significantly harm peoples confidence in using Perl.
> > I believe that not providing backwards compatibility is not in line with
> the
> > Perl policy that has been carefully put together by the community to
> > maintain the quality of Perl as it is today.
> > http://perldoc.perl.org/perlpolicy.html#BACKWARD-
> COMPATIBILITY-AND-DEPRECATION
> >
> > I therefore believe the only solution is an upgrade that is by default
> > backwards compatible, and where it is the user who decides when to start
> > UTF8 encode the input values of a SQL request instead.
> > If it is too time consuming or too difficult it should be considered to
> park
> > the UTF8-encoding "fix" and release a version with the security fix
> first.
> >
> > I have the following objections against this release:
> >
> > 1. the upgrade will corrupt more records than it fixes (it does more harm
> > than good)
> > 2. the reason given for not providing backward compatibility ("because it
> > was hard to implement") is not plausible given the level of unwanted side
> > effects.
> >    This especially knowing that there is already a mechanism in place to
> > signal if its wants UTF8 encoding or not
> > (mysql_enable_utf8/mysql_enable_utf8mb4).
> > 3. it costs more resources to coordinate/discuss a "way forward" or
> options
> > than to implement a solution that addresses backwards compatibility
> > 4. it is unreasonable to ask for changing existing source knowing that
> > depending modules may not be actively maintained or proprietary
> >    It can be argued that such module should always be maintained but it
> does
> > not change the fact that a good running Perl program becomes unusable
> > 5. it does not inform the user that after upgrading existing code will
> start
> > write corrupt BLOB records
> > 6. it does not inform the user about the fact that a code review of all
> > existing code is necessary, and how it needs to be changed and tested
> > 7. it does not give the user the option to decide how the BLOB's should
> be
> > stored/encoded (opt in)
> > 8. it does not provide backwards compatibility
> >    By doing so it does not respect the Perl policy that has been
> carefully
> > put together by the community to maintain the quality of Perl as it is
> > today.
> >
> > http://perldoc.perl.org/perlpolicy.html#BACKWARD-
> COMPATIBILITY-AND-DEPRECATION
> > 9. it blocks users from using DBD::mysql upgrades as long as they have
> not
> > rewritten their existing code
> > 10. not all users from DBD::mysql can be warned beforehand about the side
> > effects as it is not known which private parties have code that use
> > DBD::mysql
> > 12. I believe development will go faster when support for backwards
> > compatibility is addressed
> > 13. having to write 1 extra line for each SQL query value is a monks job
> > that will make the module less attractive to use
> >
> > About forking to DBD::mariadb?:
> > The primary reason to create such a module is when the communication
> > protocol of Mariadb has become incompatible with Mysql.
> > To use this namespace to fix a bug in DBD::mysql does not meet that
> criteria
> > and causes confusion for developers and unnecessary pollution of the DBD
> > namespace.
> >
> > ---
> >
> > For people that do not know the impact of the change that is pending to
> be
> > committed:
> > (see Github issue that includes 3 reports of companies that suffered data
> > loss https://github.com/perl5-dbi/DBD-mysql/issues/117 )
> >
> > Issue: some UTF8 characters are not properly displayed after retrieval
> > Cause: SQL query values are not UTF8 encoded when sent to the database
> but
> > they are all decoded once retrieved.
> > Occurence: Only records with string data that can only be written with
> UTF8.
> > It can be considered rare as people haven't reported this issue after 10
> > years of usage.
> > Regional impact: Only affects countries which characters need UTF8
> encoding
> > and only affects string values.
> > Steps to recover from it: Read string data unencoded and write it
> encoded.
> >
> > Changes of upgrade pending to be re-released:
> > SQL query values are both UTF8 encoded when sent to the database as when
> its
> > retrieved (including BLOB fields).
> > BLOB fields will be excluded from encoding only if you specify its data
> > type.
> >
> > Side effects from installing upgrade:
> > - BLOB data will be written after UTF8 encoding and will therefore be
> > corrupt
> > - no possibility to detect if a BLOB field is corrupt or not. Only when
> > known when the INSERT/UPDATE took place, and when the upgrade was
> installed
>
> If the "corruption" you describe is the same as multiple-utf8
> encoding, then one can quite reliably iteratively decode the buffer
> until it contains a sequence that is not valid utf8, at which point
> you know you are done, either ending up with a utf8 encoded buffer
> that contains codepoints that cannot be downgraded, or by ending up
> with a buffer containing latin-1/binary.
>
> This is a function I wrote years ago which can fix "corrupted" utf8
> data that derives from multiple utf8 encoding. This code includes a
> couple of workarounds for issues in older Perl's and older Encodes,
> but should work fine on a modern stack.
>
> use constant LATIN1 => "iso-8859-1";
> # special sequence that must be 1 byte longer than max length of a
> utf8 encoded character
> # under sane operating conditions 5 should be sufficient.
> our $_tail; BEGIN {$_tail = "\1\1\1\1\1";}
> #-----------------------------------------------------------
> --------------------
> # recurse_decode_value($str,$to_latin1)
> #
> # Recursively decodes a string to its minimal form. If $to_latin1 is set
> then
> # returns the string in latin-1 if it possible to do so. If $to_latin1
> is not set
> # or the result contains high codepoints that can't be represented in
> latin1 then
> # the results will be unicode.
> #
> #  IN: 1 string to decode
> #      2 bool indicating whether to return latin1 if possible
> # OUT: 1 string
> #
> sub recurse_decode_value {
>     my $s= shift;
>
>     return if ref $s or not defined $s;
>     my $to_latin= shift;
>     Encode::_utf8_off($s);
>     # work around a bug in Encode - if the last item is a valid start
> sequence byte
>     # it silently swallows it when it should be trated as its latin 1
> equivalent.
>     # Specifically try: "ba\x{df}" versuse "ba\x{df}\x{df}".
>     # So we add on trailing bytes that can not be a valid
> start/interim byte, and then afterwards
>     # pull it off. - Yves
>     $s .= $_tail;   # do NOT use "\0" here! Our perl doesnt chop
> strings ending in a utf8 null correctly.
>
>     # remember the old string
>     my $old_s= "";
>     while (length($s) != length($old_s)) {
>         $old_s= $s;
>         eval {
>             my $octets= $s;
>             $s= decode_utf8( $octets, Encode::FB_CROAK );
>             $s= $old_s if substr($s, -1, 1) ne "\1";
>             1;
>         } or do {
>             last;
>         } unless utf8::is_utf8($s);
>         eval {
>             my $utf8= $s;
>             $s= encode( LATIN1, $utf8, Encode::FB_CROAK);
>             1;
>         } or do {
>             last;
>         };
>     }
>     $s=~s/\Q$_tail\E\z//o or warn "Possibly corrupted string decode: $s\n";
>     utf8::upgrade($s) unless $to_latin;
>
>     return $s;
> }  #recurse_decode_value
>
>
> --
> perl -Mre=debug -e "/just|another|perl|hacker/"
>

Reply via email to