On 19 September 2017 at 14:46, Night Light <nightligh...@gmail.com> wrote: > Dear Perl gurus, > > This is my first post. I'm using Perl with great joy, and I'd like to > express my gratitude for all you are doing to keep Perl stable and fun to > use. > > I'd like to ask to object to re-releasing this version and discuss on how to > make 4.043 backwards compatible instead. > This change will with 100% certainty corrupt all BLOB data written to the > database when the developer did not read the release notes before applying > the latest version of DBD::mysql (and changed its code consequently). > Knowing that sysadmins have the habit of not always reading the release > notes of each updated package the likelihood that this will happen will > therefore high. > I myself wasn't even shown the release notes as it was a dependency of an > updated package that I applied. > The exposure of this change is big as DBD::mysql affects multiple > applications and many user bases. > I believe deliberately introducing industry wide database corruption is > something that will significantly harm peoples confidence in using Perl. > I believe that not providing backwards compatibility is not in line with the > Perl policy that has been carefully put together by the community to > maintain the quality of Perl as it is today. > http://perldoc.perl.org/perlpolicy.html#BACKWARD-COMPATIBILITY-AND-DEPRECATION > > I therefore believe the only solution is an upgrade that is by default > backwards compatible, and where it is the user who decides when to start > UTF8 encode the input values of a SQL request instead. > If it is too time consuming or too difficult it should be considered to park > the UTF8-encoding "fix" and release a version with the security fix first. > > I have the following objections against this release: > > 1. the upgrade will corrupt more records than it fixes (it does more harm > than good) > 2. the reason given for not providing backward compatibility ("because it > was hard to implement") is not plausible given the level of unwanted side > effects. > This especially knowing that there is already a mechanism in place to > signal if its wants UTF8 encoding or not > (mysql_enable_utf8/mysql_enable_utf8mb4). > 3. it costs more resources to coordinate/discuss a "way forward" or options > than to implement a solution that addresses backwards compatibility > 4. it is unreasonable to ask for changing existing source knowing that > depending modules may not be actively maintained or proprietary > It can be argued that such module should always be maintained but it does > not change the fact that a good running Perl program becomes unusable > 5. it does not inform the user that after upgrading existing code will start > write corrupt BLOB records > 6. it does not inform the user about the fact that a code review of all > existing code is necessary, and how it needs to be changed and tested > 7. it does not give the user the option to decide how the BLOB's should be > stored/encoded (opt in) > 8. it does not provide backwards compatibility > By doing so it does not respect the Perl policy that has been carefully > put together by the community to maintain the quality of Perl as it is > today. > > http://perldoc.perl.org/perlpolicy.html#BACKWARD-COMPATIBILITY-AND-DEPRECATION > 9. it blocks users from using DBD::mysql upgrades as long as they have not > rewritten their existing code > 10. not all users from DBD::mysql can be warned beforehand about the side > effects as it is not known which private parties have code that use > DBD::mysql > 12. I believe development will go faster when support for backwards > compatibility is addressed > 13. having to write 1 extra line for each SQL query value is a monks job > that will make the module less attractive to use > > About forking to DBD::mariadb?: > The primary reason to create such a module is when the communication > protocol of Mariadb has become incompatible with Mysql. > To use this namespace to fix a bug in DBD::mysql does not meet that criteria > and causes confusion for developers and unnecessary pollution of the DBD > namespace. > > --- > > For people that do not know the impact of the change that is pending to be > committed: > (see Github issue that includes 3 reports of companies that suffered data > loss https://github.com/perl5-dbi/DBD-mysql/issues/117 ) > > Issue: some UTF8 characters are not properly displayed after retrieval > Cause: SQL query values are not UTF8 encoded when sent to the database but > they are all decoded once retrieved. > Occurence: Only records with string data that can only be written with UTF8. > It can be considered rare as people haven't reported this issue after 10 > years of usage. > Regional impact: Only affects countries which characters need UTF8 encoding > and only affects string values. > Steps to recover from it: Read string data unencoded and write it encoded. > > Changes of upgrade pending to be re-released: > SQL query values are both UTF8 encoded when sent to the database as when its > retrieved (including BLOB fields). > BLOB fields will be excluded from encoding only if you specify its data > type. > > Side effects from installing upgrade: > - BLOB data will be written after UTF8 encoding and will therefore be > corrupt > - no possibility to detect if a BLOB field is corrupt or not. Only when > known when the INSERT/UPDATE took place, and when the upgrade was installed
If the "corruption" you describe is the same as multiple-utf8 encoding, then one can quite reliably iteratively decode the buffer until it contains a sequence that is not valid utf8, at which point you know you are done, either ending up with a utf8 encoded buffer that contains codepoints that cannot be downgraded, or by ending up with a buffer containing latin-1/binary. This is a function I wrote years ago which can fix "corrupted" utf8 data that derives from multiple utf8 encoding. This code includes a couple of workarounds for issues in older Perl's and older Encodes, but should work fine on a modern stack. use constant LATIN1 => "iso-8859-1"; # special sequence that must be 1 byte longer than max length of a utf8 encoded character # under sane operating conditions 5 should be sufficient. our $_tail; BEGIN {$_tail = "\1\1\1\1\1";} #------------------------------------------------------------------------------- # recurse_decode_value($str,$to_latin1) # # Recursively decodes a string to its minimal form. If $to_latin1 is set then # returns the string in latin-1 if it possible to do so. If $to_latin1 is not set # or the result contains high codepoints that can't be represented in latin1 then # the results will be unicode. # # IN: 1 string to decode # 2 bool indicating whether to return latin1 if possible # OUT: 1 string # sub recurse_decode_value { my $s= shift; return if ref $s or not defined $s; my $to_latin= shift; Encode::_utf8_off($s); # work around a bug in Encode - if the last item is a valid start sequence byte # it silently swallows it when it should be trated as its latin 1 equivalent. # Specifically try: "ba\x{df}" versuse "ba\x{df}\x{df}". # So we add on trailing bytes that can not be a valid start/interim byte, and then afterwards # pull it off. - Yves $s .= $_tail; # do NOT use "\0" here! Our perl doesnt chop strings ending in a utf8 null correctly. # remember the old string my $old_s= ""; while (length($s) != length($old_s)) { $old_s= $s; eval { my $octets= $s; $s= decode_utf8( $octets, Encode::FB_CROAK ); $s= $old_s if substr($s, -1, 1) ne "\1"; 1; } or do { last; } unless utf8::is_utf8($s); eval { my $utf8= $s; $s= encode( LATIN1, $utf8, Encode::FB_CROAK); 1; } or do { last; }; } $s=~s/\Q$_tail\E\z//o or warn "Possibly corrupted string decode: $s\n"; utf8::upgrade($s) unless $to_latin; return $s; } #recurse_decode_value -- perl -Mre=debug -e "/just|another|perl|hacker/"