On 08/11/2014 05:23 PM, Dominic Hargreaves wrote:
> I don't see any sign of a bug against DBD::Pg in the CPAN bugtracker,
> and Debian now has 3.3.0 in unstable and testing.

The bug isn't DBD::Pg's fault -- which is why we haven't reported
anything against it -- but rather a case of DBD::Pg becoming _more_
correct, and of lurking code in the bowels of DBIx::SearchBuilder that
was always incorrect and now interacts poorly with it. Specifically:

  https://github.com/bestpractical/dbix-searchbuilder/blob/master/lib/DBIx/SearchBuilder/Handle.pm#L577-L579

That code takes the characters that we're trying to insert into the
database and encodes them in UTF-8[1] -- and the result is then
_double_ encoded when DBD::Pg 3.3.0 realizes that the database column
is textual.

Prior to 3.3.0, DBD::Pg accepted bytes and inserted bytes, which we
would later read out as characters. Now it accepts bytes and attempts
to insert them as character codepoints, so that the data round-trips
and we get the same character codepoints out. That is more correct:
3.2.1 relied on the "UTF-8" flag to guess whether the incoming data was
codepoints or bytes, which was a false premise.

Those lines are, unfortunately, only part of the problem. Other places
in RT blindly pass bytes (not characters) to textual columns, and they
need to be resolved in order for RT to work properly with DBD::Pg. In
other words, the internals of RT are riddled with places that make the
same false assumptions about the "UTF-8" flag as DBD::Pg 3.2.1 did, and
the two sets of mistakes mostly canceled each other out.

> Could you say a bit more about the problem and what plans there are
> to fix/workaround it for RT? Forcing a lower version of DBD::Pg isn't
> a practical option in a packaged environment like Debian.

I've pushed

  https://github.com/bestpractical/rt/tree/4.0/utf8-reckoning

which addresses the deeper issues needed for RT to work. It is
currently in review, and will be merged in as short order as a branch
of that size can be. It passes all tests on both versions of DBD::Pg,
but further testing (carefully, as it might cause data corruption with
non-ASCII characters) would be appreciated.

> This is a pretty serious issue.

Fixing this is indeed high priority for us, as mostly-unrecoverable
data corruption is never a good thing. Once the branch gets merged, I
expect we'll roll release candidates in short order.

 - Alex

[1] This is a slight lie, due to perl internals. In some rare cases,
    for strings which contain only codepoints which exist in
    ISO-8859-1, it instead encodes them in ISO-8859-1 before treating
    those bytes as codepoints and double-encoding in UTF-8, for all of
    your mojibake needs. Wonderful, no?

--
RT Training - Boston, September 9-10
http://bestpractical.com/training
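
For anyone who wants to see the double encoding described above on a concrete string, here is a minimal sketch. It is written in Python rather than Perl, purely as an illustration. It is not the code from DBIx::SearchBuilder, DBD::Pg, or the utf8-reckoning branch, and every function name in it is hypothetical. It assumes only what the message states: text is encoded to UTF-8 bytes once, and a later layer treats each of those bytes as a character codepoint and encodes again.

```python
# Illustration only: hypothetical helper names, not code from RT,
# DBIx::SearchBuilder, or DBD::Pg.


def encode_once(text: str) -> bytes:
    """Encode characters to UTF-8 a single time (the intended behavior)."""
    return text.encode("utf-8")


def double_encode(text: str) -> bytes:
    """Encode to UTF-8, then treat each byte as a codepoint and encode again."""
    raw = text.encode("utf-8")
    # Decoding as Latin-1 maps byte 0xNN to codepoint U+00NN, which models
    # a layer that assumes it was handed characters rather than bytes.
    bytes_as_codepoints = raw.decode("latin-1")
    return bytes_as_codepoints.encode("utf-8")


def read_back(stored: bytes) -> str:
    """Decode stored UTF-8 bytes into characters, as a reader would."""
    return stored.decode("utf-8")


def undo_double_encoding(stored: bytes) -> str:
    """Reverse one extra round of encoding; only valid for data damaged this way."""
    return stored.decode("utf-8").encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    original = "caf\u00e9"  # "café"
    print(encode_once(original))               # single encoding
    damaged = double_encode(original)
    print(damaged)                             # two bytes became four
    print(read_back(damaged))                  # mojibake
    print(undo_double_encoding(damaged))       # original text again
```

Pure-ASCII strings pass through both paths unchanged, which is why the problem only shows up with non-ASCII data. The repair helper works only when every stored value went through exactly one extra round of encoding, and data that mixes correct and damaged rows cannot be fixed this blindly.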
