Dear Tim (and Jay - there's new info here, Jay),

Jay Konigsberg originally approached me with a problem whereby an o-umlaut
character in some data was being transformed into two bytes with
different codes.  After paring his initial 800-line reproduction down to
just 92 lines of code, I was able to remove DBD::Informix and replace it
with DBD::NullP and demonstrate that the problem appeared there, too, and
the problem seems to be in the DBI code itself.  However, it is not
completely trivial; the reproduction still requires (seems to require)
XML::Parser::PerlSAX to have handled the data first.  Simply sucking the
data in from a file and then passing it through DBI does not seem to
trigger this reaction.  The string passed as a parameter to $sth->execute()
prints the unmodified value both before and after $sth->execute(), which
really has me puzzled.  And it is not just o-umlaut that gets mapped; other
characters such as a-acute, a-grave, e-acute, e-grave, A-acute, A-grave,
E-acute, E-grave and y-umlaut also get trampled similarly.  I've diagnosed
that the problem is in DBI because when run with PERL_DBI_DEBUG=2, the
entry for '-> execute for DBD::NullP::st (...)' shows the modified string
-- the transformation is certainly happening before DBD::NullP gets to see
it (and before DBD::Informix sees it either).
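
One possible explanation for the unchanged printed value, offered as an
assumption rather than a diagnosis: if the string carries Perl's internal
UTF-8 flag, print on a filehandle with no encoding layer still writes each
character below 0x100 as a single byte, so the output looks the same either
way.  The sketch below (not part of jknullp.pl; internal_hex is a made-up
helper name) shows how to look at the flag and at the bytes Perl actually
stores, using only the core Encode module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: hex dump of the bytes Perl stores internally,
# obtained by switching the UTF-8 flag off on a copy of the string.
sub internal_hex {
    my ($string) = @_;
    my $copy = $string;
    Encode::_utf8_off($copy);
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $copy);
}

my $as_bytes = "\xF6";          # o-umlaut as a single ISO 8859-1 byte
my $as_chars = "\xF6";
utf8::upgrade($as_chars);       # same character, UTF-8 internal form

for my $value ($as_bytes, $as_chars) {
    printf "flag=%d internal=%s\n",
        (Encode::is_utf8($value) ? 1 : 0), internal_hex($value);
}
```

The two strings compare equal with eq and print identically on a plain
filehandle, yet the second is held internally as 0xC3 0xB6, which matches
what the DBI trace shows.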

Jay is using Perl 5.8.0 on AIX 4.3.3 compiled with GCC 2.7.x; I'm using
Perl 5.8.0 compiled on Solaris 7 with GCC 3.1 but now running on Solaris 8
using GCC 3.3.  Jay is using DBI 1.32; I am using DBI 1.37.  I had to force
install libxml-perl 0.07 this morning because one test failed.  I am up to
date within a day or so on almost all the modules I have installed - I did
an update with CPANPLUS this morning (DBD::ODBC and DBD::Multiplex are out
of date, though CPANPLUS says I've got D::M 0.90 installed and need to
install D::M 0.90, which has me confused).

Here's the test script - I'm not sure how much more it can be compressed.
It needs the file jknullp.xml, which contains all the accented characters I
mentioned.

Is there a possibility that the XML stuff is somehow setting up the Perl
Unicode system so that Perl thinks the characters should be recoded from
ISO 8859-1 (as explicitly stated in the XML file) and is UTF-8 encoding
them?  Let's see: the input character codes are:

Name        8859-1      DBI trace         UTF-8
o-umlaut    0xF6        0xC3 0xB6         0xC3 0xB6
a-grave     0xE0        0xC3 0xA0         0xC3 0xA0
a-acute     0xE1        0xC3 0xA1         0xC3 0xA1
A-grave     0xC0        0xC3 0x2E *       0xC3 0x80
A-acute     0xC1        0xC3 0x2E *       0xC3 0x81
E-grave     0xC8        0xC3 0x2E *       0xC3 0x88
E-acute     0xC9        0xC3 0x89         0xC3 0x89
e-grave     0xE8        0xC3 0xA8         0xC3 0xA8
e-acute     0xE9        0xC3 0xA9         0xC3 0xA9
y-umlaut    0xFF        0xC3 0xBF         0xC3 0xBF

Except for the three starred characters, the DBI trace is showing a valid
mapping from ISO 8859-1 to UTF-8.  The three starred characters are invalid
UTF-8 sequences; the second byte should start with bits 10 to be valid.
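
The UTF-8 column in the table can be checked with the core Encode module.
This is only a minimal sketch for verifying the mapping; it is not part of
the attached script, and utf8_hex is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 bytes for one ISO 8859-1 code point,
# formatted like the table above, e.g. "0xC3 0xB6".
sub utf8_hex {
    my ($latin1_code) = @_;
    my $bytes = encode('UTF-8', chr($latin1_code));
    return join ' ', map { sprintf('0x%02X', ord($_)) } split(//, $bytes);
}

my @chars = (
    [ 'o-umlaut', 0xF6 ], [ 'a-grave', 0xE0 ], [ 'a-acute', 0xE1 ],
    [ 'A-grave',  0xC0 ], [ 'A-acute', 0xC1 ], [ 'E-grave', 0xC8 ],
    [ 'E-acute',  0xC9 ], [ 'e-grave', 0xE8 ], [ 'e-acute', 0xE9 ],
    [ 'y-umlaut', 0xFF ],
);

# Print name, ISO 8859-1 code and the UTF-8 byte pair for each character.
for my $entry (@chars) {
    my ($name, $code) = @$entry;
    printf "%-11s 0x%02X        %s\n", $name, $code, utf8_hex($code);
}
```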

Any ideas on how to prevent this transformation from occurring?  Is
reversion to Perl 5.6.1 the answer?  (Ugh if it is).  Or will 5.8.1 fix
this?  Or is it something that should not be fixed?  But then how does a
person parsing XML deal with this?  Or is it a property of the particular
XML parser that Jay is using?
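
One workaround that might be worth trying, sketched under the assumption
that the parser hands back strings with Perl's internal UTF-8 flag set, is
to encode the value back to ISO 8859-1 bytes before it goes anywhere near
$sth->execute().  I have not tried this against jknullp.pl, and the helper
name to_latin1_bytes is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# Hypothetical helper: return the value as plain ISO 8859-1 bytes.
# Strings without the UTF-8 flag are passed through untouched.
sub to_latin1_bytes {
    my ($text) = @_;
    return $text unless is_utf8($text);
    return encode('iso-8859-1', $text);
}

# Simulate a parser-supplied string: Latin-1 characters, flag switched on.
my $from_parser = "K\xF6nig";
utf8::upgrade($from_parser);

my $param = to_latin1_bytes($from_parser);
printf "flag before=%d after=%d length=%d\n",
    (is_utf8($from_parser) ? 1 : 0),
    (is_utf8($param)       ? 1 : 0),
    length($param);
```

The call in the script would then look like
$sth->execute(to_latin1_bytes($value)).  Note that encode() with its
default settings substitutes a replacement for any character that has no
ISO 8859-1 equivalent, so this only suits data known to fit in that
character set.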

HELP!!!

(See attached file: jknullp.tgz)

The tar file contains jknullp.pl (the Perl script), jknullp.trace (the
output from running jknullp.pl on Solaris 8), and jknullp.xml (the XML
source with accented characters in ISO 8859-1, as noted in the XML encoding
information).  They all unpack into the current directory.

--
Jonathan Leffler ([EMAIL PROTECTED])
STSM, Informix Database Engineering, IBM Data Management
4100 Bohannon Drive, Menlo Park, CA 94025
Tel: +1 650-926-6921   Tie-Line: 630-6921
      "I don't suffer from insanity; I enjoy every minute of it!"
