Re: Character encodings and databases

William Blunn Thu, 19 Jun 2014 12:08:53 -0700

On 19/06/2014 15:58, Andrew Hill wrote:

My code has an extremely annoying bug that I can't quite solve.
The concept is simple - read some text from a text file; update a database table based on that text.
The text file is UTF8 and the database is Oracle 11g.

I am reading the file with a normal
open FILE, "<blah";
while(<FILE>) {
    chomp;
    $foo = $_;
}


This code is likely to lead to problems.

If you open a UTF-8 file without specifying a layer, then when you read the file you will get downgraded strings containing the UTF-8 bytes, whereas what you want is upgraded strings containing the characters.

So in the case of a file containing a UTF-8 representation of "Zürich", $foo will be left with "Z\xC3\xBCrich" (and it will be a downgraded string, i.e. with Perl's internal UTF-8 flag off).
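To see this for yourself, here is a minimal sketch. It assumes no PERL_UNICODE setting or "use open" pragma is changing the default layers, and it writes its own scratch file (the temp file and variable names are only for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for "Zürich" to a scratch file (hypothetical test data).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "Z\xC3\xBCrich\n";
close $out or die "Can't close $path: $!";

# Read it back with no layer, as in the original code.
open my $in, '<', $path or die "Can't open $path: $!";
my $foo = <$in>;
chomp $foo;
close $in;

printf "length: %d\n", length $foo;                              # 7 bytes, not 6 characters
printf "UTF-8 flag: %s\n", utf8::is_utf8($foo) ? 'on' : 'off';   # off
```

The length of 7 is the giveaway: the two bytes \xC3 and \xBC are being counted separately instead of as the single character "ü".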


Consider instead doing something like this:

my $filename = 'blah';

open my $file, '<:encoding(UTF-8)', $filename or die "Can't open $filename: $!";

In the case of a file containing a UTF-8 representation of "Zürich", the resulting string will be "Z\xFCrich" (and it will be an upgraded string, i.e. with Perl's internal UTF-8 flag on), which is what you want.
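A minimal sketch of the decoding read, again using a scratch file so it can be run on its own (the temp file and variable names are only for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file holding the UTF-8 bytes for "Zürich" (hypothetical test data).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "Z\xC3\xBCrich\n";
close $out or die "Can't close $path: $!";

# The :encoding(UTF-8) layer decodes bytes into characters as they are read.
open my $in, '<:encoding(UTF-8)', $path or die "Can't open $path: $!";
my $foo = <$in>;
chomp $foo;
close $in;

printf "length: %d\n", length $foo;    # 6 characters
print  "matches Z\\xFCrich: ", ($foo eq "Z\xFCrich" ? 'yes' : 'no'), "\n";
```

Here the string holds six characters, with "ü" as the single character \xFC.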

Then I select the VARCHAR2 field from the table into $bar, do a straight string comparison between $foo and $bar, and if they are different, I update the table with the value of $foo and output a debugging line to say that, for example, Z<splodge>rich has been updated to Zürich.

It looks like you are using Oracle. (Probably best to state that clearly in questions like this.)

DBD::Oracle does "interesting" things when you try to send downgraded strings to the database. Empirically, it seems to treat the downgraded string as a UTF-8 byte sequence. So in your case this will mean you accidentally end up writing the "right" thing to the database.

However, the next time I read Zürich from the file, I get exactly the same behaviour, ie $bar is again Z<splodge>rich, therefore $foo ne $bar and it updates the table again. I don't understand why $foo ne $bar, given I've just set the field to $foo.

When you read it back, you will get a string "Z\xFCrich", which will compare as different from "Z\xC3\xBCrich", and from what you describe your program will update the database again.
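The mismatch can be reproduced without a database. This is only a sketch: the two literals stand in for the value read from the file and the value read back from Oracle, and Encode::decode is used here as one way to turn the bytes into characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $from_file = "Z\xC3\xBCrich";   # bytes, as read without a layer
my $from_db   = "Z\xFCrich";       # characters, as described for the database read

print "raw compare: ", ($from_file eq $from_db ? 'equal' : 'different'), "\n";

# Decoding turns the UTF-8 bytes into the characters they represent.
my $decoded = decode('UTF-8', $from_file);
print "decoded compare: ", ($decoded eq $from_db ? 'equal' : 'different'), "\n";
```

The first comparison reports "different" and the second "equal", which is the loop you are seeing and the reason decoding at the file boundary should stop it.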

You may also be getting strange behaviour with your debug output, and the pathway between your data and your monitor may not be treating UTF-8 in a way which is consistently useful. Consider passing your debug output through Data::Dump::pp so that you can properly see what's going on.

So, as I see it, these are the possible causes:
1. Data is not being stored in the database as UTF8 - not sure how to check when Perl is the only tool available to query it

It is and it isn't. If you want to see what's really in your strings then you can use:


use feature 'say';
use Data::Dump 'pp';

say '$x contains ', pp($x);
say '$x is ', utf8::is_utf8($x) ? 'upgraded' : 'downgraded';

2. Conversion is occurring in the DBD driver
3. Something else because I've been staring at it for so long
FWIW, NLS_CHARACTERSET is AL32UTF8 and $ENV{NLS_LANG} is AMERICAN_AMERICA.AL32UTF8

Brilliant! You appear to have set the correct options for getting DBD::Oracle to do Unicode (reasonably) properly. That is actually the hard part :-)

Though you need to ensure that $ENV{NLS_LANG} is set to a suitable AL32UTF8 option fairly early on.
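One way to get the variable set early is a BEGIN block near the top of the script. This is a sketch under the assumption that compile time, ahead of any later "use DBI" or connect call, is early enough for your setup; setting it in the environment that launches the script is another option:

```perl
use strict;
use warnings;

# Assumption: setting NLS_LANG in a BEGIN block counts as "early enough",
# since it runs at compile time, before any later "use DBI" or connect call.
BEGIN {
    $ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8';
}

# A later "use DBI;" and DBI->connect(...) would go here (omitted in this sketch).
print "NLS_LANG is $ENV{NLS_LANG}\n";
```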

I say "(reasonably) properly" because everything is fine with DBD::Oracle provided all your strings are upgraded. If you accidentally pass a downgraded string to DBD::Oracle then strange things happen.

Most of the time you can just get away with it, because most properly patrolled borders end up generating upgraded strings anyway, even when the text is entirely in the Latin-1 range.

To be sure, use utf8::upgrade on all strings which you want to pass into DBD::Oracle. Alternatively, get DBD::Oracle fixed so that it does this for you.
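As a rough sketch of that last suggestion, the helper below (upgraded() is a hypothetical name, not part of DBI or DBD::Oracle) returns upgraded copies of its arguments so they can be used as bind values. No database connection is made here; the execute call appears only as a comment:

```perl
use strict;
use warnings;

# Hypothetical helper: return upgraded copies of the given strings,
# leaving the caller's variables untouched.
sub upgraded {
    my @copies = @_;
    utf8::upgrade($_) for @copies;
    return wantarray ? @copies : $copies[0];
}

my $foo  = "Z\xFCrich";               # a downgraded string of 6 characters
my $bind = upgraded($foo);

print "before: ",    (utf8::is_utf8($foo)  ? 'upgraded' : 'downgraded'), "\n";
print "after: ",     (utf8::is_utf8($bind) ? 'upgraded' : 'downgraded'), "\n";
print "same text: ", ($foo eq $bind ? 'yes' : 'no'), "\n";

# With a real statement handle this might be used as:
#   $sth->execute(upgraded($foo));
```

Note that utf8::upgrade only changes the internal representation; the characters stay the same. It does not decode, so a string holding raw UTF-8 bytes such as "Z\xC3\xBCrich" still needs to be decoded first (for example by reading the file through the :encoding(UTF-8) layer).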


Regards,

Bill
