On 19/06/2014 15:58, Andrew Hill wrote:
My code has an extremely annoying bug that I can't quite solve.

The concept is simple - read some text from a text file; update a database table based on that text.

The text file is UTF8 and the database is Oracle 11g.

I am reading the file with a normal
open FILE, "<blah";
while(<FILE>) {
    chomp;
    $foo = $_;
}

This code is likely to lead to problems.

If you open a UTF-8 file without specifying a layer, then when you read the file you will get downgrade strings containing the UTF-8 bytes, whereas what you want is upgraded strings containing the characters.

So in the case of a file containing a UTF-8 representation of "Zürich", $foo will be left with "Z\xC3\xBCrich" (and it will be a downgrade string, i.e. with Perl's internal UTF-8 flag off).

Consider instead doing something like this:

my $filename = 'blah;
open my $file, '<:encoding(UTF-8)', $filename or die "Can't open $filename: $!";

In the case of a file containing a UTF-8 representation of "Zürich", the resulting will be a string "Z\xFCrich" (and it will be an upgraded string, i.e. with Perl's internal UTF-8 flag on), which is what you want.

Then I select the VARCHAR2 field from the table into $bar, do a straight string comparison between $foo and $bar, and if they are different, I update the table with the value of $foo and output a debugging line to say that, for example, Z<splodge>rich has been updated to Zürich.

It looks like you are using Oracle. (Probably best to state that clearly in questions like this.)

DBD::Oracle does "interesting" things when you try to send downgrade strings to the database. Empirically, it seems to treat the downgrade string as a UTF-8 byte sequence. So in your case this will mean you accidentally end up writing the "right" thing to the database.

However, the next time I read Zürich from the file, I get exactly the same behaviour, ie $bar is again Z<splodge>rich, therefore $foo ne $bar and it updates the table again. I don't understand why $foo ne $bar, given I've just set the field to $foo.

When you read it back, you will get a string "Z\xFCrich", which will compare different to "Z\xC3\xBCrich", and from what you describe your program will update the database again.

You may also be getting strange behaviour with your debug output, and the pathway between your data and your monitor may not treating UTF-8 in a way which is consistently useful. Consider passing your debug output through Data::Dump::pp so that you can properly see what's going on.

So, as I see it, these are the possible causes:
1. Data is not being stored in the database as UTF8 - not sure how to check when Perl is the only tool available to query it

It is and it isn't. If you want to see what's really in your strings then you can use:

use feature 'say';
use Data::Dump 'pp';

say '$x contains ', pp($x);
say '$x is ', utf8::is_utf8($x) ? 'upgraded' : 'downgrade';

2. Conversion is occuring in the DBD driver
3. Something else because I've been staring at it for so long

FWIW, NLS_CHARACTERSET is AL32UTF8 and $ENV{NLS_LANG} is AMERICAN_AMERICA.AL32UTF8

Brilliant! You appear to have set the correct options for getting DBD::Oracle to do Unicode (reasonably) properly. That is actually the hard part :-)

Though you need to ensure that $ENV{NLS_LANG} is set to a suitable AL32UTF8 option fairly early on.

I say "(reasonably) properly" because everything is fine with DBD::Oracle provided all your strings are upgraded. If you accidentally pass a downgrade string to DBD::Oracle then strange things happen.

Most of the time you can just get away with it, because most properly patrolled borders end up generating upgraded strings anyway, even when the text is entirely in the Latin-1 range.

To be sure, use utf8::upgrade on all strings which you want to pass into DBD::Oracle. Alternatively, get DBD::Oracle fixed so that it does this for you.

Regards,

Bill

Reply via email to