Re: Character encodings and databases

William Blunn Fri, 20 Jun 2014 12:26:09 -0700

On 20/06/2014 09:37, Andrew Hill wrote:

On Thu, 19 Jun 2014, William Blunn wrote:
When you read it back, you will get a string "Z\xFCrich", which willcompare different to "Z\xC3\xBCrich", and from what you describe yourprogram will update the database again.
Indeed.
I have done what I should have done in the first place and replicatedthe problem in a simple form rather than try and describe what theoriginal code was doing. So:
$ENV{NLS_LANG} = "AMERICAN_AMERICA.AL32UTF8";
$ENV{ORACLE_SID} = "blah";
$ENV{ORAPIPES} = "V2";
$ENV{ORACLE_HOME} = "/home/oracle/app/oracle/product/11.2.0/dbhome_1";
my $dbh = DBI->connect("dbi:Oracle:host=127.0.0.1;sid=blah", "blah","blah", {ora_charset => 'AL32UTF8'});
$dbh->{AutoCommit} = 0; $dbh->{RaiseError} = 1; $dbh->{PrintError} = 1;
$dbh->{FetchHashKeyName} = 'NAME_lc';
my $zurich = "Zürich";


OK.

Stop, stop, stop, stop, stop.

Whilst Perl has perfectly sane and well-behaved semantics for UTF-8 insource code, we have no idea whether you have specified "use utf8;" atthe top of your source file, or how your source code got from the fileto your e-mail program. So we cannot know for sure what data will end upin the variable $zurich.

If you /really/ know what you are doing, you /can/ put Unicodecharacters in source files, but it is generally not recommended.

The quick answer which will lead to the least confusion is to make yourPerl source files be entirely ASCII. Do this as the default, and youwill save yourself and others a lot of headaches.

Right here, right now, this is one massive source of confusion that wecould really really do without.


Change this line for these two lines:

my $zurich = "Z\xFCrich";
utf8::upgrade($zurich);

This will, quickly and reliably, set up $zurich with data which isrepresentative of what you might get out of a properly configured I/O layer.

Don't skip the second line. You shouldn't need to explicitly upgradenormally, but right here, right now, it will ensure that things arecorrect and make things work correctly.

$dbh->do("delete from foo");
my $sth = $dbh->prepare("INSERT INTO foo (bar) values (?)");
$sth->execute($zurich);
$dbh->commit;
$sth = $dbh->prepare("select bar from foo");
$sth->execute;
my$foo = ($sth->fetchrow_array)[0];
$sth->finish;
$dbh->disconnect;
print HexDump $foo;
print HexDump $zurich;

You haven't shown the code which allows the reader to determine whatHexDump is. I am guessing that it emits the individual codepoints in theinput string in hexadecimal.

This outputs:
00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F0123456789ABCDEF
00000000  5A FC 72 69 63 68 Z.rich
00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F0123456789ABCDEF
00000000  5A C3 BC 72 69 63 68 Z..rich

Whilst I can't be sure, it looks --- again --- like the database isdoing the right thing, but you are setting up your input data incorrectly.

I /suspect/ that your text editor is working in UTF-8, but you have notspecified "use utf8;" at the top of your source file.

Thus, the variable $zurich will be initialised to "Z\xC3\BCrich" (whichis not what you want), and it will be a downgrade string (which is notwhat you want).

When you write this to Oracle, you will get the odd behaviour whichhappens when it is passed a downgrade string, viz. it will treat thestring as a UTF-8 byte sequence.

This will accidentally result in the desired string "Zürich" beingwritten to the database.


When you read it back, you get "Z\FCrich", which is correct.

And:

SQL> select dump(bar) from foo;

DUMP(BAR)
--------------------------------------------------------------------------------
Typ=1 Len=7: 90,195,188,114,105,99,104
So I know that the string is stored correctly in the table,accidentally or otherwise.


Yes, that looks like the right underlying byte sequence is in the database.

Here is a test I did against an Oracle instance which is properlyconfigured with AL32UTF8:


CREATE TABLE "foo" ( "bar" VARCHAR2(100) );
INSERT INTO "foo" ("bar") VALUES ("Zürich");
SELECT "bar", LENGTH("bar"), DUMP("bar") FROM "foo";

bar : Zürich
LENGTH("bar") : 6
DUMP("bar") : Typ=1 Len=7: 90,195,188,114,105,99,104

Try my SELECT against your database to confirm the (character) length as 6.

So we can see that there are 6 characters in 7 bytes.

I know that the string is being stored correctly in $zurich,

I don't think you do. In fact your Hexdump even shows that it's beingrepresented as a UTF-8 sequence rather than the proper sequence ofcodepoints.

So it seems that the opposite is true and the string is being storedincorrectly in $zurich.

as I have eliminated IO.

No.

When Perl reads your source file that is also I/O. Without "use utf8;",any UTF-8 text will be read as UTF-8 byte sequences rather than Unicodestrings.

It appears it is specifically the act of fetching the row that causesthe "wrong" data to be returned.


No. Fetching the the data appears to be working correctly.

What have I missed?

I think perhaps you think that you expect to see UTF-8 sequences insidePerl. This is wrong. Inside Perl, strings should be a sequence of(Unicode) codepoints.

I/O is typically byte-oriented. When you read data from outside Perl,you get a UTF-8 byte sequence. This should be converted by an I/O layerinto a Perl (Unicode) string.

When you write a string out of Perl, the Perl (Unicode) string will beconverted by an I/O layer into a UTF-8 byte sequence.


Fundamentally, it is straightforward.

Try the change I suggested above and see if your test program makes moresense.

One other thing you might like to put at the top of your test programsis this:


binmode STDOUT, ':encoding(UTF-8)';

Then provided you terminal/console program is configured for UTF-8, thenany strings you print should be displayed correctly.


Regards,

Bill

Re: Character encodings and databases

Reply via email to