On 20/06/2014 09:37, Andrew Hill wrote:
On Thu, 19 Jun 2014, William Blunn wrote:

When you read it back, you will get a string "Z\xFCrich", which will compare different to "Z\xC3\xBCrich", and from what you describe your program will update the database again.

Indeed.

I have done what I should have done in the first place and replicated the problem in a simple form rather than try and describe what the original code was doing. So:

$ENV{NLS_LANG} = "AMERICAN_AMERICA.AL32UTF8";
$ENV{ORACLE_SID} = "blah";
$ENV{ORAPIPES} = "V2";
$ENV{ORACLE_HOME} = "/home/oracle/app/oracle/product/11.2.0/dbhome_1";
my $dbh = DBI->connect("dbi:Oracle:host=127.0.0.1;sid=blah", "blah", "blah", {ora_charset => 'AL32UTF8'});
$dbh->{AutoCommit} = 0; $dbh->{RaiseError} = 1; $dbh->{PrintError} = 1;
$dbh->{FetchHashKeyName} = 'NAME_lc';
my $zurich = "Zürich";

OK.

Stop, stop, stop, stop, stop.

Whilst Perl has perfectly sane and well-behaved semantics for UTF-8 in source code, we have no idea whether you have specified "use utf8;" at the top of your source file, or how your source code got from the file to your e-mail program. So we cannot know for sure what data will end up in the variable $zurich.

If you /really/ know what you are doing, you /can/ put Unicode characters in source files, but it is generally not recommended.

The quick answer which will lead to the least confusion is to make your Perl source files be entirely ASCII. Do this as the default, and you will save yourself and others a lot of headaches.

Right here, right now, this is one massive source of confusion that we could really really do without.

Change this line for these two lines:

my $zurich = "Z\xFCrich";
utf8::upgrade($zurich);

This will, quickly and reliably, set up $zurich with data which is representative of what you might get out of a properly configured I/O layer.

Don't skip the second line. You shouldn't need to explicitly upgrade normally, but right here, right now, it will ensure that things are correct and make things work correctly.

$dbh->do("delete from foo");
my $sth = $dbh->prepare("INSERT INTO foo (bar) values (?)");
$sth->execute($zurich);
$dbh->commit;
$sth = $dbh->prepare("select bar from foo");
$sth->execute;
my$foo = ($sth->fetchrow_array)[0];
$sth->finish;
$dbh->disconnect;
print HexDump $foo;
print HexDump $zurich;

You haven't shown the code which allows the reader to determine what HexDump is. I am guessing that it emits the individual codepoints in the input string in hexadecimal.

This outputs:

00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F 0123456789ABCDEF

00000000  5A FC 72 69 63 68 Z.rich
00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F 0123456789ABCDEF

00000000  5A C3 BC 72 69 63 68 Z..rich

Whilst I can't be sure, it looks --- again --- like the database is doing the right thing, but you are setting up your input data incorrectly.

I /suspect/ that your text editor is working in UTF-8, but you have not specified "use utf8;" at the top of your source file.

Thus, the variable $zurich will be initialised to "Z\xC3\BCrich" (which is not what you want), and it will be a downgrade string (which is not what you want).

When you write this to Oracle, you will get the odd behaviour which happens when it is passed a downgrade string, viz. it will treat the string as a UTF-8 byte sequence.

This will accidentally result in the desired string "Zürich" being written to the database.

When you read it back, you get "Z\FCrich", which is correct.

And:

SQL> select dump(bar) from foo;

DUMP(BAR)
--------------------------------------------------------------------------------
Typ=1 Len=7: 90,195,188,114,105,99,104


So I know that the string is stored correctly in the table, accidentally or otherwise.

Yes, that looks like the right underlying byte sequence is in the database.

Here is a test I did against an Oracle instance which is properly configured with AL32UTF8:

CREATE TABLE "foo" ( "bar" VARCHAR2(100) );
INSERT INTO "foo" ("bar") VALUES ("Zürich");
SELECT "bar", LENGTH("bar"), DUMP("bar") FROM "foo";

bar : Zürich
LENGTH("bar") : 6
DUMP("bar") : Typ=1 Len=7: 90,195,188,114,105,99,104

Try my SELECT against your database to confirm the (character) length as 6.

So we can see that there are 6 characters in 7 bytes.

I know that the string is being stored correctly in $zurich,

I don't think you do. In fact your Hexdump even shows that it's being represented as a UTF-8 sequence rather than the proper sequence of codepoints.

So it seems that the opposite is true and the string is being stored incorrectly in $zurich.

as I have eliminated IO.

No.

When Perl reads your source file that is also I/O. Without "use utf8;", any UTF-8 text will be read as UTF-8 byte sequences rather than Unicode strings.

It appears it is specifically the act of fetching the row that causes the "wrong" data to be returned.

No. Fetching the the data appears to be working correctly.

What have I missed?

I think perhaps you think that you expect to see UTF-8 sequences inside Perl. This is wrong. Inside Perl, strings should be a sequence of (Unicode) codepoints.

I/O is typically byte-oriented. When you read data from outside Perl, you get a UTF-8 byte sequence. This should be converted by an I/O layer into a Perl (Unicode) string.

When you write a string out of Perl, the Perl (Unicode) string will be converted by an I/O layer into a UTF-8 byte sequence.

Fundamentally, it is straightforward.

Try the change I suggested above and see if your test program makes more sense.

One other thing you might like to put at the top of your test programs is this:

binmode STDOUT, ':encoding(UTF-8)';

Then provided you terminal/console program is configured for UTF-8, then any strings you print should be displayed correctly.

Regards,

Bill

Reply via email to