hi mark,
What happens if you explicitly specify the table character set to be
'utf-8'? (i.e. you're relying on the database default character set to
take care of that for you right now)...
'CREATE TABLE foo CHARACTER SET utf8'
the same.
All I can say is that with the testcase I posted, it is shown that what
you put in in UTF-8 format is what you get out, byte-for-byte with no
double transformations (getBytes() _never_ uses charset information, so
comparing ResultSet.getBytes() with a String.getBytes("utf-8") shows
that the data is retrieved in UTF-8 form).
here we are in strong agreement. what you put in you get out. ;-)
anyway, i guess we are nearing the end. i have discovered a way
cool feature of the "SQL Query Plugin" (i wrongly called it SQLExplorer)
by stefan stiller.
you can switch the display-format of the query result if you like.
for example you can display your strings as bytes in different
encodings. guess what i did ;-)
let's see the results. i have only taken the family_name from my
examples and left out the russian cyrillic values and the UTF-16
representations.
a) write via a script (console or sqlexplorer)
write | read | bytes | enc
----------------------------------------------------------------
Käßsel | Käßsel | 4b c3 a4 c3 9f 73 65 6c | UTF-8
Käßsel | Käßsel | 4b e4 df 73 65 6c | ISO-8859-1
Ægÿl | Ægÿl | c3 86 67 c3 bf 6c | UTF-8
Ægÿl | Ægÿl | c6 67 ff 6c | ISO-8859-1
b) write with my test case
write | read | bytes | enc
--------------------------------------------------------------------
Käßsel | KäÃ?sel | 4b c3 83 c2 a4 c3 83 c2 9f 73 65 6c | UTF-8
Käßsel | KäÃ?sel | 4b c3 a4 c3 9f 73 65 6c | ISO-8859-1
Ægÿl | Ã?gÿl | c3 83 c2 86 67 c3 83 c2 bf 6c | UTF-8
Ægÿl | Ã?gÿl | c3 86 67 c3 bf 6c | ISO-8859-1
as you can see the values in b) are being transformed twice.
for example the 'ä' (LATIN SMALL LETTER A WITH DIAERESIS -
codepoint U+00E4) is encoded as 'c3 a4' in UTF-8. during a second
transformation somebody interprets those two bytes as an 8-bit encoding,
i.e. as the codepoints U+00C3 (LATIN CAPITAL LETTER A WITH TILDE) and
U+00A4 (CURRENCY SIGN), and encodes them again as UTF-8. the first
becomes 'c3 83' and the second 'c2 a4'.
see http://www1.tip.nl/~t876506/utf8tbl.html for the encoding-table.
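the double transformation can be reproduced without any database. what follows is only a minimal sketch, assuming the misreading step treats the UTF-8 bytes as ISO-8859-1; the class and method names (DoubleEncodingDemo, hex, doubleEncode) are made up for illustration and are not part of my test case or of the driver. it does not say where in the real setup the second transformation happens.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // format bytes as lowercase hex pairs separated by spaces
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // encode as UTF-8, misread those bytes as ISO-8859-1 (assumption),
    // then encode the misread string as UTF-8 a second time
    public static byte[] doubleEncode(String s) {
        byte[] once = s.getBytes(StandardCharsets.UTF_8);
        String misread = new String(once, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Käßsel" and "Ægÿl", written with escapes so the source file
        // encoding cannot interfere
        String[] names = { "K\u00e4\u00dfsel", "\u00c6g\u00ffl" };
        for (String name : names) {
            System.out.println("once:  " + hex(name.getBytes(StandardCharsets.UTF_8)));
            System.out.println("twice: " + hex(doubleEncode(name)));
        }
    }
}
```

for both names the "once" line corresponds to the UTF-8 rows in table a) and the "twice" line to the UTF-8 rows in table b).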
so basically we know how, but we don't know why and, even more
important, where to switch that off.
i don't think it is a database problem, as other apps work seamlessly
with the db. i don't even think it is a bug in the driver, as other
apps are using the same driver with the same connection-url i use
in my test-app.
i am quite sure it is a matter of configuring the driver or
driver-manager correctly. can you assist there?
thank you for your patience.
ciao robertj