Character Sets, 4.0 and 4.1

Bruce Dembecki Fri, 04 Feb 2005 06:56:03 -0800

So today, for the second time in six weeks, we are faced with rolling back to
MySQL 4.0 because of dramas with character sets. I don't know about anyone
else, but this supposedly wonderful feature has been nothing but a nightmare
for us.


So our application servers use Unicode for our non-US-English products, and
they talk to MySQL through Connector/J with a flag set to use Unicode in the
JDBC config.

First time around we just dumped the data and then imported it into the 4.1
instance. Everything looked good, but it wasn't. The German folks were
complaining their various umlauts and so on were missing, and there was
more. Of course we had been told to just bring the data over to MySQL 4.1 and
we'd have no problems, so that is what we did, and because we didn't specify a
character set for the import, we got latin1, and our German and Chinese
and... all broke.
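
As a rough illustration of how a character set mismatch can mangle text like
this, here is a minimal Python sketch. It is not a reproduction of what MySQL
4.1 did to our data; it only assumes that text was written as UTF-8 bytes and
later read back as if it were latin1. The function names and the sample word
are hypothetical.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the same bytes as latin1.

    This mimics a reader that assumes the wrong character set.
    """
    return text.encode("utf-8").decode("latin-1")


def reread_as_utf8(garbled: str) -> str:
    """Undo misread_as_latin1, as long as the bytes were never altered."""
    return garbled.encode("latin-1").decode("utf-8")


# Hypothetical sample: a German name containing an umlaut.
garbled = misread_as_latin1("Müller")   # the umlaut becomes two characters
restored = reread_as_utf8(garbled)      # back to the original text
```

In this sketch the bytes survive the misreading, so the damage is reversible.
Once a conversion replaces characters it cannot map, the original is gone.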

So six weeks of trial and experimentation later and we try for another
update. This time in our create database statement when we begin to import
the database, we set the default character set to utf8 for everything. Now
after the import our Germans and Chinese folks still get the results they
expect.

A day later and we are getting complaints from Hong Kong that there are a
whole bunch of messages appearing on their discussions with no message body.
We look at the backend and right there in the database the messages are
sitting and the body consists of exactly one space. Whatever content was
sent to us was turned into one space. We look at it and we see that there are
more than a few messages that got migrated from 4.0 to 4.1 and their message
bodies are also one space. Not all messages, just some. Not all messages
from any individual user, just some... The 4.0 version of the data has
content that consists of more than a single space... Can't quite tell what
it is, but there's content there in 4.0 that disappears in 4.1.
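
One way to narrow down which rows are affected would be to pull the raw bytes
of each message body out of the 4.0 data and check whether they are valid
UTF-8. The Python sketch below assumes the rows are already available as
(id, raw bytes) pairs; how they get exported is left open, and the function
name and sample rows are hypothetical. It only flags candidates; it does not
prove these are the rows that 4.1 turned into a space.

```python
from typing import Iterable, List, Tuple


def find_non_utf8(rows: Iterable[Tuple[int, bytes]]) -> List[int]:
    """Return the ids of rows whose body is not valid UTF-8."""
    suspects = []
    for row_id, body in rows:
        try:
            body.decode("utf-8")  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            suspects.append(row_id)
    return suspects


# Hypothetical sample rows: one valid UTF-8 body, one Shift_JIS body,
# and one latin1 body containing an umlaut.
sample_rows = [
    (1, "hello".encode("utf-8")),
    (2, "日本語".encode("shift_jis")),
    (3, "ü".encode("latin-1")),
]
suspect_ids = find_non_utf8(sample_rows)
```

Rows flagged this way could then be compared against the 4.1 rows whose body
is a single space.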

So I understand that having multiple character sets is a good thing, but to
be honest, I pretty much thought we had it in 4.0. We told the JDBC to use
Unicode and away we went... Clearly someone was using something that wasn't
Unicode (some of the comments suggest that there is some Japanese in the
missing messages, but I can't tell), and for whatever reason MySQL 4.1
decided it should be replaced with a space character.

I'm probably missing the point of the character set support along the way
somewhere... But I need to know how to fix this (I understand that's
difficult when all I have left is one blank space and don't know how to
reproduce the problematic data). What did I miss in the simple "open your
data files with 4.1 and it's good to go" instructions... What character set
performs the same as MySQL 4.0, where it didn't care what character set you
gave it, it would accept it? Can we have a character set that will give us
this functionality?

And why are we taking input data on an import, and by the looks of it on an
insert, and turning it into a single space? Can't we do something better
with the data?

4.0 worked for us with products in 20+ languages. It worked with no great
effort and no problems... Now we have the new enhanced version which
provides "better" support for international character sets, and we find
ourselves with lost data from the moment we import, and user posts
disappearing as they come in. What do we do to not have this problem?

Best Regards, Bruce


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]
