This isn't really a problem with MySQL; MySQL just trips on it.

Let me caution you that this discussion applies to Windows. I did observe
this in my applications on Linux, but the problem will show itself only when
reading a file, whether uploaded or not; there will be no problem if the
data is entered into a form in a web browser. I did most of my testing with
0x96, the Windows n-dash.

- Trying to INSERT a "funky" character will give you a "data truncated"
message from a 4.x server, or a complaint about the character not being
valid in the target character set from a 5.x server. The error message from
the 5.x server will point you directly to the problem character.

- The ultimate cause of the problem is that Windows applications use the
CP-1252 character set. It is similar to, but not identical to, UTF-8.

- MySQL does not recognize the CP-1252 character set, at least not out of
the box. Therefore you cannot get MySQL to convert CP-1252 characters to
UTF-8. You could try CP-1251, I didn't test that.

- Sourcing a file containing one of these troublesome characters into the
MySQL CLI will trigger the problem, because the data is sucked in as-is.

- Copying and pasting into the CLI will sometimes avoid the problem because
Windows automatically transliterates some characters during that operation.

- Windows does NOT TRANSLATE the characters. Note that I said
"transliterates". The 0x96 character, the CP-1252 n-dash, is silently
changed to 0x2D: a hyphen when you paste it into a console application.

- The reason that data entered into a web form works is that IE (and I
suppose other browsers) maps the CP-1252 characters into their corresponding
UTF-8 characters. Thus 0x96, when pasted into a browser, turns into 0xe28093
when pushed back to the server. That three-byte sequence is the UTF-8 n-dash
character.

What we have chosen to do is to transliterate the most common of these
troublesome characters ourselves. In PHP, when working with a file it looks
like this:

        $cp1252_special_chars = array(
                "\x96" => "-",
                "\x97" => "--",
                "\x91" => "\\'",
                "\x92" => "\\'",
                "\x85" => "...",
                "\x93" => "\"",
                "\x94" => "\""
        );

On the web server side (UTF-8) it looks like this:

                $utf_special_chars = array(
                        "\xe2\x80\x93" => "-",
                        "\xe2\x80\x94" => "--",
                        "\xe2\x80\x98" => "\\'",
                        "\xe2\x80\x99" => "\\'",
                        "\xe2\x80\xa6" => "...",
                        "\xe2\x80\x9c" => "\"",
                        "\xe2\x80\x9d" => "\""
                );

This is not a perfect solution for everyone, but it suits our needs. The
data we work with comes in higgledy-piggledy, and we want it consistent
whether someone sends us a data file or pastes a string into one of our
forms.

If somebody (not I, given my level of knowledge and available time) adds
CP-1252 to the character sets available with MySQL, then you could use
CONVERT() with USING to handle this (assuming that you knew ahead of time
where your data was coming from).

I hope this helps, I grew a few rings and shed a lot of bark to get here.

Regards,

Jerry Schwartz
The Infoshop by Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com
www.giiexpress.com
www.etudes-marche.com





-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to