> Lastly, I am convinced that this is a problem with MySQL.

I have some time again, let me try again. It's a tortuous tale, but bear
with me.

Background in brief:

Shift-JIS is a real pain in the neck to parse. When you look at a byte,
it's often impossible to tell whether you're looking at the first byte
or the second byte of the character.

Shift-JIS has this almost-neat feature where they shoe-horned US-ASCII,
with the two exceptions I've mentioned before, into the lowest 128 codes.
_If_ _you_ _parse_ _from_ _the_ _beginning_, you can deterministically
pick off each one-byte character and each two byte character. 

Unfortunately, if line-noise causes you to miss a byte, or if you start
anywhere but the beginning of a string, you can run into a lot of
situations where you can't tell whether you're in the middle of a
character or not.

The byte value of 0x5c, which is that wonderful backslash which MySQL
(and C, PHP, Java, et. al.) uses as an escape character, is one case in
point. It can be either a single byte of backslash, or it can be the end
byte of one of several valid Japanese characters.

*** moderately important stuff ***

I believe, if you set up MySQL to operate in shift-JIS, it parses the
characters correctly, in addition to handling sort and collation
"correctly". But that's kind of beside the point here. If my memory here
is correct, you can set your whole database (well, your instance of
MySQL) to assume the shift-JIS character set, and MySQL will properly
ignore any 0x5c that is not a backslash.

*** end moderately important stuff ***

(unimportant stuff)
(The fact that the backslash shows up as the yen symbol in your browser
when looking at a shift-JIS page is a red herring. I shouldn't have
brought it up. Ignore it. Another piece of information you didn't need,
and should ignore, is that shift-JIS double-wide backslash is 0x815f. It
is not used as an escape character, so just forget I told you its value.
Don't bother looking for it in your text. It has nothing to do with
what's going on. Incidentally, its end byte is the same value as
US-ASCII underscore, which you also don't need to know.)
(end unimportant stuff)

But, you are trying to use multiple languages in a single database. It
so happens that, if you play by the rules (which are a little difficult
to keep straight the first time) you can get away with this. Why?
Because you are not sorting or collating, just getting the data out by
keys that have nothing to do with shift-JIS.

In either the current MySQL 4.0 or the next version, I don't remember
which, I am told you will be able to set the language for each table.
This will be marvelous. A lot of peace of mind, and great help, in fact.
But not necessary for your current project, since you aren't collating
or sorting on Japanese.

****** extremely important stuff ******

The issue you face is how to avoid the 0x5c being treated as an escape
character.

But it doesn't matter whether the text is Japanese or not.

****** end extremely important stuff ******

Let me explain that. You are taking text from an ordinary source and
pasting it into MySQL (through the command line, shall we say). You
should never have any escapes in that text. If you see a \, it is merely
a backslash. \t, if it were in your text, would not mean the tab
character. \n would not mean a newline. A quote would not be the
beginning of a string, and a \" would be just that, a backslash followed
by a double quote. See

    http://www.mysql.com/doc/en/String_syntax.html 

Note carefully that this page tells you exactly what characters will
require escaping. (Sorry I can't provide the code. It shouldn't be that
hard to write, or maybe someone else on the list who uses .asp will
volunteer. As a hints, I'll give you a bit of untested C below.)

If MySQL allows you to turn off character escaping, you would want to
simply turn it off. As I understand, it doesn't. So you need to catch
NULs, backslashes, and single and double quotes, and stick a(nother)
backslash in front of them before you pass them to the database.

(This should feel a lot like what you do when picking up text from a
form on the web to go into an html page. But it is done at a different
time, and the characters to escape are different.)

****** crux-of-the-matter ******

This is crude, it feels like a real kludge, but it works. 

Since your database is not set up to parse shift-JIS, it is absolutely
going to think that every 0x5c it sees is a backslash, even when preceded
by 0x83 or some other lead byte of a valid shift-JIS character. It
doesn't know for shift-JIS. Because you haven't set it up to care, it
doesn't care at all, and just basically ignores the whole multi-byte
character business.

So, you just treat the 0x5c end-bytes like backslashes. Escape them like
they were real backslashes, and the escaping backslash gets eaten on its
way into the database.

****** end crux-of-the-matter ******

0x835c, for instance: Your escaping routine doesn't need to pay any
attention to the 0x83 (or whatever the lead byte might have been). All
it needs to do is see the 0x5c and paste a 0x5c in front of it. MySQL
sees the filtered 0x835c5c and strips the first escape character. What
goes into the database is 0x835c. 

I make no claims about the following code, it is merely illustrative:

void escapeForMySQL( char source[], char buffer[], int count )
{
    int ch, i;

    j = 0;
    for ( i = 0; i < count; i = i + 1 )
    {
        ch = source[ i ];
        if ( ( ch == '\\' )  /* C also uses backslash */
             or ( ch == '\'' ) /* (C++ syntax) */ 
             or ( ch == '"' )
             or ( ch == '\0' ) )
        {
            buffer[ j ] = '\\';
            j = j + 1;
        }
        buffer[ j ] = source[ i ];
        j = j + 1;
    }
}

It's not my usual style, but I think it should be easy to translate into
VBScript and test. 

If I messed that up, someone correct me?

If you know regular expressions, the RE should be straightforward. 

MySQL provides the escaping function as part of the C API for MySQL, by
the way. You might want to look at the source for that if you are not
comfortable with the above, or if you just like to read source.

Anyway, you should do something like the above in between getting the
text from your content tool's form and putting it into MySQL.

I'd give you a link to some of my rants about characters, but you really
shouldn't need to be distracted by that right now.

Now that I've run you through the wringer with MySQL's escaping, I have
some bad news for you. The above will fix the problem with getting your
data into MySQL. I checked your screenshot and it definitely will fix
the problem with getting your data into MySQL.

But there is something else going on between getting the data out of
MySQL and getting it into your pages. 0x836c somehow morphed to 0x8e4c.
Or was that the other direction? That won't be a topic for the MySQL
list, however. 

-- 
Joel Rees <[EMAIL PROTECTED]>


---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php

Reply via email to