Dawn,

Look at chapter 6.1.1.1 in the MySQL docs:

http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#Literals

Practically the first odd thing mentioned is escape sequences. Keep in mind that the ASCII code for the backslash character, which is used to initiate escape sequences, is 0x5c.

(I work with PHP and Perl. Does anyone have a good sample of input filters from .asp files to MySQL?)

> Here's a brief summary of where we are:
> I am trying to store Japanese text (Shift_Jis) in MySQL and view it from a
> web page. The content is provided to me in Word format. I convert it to
> plain text, copy/paste into a web form in an ASP-based CMS on a Windows
> box. When viewed from a web page, seemingly random characters are
> morphed into other characters. The majority of the database contains
> rows in Latin. MySQL supports Japanese and Latin in the same table.
> Other people are able to do this without the morphing problem. My
> Regional & Language settings in Windows are set to Shift_Jis in order
> to view Shift_Jis characters in notepad and the DOS prompt. If I
> circumvent the CMS and copy/paste from notepad directly to MySQL in the
> DOS Prompt, the results are the same (although fewer characters are
> broken when viewed through DOS).
>
> For a good explanation visit this problem's web site:
> http://commworks01.barklouder.com/japan/press/broken_chars.asp

Hmm. The characters you pasted directly into the .asp file did not survive intact. I can't read that very first sample, can't even guess what that's saying. The first few words are "fourth quarter soft (something) market", but enough falls apart after that that it's hard to tell what else is missing. It looks like something about functionality being scheduled for development.

If you understand what MySQL is doing to two-byte characters which have a second byte of 0x5c, then you are ready to dig into ASP and find out if ASP wants the text escaped somehow. (Or take that question to an ASP mailing list.)

Can you post that page as pure, unserved, html and send me the link off-list? (Text, with the .htm extension, unless your server forces html through asp, too.) If I can make sense of it as pure html text, you'll be able to completely rule MSWord and the OS out.

> I conclude that one of two things may be happening:
> 1. Characters are being corrupted by virtue of the fact that their
> source of origination were copied from Word, despite the conversion to
> plain text. (At this point I do not have a plain text file with content
> typed directly into notepad....i.e. Word circumvented. I am at the mercy
> of the client's PR department.)

No real need to worry about MSWord, I think. Anyway, if, as you say below, pasting the characters in static HTML is okay, you can be sure that MSWord is giving you no problems now.

> 2. Characters are being corrupted by MySQL.

Well, sort of. Except that MySQL is not really the culprit, because the behavior in question is part of the spec, and has been for quite a while. (At least one user of MySQL wanted MySQL to change their spec to conform with Oracle's spec, but since the state of the SQL standard is a mess, it's a hard point to argue right now. But the escape sequence _is_ part of MySQL's spec.)

> If option 1 were true, then why do the characters show up fine when in a
> static HTML document? (see below).

I want to see that static HTML.

> In Response to Joel Rees:
> > I checked the text you gave me, and I found what's getting
> > clobbered. It's the latter half of characters like the katakana 'so'.
> >
> > Although the byte that is getting walked on here is 0x5c,
> > this is _not_ the escape character. It is preceded (in the
> > case of katakana 'so') by a byte of 0x83. The entire
> > character is '0x835c', and the 0x5c is being treated as if it
> > were a backslash. There are other characters that will get
> > hit by this, by the way.
>
> Question 1: It seems like a lot more characters are getting hit than
> just '0x835c'.
> How do I map the 0x835c to what the character looks like?

I said "like". I suppose I was not clear about how they would be similar. Two-byte characters with a final byte of 0x5c are going to be caught by MySQL, interpreted as "something" followed by an escape character, followed by the next byte (first byte of the next character) as a literal. In some odd situations you might end up with a control character; in others the 0x5c simply disappears, leaving the character stream corrupted. Only one byte is lost, but the final app will think that characters are starting on what is really the second byte. Once you lose one character, a whole bunch get out of sync.

For example, the sequence for "sofuto" (a modern Japanese word imported from the English "soft") is 0x835c 0x8374 0x8367. If you let MySQL interpret the 0x5c as a backslash, it thinks that you're just telling it that it should not do anything out of the ordinary with the 0x83 which follows. The result is 0x8383 0x74 0x8367, which is not too bad. 0x8383 is the small (subscript-size) 'ya', 0x74 is ASCII 't', and then things sync back in. Sometimes you won't be so lucky, and whole phrases will go out of sync.

> I don't know what 0x835c is.

It's the katakana for the syllable 'so'. It looks like a curved forward slash with a jot above it. (And distinguishing 'so' from the nasal syllable is a little difficult, but all we care about here is the encoding.) Anyway, you'll have problems with 0x815c, 0x825c, 0x845c, etc. I'm not going to tell you which aren't real characters in shift-JIS, however, because you don't really want to know that.

> Question 2: How do I handle the character escape mechanism correctly
> according to MySQL?

See the chapter I mentioned above. In the content tool, you'll need to put a filter, probably using a regular expression, on text before it goes to MySQL. This is really true anyway. If you don't have the filter, regular English text with backslashes will lose its backslashes too.

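
To make the "sofuto" example and the lost-backslash point concrete, here is a rough sketch in Python (chosen only for illustration; nothing in this thread runs on it). The function name `unescape_like_mysql` is made up, and the function is a simplified, byte-wise model of the escape rules in the manual chapter mentioned above, assuming the server reads the literal as single-byte text. It is not MySQL's actual code.

```python
# Escape sequences that map to a specific byte inside a string literal.
KNOWN_ESCAPES = {
    ord("0"): 0x00, ord("'"): 0x27, ord('"'): 0x22, ord("b"): 0x08,
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("Z"): 0x1A,
    ord("\\"): 0x5C,
}

def unescape_like_mysql(raw: bytes) -> bytes:
    """Simplified model: treat every 0x5c byte as the start of an escape."""
    out = bytearray()
    i = 0
    while i < len(raw):
        b = raw[i]
        if b == 0x5C and i + 1 < len(raw):
            nxt = raw[i + 1]
            if nxt in (ord("%"), ord("_")):
                # \% and \_ keep their backslash outside pattern matching.
                out += bytes([0x5C, nxt])
            else:
                # Known escape: substitute. Unknown: the backslash is dropped.
                out.append(KNOWN_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)

# "sofuto" in Shift_JIS: 0x835c 0x8374 0x8367
sofuto = bytes.fromhex("835c83748367")
damaged = unescape_like_mysql(sofuto)
print(sofuto.hex())   # 835c83748367
print(damaged.hex())  # 8383748367

# Decodes as small 'ya', ASCII 't', katakana 'to'.
garbled = damaged.decode("shift_jis")

# Plain English text loses its backslash the same way.
print(unescape_like_mysql(rb"C:\docs"))  # b'C:docs'
```

The same loop run over a longer passage shows why one lost byte can knock the following characters out of step: every later two-byte pair is read starting from the wrong byte until a single-byte character happens to realign it.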
What the filter will do for the backslash is simply put another backslash after it. The only trick is keeping track of where you're working with filtered text and where you aren't. If you're not careful it's easy to end up filtering twice, which doesn't work. And, judging from the .asp page, you may need to filter text going into .asp and coming out, as well.

And you need to filter stuff from your forms, too, for security purposes. That will be really off topic here, however.

--
Joel Rees
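
P.S. A minimal sketch of the backslash filter described above, in Python purely for illustration (your CMS is ASP, so read it as pseudocode for the idea). The names `escape_backslashes` and `parse_doubled` are hypothetical, and `parse_doubled` is only a stand-in for the server that is valid for input whose backslashes have already been doubled. It ignores quotes and the other escapes, so it is not a complete or safe input filter; where your database library offers its own escaping routine, that is likely the better choice.

```python
BACKSLASH = b"\x5c"

def escape_backslashes(raw: bytes) -> bytes:
    """Put a second 0x5c after every 0x5c byte, working on raw bytes."""
    return raw.replace(BACKSLASH, BACKSLASH * 2)

def parse_doubled(raw: bytes) -> bytes:
    """Stand-in for the server: collapse each doubled 0x5c back to one."""
    return raw.replace(BACKSLASH * 2, BACKSLASH)

# "sofuto" in Shift_JIS: 0x835c 0x8374 0x8367
sofuto = bytes.fromhex("835c83748367")

once = escape_backslashes(sofuto)
twice = escape_backslashes(once)  # the "filtered twice" mistake

print(once.hex())                       # 835c5c83748367
print(parse_doubled(once) == sofuto)    # True: the original bytes survive
print(parse_doubled(twice) == sofuto)   # False: an extra 0x5c is stored
```

Filtering once round-trips cleanly; filtering twice leaves a stray 0x5c in the stored data, which is why it matters to know at each step whether the text in hand has already been through the filter.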