I'm working on a PHP-based CMS that allows users to post lengthy article texts by submitting through a form. The short version of my quandary is this: How can I create a conversion routine that reliably substitutes HTML-acceptable output for high-ASCII characters pasted into the form (from a variety of operating systems)?

The longer version is this:
In order to prevent scripting vulnerabilities and a variety of other undesirable content, I run the body of the text through a cleantext() function. This function first strips out illegal HTML tags and JavaScript. So far so good.

Then it attempts to perform some character conversions to clean up 8-bit ASCII characters in the text, so that smart quotes, en- and em-dashes, ellipses, etc. are converted to suitable plain-text alternatives or to HTML entities. I'm using:

// Reference:
// chr(133) = ellipsis
// chr(145) = left curly single quote
// chr(146) = right curly single quote (apostrophe)
// chr(147) = left curly double quote
// chr(148) = right curly double quote
// chr(149) = bullet
// chr(150) = en dash
// chr(151) = em dash
// chr(153) = trademark
// chr(160) = non-breaking space
// chr(161) = inverted exclamation mark
// chr(169) = copyright symbol
// chr(171) = left guillemet
// chr(173) = soft hyphen
// chr(174) = registered trademark
// chr(187) = right guillemet
// chr(188) = 1/4 fraction
// chr(189) = 1/2 fraction
// chr(190) = 3/4 fraction
// chr(191) = inverted question mark
$changearr = array(""=>" ",
"\n\n\n" => "\n\n",
chr(160)=>" ",
// (entries for the other characters listed above are not shown)
);
$returnstr = strtr($returnstr,$changearr);
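
For illustration, a complete map of this kind might look like the sketch below. The function name convert_high_chars() and the replacements chosen are assumptions for the example, not the actual cleantext() code, and it only helps if the submitted bytes really are single-byte Windows-1252 characters.

```php
<?php
// Hypothetical helper (not the real cleantext()): maps single-byte
// Windows-1252 characters to ASCII stand-ins or HTML entities.
// Assumes $text is NOT UTF-8 encoded.
function convert_high_chars($text)
{
    $map = array(
        chr(133) => '...',      // ellipsis
        chr(145) => "'",        // left curly single quote
        chr(146) => "'",        // right curly single quote (apostrophe)
        chr(147) => '"',        // left curly double quote
        chr(148) => '"',        // right curly double quote
        chr(149) => '&bull;',   // bullet
        chr(150) => '&ndash;',  // en dash
        chr(151) => '&mdash;',  // em dash
        chr(153) => '&trade;',  // trademark
        chr(160) => ' ',        // non-breaking space
        chr(169) => '&copy;',   // copyright symbol
        chr(174) => '&reg;',    // registered trademark
    );
    // With an array argument, strtr() replaces every key it finds in one pass.
    return strtr($text, $map);
}

echo convert_high_chars('Wouldn' . chr(146) . 't you say' . chr(133)), "\n";
```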

The server's on a Linux box (RedHat 7.2, standard US installation); users can obviously post from any sort of operating system.

This routine seems to work well on Word text pasted in from my Mac (OS X 10.2.1), but I see a number of articles appearing on the site with text like:

Wouldn(TM)t you say?

(That's "Wouldn[a circumflex][Euro symbol](TM)t" instead of "Wouldn't".)

...which was almost certainly pasted in from a Windows-based Microsoft Word, and the conversion routines are failing. (And inserting even weirder characters...why would the single quote be replaced by _3_ character substitutions?)
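
One possible explanation for the three characters, offered as an assumption rather than a diagnosis: if the browser submits the form as UTF-8, the right curly quote (U+2019) arrives as the three bytes 0xE2 0x80 0x99, and a Windows-1252 display shows those bytes as a-circumflex, Euro sign, and trademark sign. None of those bytes equals chr(146), so a single-byte map would never match. The sketch below shows the bytes and a hypothetical map (convert_utf8_punctuation() is an invented name) keyed on UTF-8 sequences instead.

```php
<?php
// UTF-8 encoding of U+2019 (right single quotation mark), as raw bytes.
$quote = "\xE2\x80\x99";

echo strlen($quote), "\n";   // number of bytes in the string
echo bin2hex($quote), "\n";  // the bytes in hexadecimal

// Hypothetical helper: same idea as a chr()-based map, but keyed on
// multi-byte UTF-8 sequences. Assumes $text IS UTF-8 encoded.
function convert_utf8_punctuation($text)
{
    $map = array(
        "\xE2\x80\x98" => "'",        // U+2018 left single quote
        "\xE2\x80\x99" => "'",        // U+2019 right single quote
        "\xE2\x80\x9C" => '"',        // U+201C left double quote
        "\xE2\x80\x9D" => '"',        // U+201D right double quote
        "\xE2\x80\x93" => '&ndash;',  // U+2013 en dash
        "\xE2\x80\x94" => '&mdash;',  // U+2014 em dash
        "\xE2\x80\xA6" => '...',      // U+2026 ellipsis
    );
    return strtr($text, $map);
}

echo convert_utf8_punctuation("Wouldn\xE2\x80\x99t you say?"), "\n";
```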

I understand that Windows may well use a different character set for high-ASCII, but I frankly don't understand how to work that knowledge into this situation. And the combination of original text, Linux, chr(), and ord() stuff just doesn't make sense to me. For example, if I post text (from my Mac) containing only:

(that's [open-double-quote][close-double-quote][open-single-quote][close-single-quote][ellipsis])

and have PHP run this:

$mailstr = '';
for ($x = 0; $x < strlen($str); $x++) {
    $mailstr .= $str[$x].' is '.ord($str[$x])."\n";
}

I get mail that says (in parentheses is a description of the character):

is 147 (accent-grave-i)
is 148 (circumflex-i)
is 145 (umlaut-e)
is 146 (accent-acute-i)
is 133 (umlaut capital o)

...which means that PHP "recognizes" the correct ASCII value (147) of a double-quote, though my Linux box seems to think that the character is a lowercase "i" with a grave accent on it. With this kind of strange sub-conversion going on, I'm not all that surprised that things are getting mucked up.
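
The odd descriptions may come from the mail reader rather than from PHP or Linux. In Windows-1252, bytes 147, 148, 145, 146, and 133 are the curly quotes and the ellipsis, while in the MacRoman character set the same byte values display as the accented letters listed above. That assumes the mail is being read as MacRoman, which I have not verified. A report that prints only numbers avoids the ambiguity; dump_bytes() below is a hypothetical helper sketched for that purpose.

```php
<?php
// Hypothetical helper: describe each byte numerically, so the character
// set used by whatever displays the report cannot change its meaning.
function dump_bytes($str)
{
    $lines = array();
    for ($x = 0; $x < strlen($str); $x++) {
        $byte = ord($str[$x]);
        $lines[] = sprintf('byte %d: dec %d, hex 0x%02X', $x, $byte, $byte);
    }
    return implode("\n", $lines) . "\n";
}

// Two raw bytes standing in for pasted open and close double quotes.
echo dump_bytes(chr(147) . chr(148));
```

If the same pasted text produces one byte per character in this report, it arrived in a single-byte character set; if a curly quote shows up as three bytes, it arrived as UTF-8.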

Is there some way of getting pasted Word text from Windows "clean" in this manner, as well as accommodating the already-working-right Mac Word text?
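
A sketch of one way this might be approached, under the assumption that submissions arrive either as UTF-8 or as single-byte Windows-1252: guess the encoding first, then apply the matching map. Both function names are hypothetical, the maps are cut down to quotes only, and the guess can be wrong for unusual single-byte text that happens to form valid UTF-8.

```php
<?php
// Hypothetical helper: guess whether the bytes are valid UTF-8.
// preg_match() with the /u modifier returns false for invalid UTF-8
// and 1 when the empty pattern matches a valid string.
function looks_like_utf8($text)
{
    return preg_match('//u', $text) === 1;
}

// Hypothetical helper: straighten curly quotes for either encoding.
function normalize_quotes($text)
{
    if (looks_like_utf8($text)) {
        // Multi-byte UTF-8 sequences for U+2018, U+2019, U+201C, U+201D.
        $map = array(
            "\xE2\x80\x98" => "'",
            "\xE2\x80\x99" => "'",
            "\xE2\x80\x9C" => '"',
            "\xE2\x80\x9D" => '"',
        );
    } else {
        // Single-byte Windows-1252 values for the same characters.
        $map = array(
            chr(145) => "'",
            chr(146) => "'",
            chr(147) => '"',
            chr(148) => '"',
        );
    }
    return strtr($text, $map);
}

echo normalize_quotes("Wouldn\xE2\x80\x99t you say?"), "\n"; // UTF-8 input
echo normalize_quotes('Wouldn' . chr(146) . 't you say?'), "\n"; // 1252 input
```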


a.h.s. boy
dadaIMC support

PHP General Mailing List (http://www.php.net/)