I'm working on a PHP-based CMS that allows users to post lengthy
article texts by submitting through a form. The short version of my
quandary is this: How can I create a conversion routine that reliably
substitutes HTML-acceptable output for high-ASCII characters pasted
into the form (from a variety of operating systems)?
The longer version is this:
In order to prevent scripting vulnerabilities and a variety of other
undesirable content, I run the body of the text through a cleantext()
function. This function first strips out illegal HTML tags and
JavaScript. So far so good.
Then it attempts to perform some character conversions to clean up
8-bit ASCII characters in the text, so smart quotes, en- and em-dashes,
ellipses, etc. are converted to suitable alternatives, or to HTML
entities. I'm using:
// Reference:
// chr(133) = ellipsis
// chr(145) = left curly single quote
// chr(146) = right curly single quote (apostrophe)
// chr(147) = left curly double quote
// chr(148) = right curly double quote
// chr(149) = bullet
// chr(150) = en dash
// chr(151) = em dash
// chr(153) = trademark
// chr(160) = non-breaking space
// chr(161) = inverted exclamation mark
// chr(169) = copyright symbol
// chr(171) = left guillemet
// chr(173) = soft hyphen
// chr(174) = registered trademark
// chr(187) = right guillemet
// chr(188) = 1/4 fraction
// chr(189) = 1/2 fraction
// chr(190) = 3/4 fraction
// chr(191) = inverted question mark
$changearr = array("�"=>" ",
"\r"=>"\n",
"\r\n"=>"\n",
"\n\n\n" => "\n\n",
chr(133)=>"...",
chr(145)=>"'",
chr(146)=>"'",
chr(147)=>"\"",
chr(148)=>"\"",
chr(149)=>"*",
chr(150)=>"-",
chr(151)=>"--",
chr(153)=>"(TM)",
chr(160)=>" ",
chr(161)=>"&iexcl;",
chr(169)=>"&copy;",
chr(171)=>"&laquo;",
chr(173)=>"-",
chr(174)=>"(R)",
chr(187)=>"&raquo;",
chr(188)=>"1/4",
chr(189)=>"1/2",
chr(190)=>"3/4",
chr(191)=>"&iquest;");
$returnstr = strtr($returnstr,$changearr);
The server's on a Linux box (RedHat 7.2, standard US installation);
users can obviously post from any sort of operating system.
This routine seems to work well on Word text pasted in from my Mac (OS
X 10.2.1), but I see a number of articles appearing on the site with
text like:
Wouldnâ€(TM)t you say?
(That's "Wouldn[a circumflex][Euro symbol](TM)t" instead of "Wouldn't".)
...which was almost definitely pasted in from a Windows-based Microsoft
Word, and the conversion routines are failing. (And inserting even
weirder characters...why would the single quote be replaced by _3_
character substitutions?)
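One possible explanation for the three-character substitution, offered as an assumption rather than something the post confirms: if the browser submitted the form as UTF-8, the right curly single quote (U+2019) arrives as the three bytes 0xE2 0x80 0x99 instead of the single byte chr(146). The last of those bytes is 153, which the table turns into "(TM)", while the first two pass through untouched and display as a-circumflex and the Euro sign under Windows-1252. A minimal sketch that reproduces this with a trimmed-down copy of the table:

```php
<?php
// Assumption: the browser sent UTF-8, so U+2019 (right curly single quote)
// arrives as three bytes rather than the single byte chr(146).
$utf8Apostrophe = "\xE2\x80\x99";
$input = "Wouldn" . $utf8Apostrophe . "t you say?";

// Trimmed-down copy of the table from the post (two entries only).
$changearr = array(
    chr(146) => "'",     // Windows-1252 right curly single quote
    chr(153) => "(TM)",  // Windows-1252 trademark sign
);

$output = strtr($input, $changearr);

// Only the third byte (0x99 = 153) matches a table entry;
// 0xE2 and 0x80 are left in place. Print hex to make the bytes visible.
echo bin2hex($output), "\n";
```

If this is what is happening, the byte-for-byte table is being applied to text that is not in a single-byte encoding at all.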
I understand that Windows may well use a different character set for
high-ASCII, but I frankly don't understand how to work that knowledge
into this situation. And the combination of original text, Linux,
chr(), and ord() stuff just doesn't make sense to me. For example, if I
post text (from my Mac) containing only:
“”‘’…
(that's
[open-double-quote][close-double-quote][open-single-quote][close-
single-quote][ellipsis])
and have PHP run this:
$mailstr = ''; // initialize before appending
for ($x = 0; $x < strlen($str); $x++) {
$mailstr .= $str[$x].' is '.ord($str[$x])."\n";
}
mail('me','Characters',$mailstr);
I get mail that says (in parentheses is a description of the character):
ì is 147 (accent-grave-i)
î is 148 (circumflex-i)
ë is 145 (umlaut-e)
í is 146 (accent-acute-i)
Ö is 133 (umlaut capital o)
...which means that PHP "recognizes" the correct ASCII value (147) of a
double-quote, though my Linux box seems to think that the character is
a lowercase "i" with a grave accent on it. With this kind of strange
sub-conversion going on, I'm not all that surprised that things are
getting mucked up.
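The accented letters in that mail match what Mac Roman assigns to those byte values, so one plausible reading (an assumption based on the glyphs described, since the post does not name the mail client's charset) is that the bytes are unchanged and only the display differs. A small sketch, with a hypothetical helper describe_bytes() that reports numbers and hex instead of echoing the raw byte, plus a lookup of how the two charsets label the same values:

```php
<?php
// Hypothetical helper (not from the post): list each byte as a number and
// in hex, so the report does not depend on how a mail client draws the byte.
function describe_bytes($str)
{
    $report = '';
    for ($x = 0; $x < strlen($str); $x++) {
        $byte = ord($str[$x]);
        $report .= sprintf("byte %d is %d (0x%02X)\n", $x, $byte, $byte);
    }
    return $report;
}

// The same byte values, labelled under two single-byte charsets.
// Assumption: the mail client decoded the message as Mac Roman.
$labels = array(
    147 => array('left curly double quote',  'i with grave accent'),
    148 => array('right curly double quote', 'i with circumflex'),
    145 => array('left curly single quote',  'e with diaeresis'),
    146 => array('right curly single quote', 'i with acute accent'),
    133 => array('ellipsis',                 'capital O with diaeresis'),
);

foreach ($labels as $byte => $names) {
    printf("%d: Windows-1252 = %s; Mac Roman = %s\n", $byte, $names[0], $names[1]);
}

echo describe_bytes(chr(147) . chr(148));
```

Reporting hex values keeps the diagnosis independent of whatever charset the receiving program assumes.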
Is there some way of getting pasted Word text from Windows "clean" in
this manner, as well as accommodating the already-working-right Mac
Word text?
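One way this could be approached, sketched under the assumption that submissions arrive either as UTF-8 or as Windows-1252: check whether the text is valid UTF-8 and choose the replacement table accordingly. The function name normalize_punctuation() is hypothetical, and only a few of the characters from the reference list are covered.

```php
<?php
// Hypothetical function; a sketch, not the CMS's actual cleantext().
// Assumption: input is either UTF-8 or Windows-1252.
function normalize_punctuation($str)
{
    // UTF-8 byte sequences for some of the punctuation listed in the post.
    $utf8Map = array(
        "\xE2\x80\xA6" => "...",  // U+2026 ellipsis
        "\xE2\x80\x98" => "'",    // U+2018 left curly single quote
        "\xE2\x80\x99" => "'",    // U+2019 right curly single quote
        "\xE2\x80\x9C" => "\"",   // U+201C left curly double quote
        "\xE2\x80\x9D" => "\"",   // U+201D right curly double quote
        "\xE2\x80\x93" => "-",    // U+2013 en dash
        "\xE2\x80\x94" => "--",   // U+2014 em dash
    );

    // The same characters as single Windows-1252 bytes.
    $cp1252Map = array(
        chr(133) => "...",
        chr(145) => "'",
        chr(146) => "'",
        chr(147) => "\"",
        chr(148) => "\"",
        chr(150) => "-",
        chr(151) => "--",
    );

    // preg_match() with the /u modifier returns false when the subject
    // is not valid UTF-8, and 1 here when it is.
    $isUtf8 = (preg_match('//u', $str) === 1);

    return strtr($str, $isUtf8 ? $utf8Map : $cp1252Map);
}

echo normalize_punctuation("Wouldn\xE2\x80\x99t you say?"), "\n";
echo normalize_punctuation("Wouldn" . chr(146) . "t you say?"), "\n";
```

Keeping the two tables separate matters because in UTF-8 text a byte such as 169 can be the second half of an ordinary accented letter, so applying the single-byte table to it would corrupt the character. A short Windows-1252 string can occasionally also be valid UTF-8, so the check is a heuristic.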
Cheers,
spud.
-------------------------------------------------------------
a.h.s. boy
[EMAIL PROTECTED]
dadaIMC support
http://www.dadaimc.org/
-------------------------------------------------------------
--
PHP General Mailing List (http://www.php.net/)

