On Sunday 13 November 2011 01:32:40, Petko Yotov wrote:
> There are indeed problems with some characters such as typographical
> apostrophes and dashes, and yes, they are different from normal
> apostrophes.
...
> For some reason, the browsers don't treat these characters the same way as
> PHP does. The PHP iconv() function, like the `iconv` system program,
> appears unable to convert these characters so that the browsers display
> them correctly.

I should add the utf8_encode() function to that list.

These characters appear to be non-standard, or more precisely, to come from a
different standard. The code points 128-159 (0x80-0x9F) are not defined in the
ISO-8859-1 charset; they are defined in the Windows-1252 charset:

  https://en.wikipedia.org/wiki/ISO-8859-1
  https://en.wikipedia.org/wiki/Windows-1252
  (the special characters are in the cells with thick green borders)

From Wikipedia:

  It is very common to mislabel Windows-1252 text with the charset label
  ISO-8859-1. A common result was that all the quotes and apostrophes
  (produced by "smart quotes" in Microsoft software) were replaced with
  question marks or boxes on non-Windows operating systems, making text
  difficult to read. Most modern web browsers and e-mail clients treat the
  MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such
  mislabeling. This is now standard behavior in the draft HTML 5
  specification, which requires that documents advertised as ISO-8859-1
  actually be parsed with the Windows-1252 encoding.

So the PHP conversion functions actually follow the standard, but the text
sent by the browsers is not completely standard.

To convert these characters, maybe our automatic conversion from ISO-8859-1
to UTF-8 should do the same: treat the page text as Windows-1252. Indeed, if
the text contains bytes at these code points, those bytes can only be
Windows-1252-encoded.

Petko
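
To illustrate the idea of treating "ISO-8859-1" page text as Windows-1252, here is a minimal sketch. It is written in Python only for illustration, not in PmWiki's PHP, and the function names `has_c1_bytes` and `to_utf8` are hypothetical. It assumes the input really is single-byte legacy text and not already UTF-8.

```python
def has_c1_bytes(raw: bytes) -> bool:
    """Return True if the text uses any byte in the 0x80-0x9F range."""
    return any(0x80 <= b <= 0x9F for b in raw)


def to_utf8(raw: bytes) -> bytes:
    """Convert legacy single-byte text to UTF-8, reading it as Windows-1252.

    For bytes outside 0x80-0x9F, Windows-1252 and ISO-8859-1 agree, so
    plain ISO-8859-1 text converts the same way. The few bytes that
    Python's cp1252 codec leaves unassigned become U+FFFD.
    """
    return raw.decode("cp1252", errors="replace").encode("utf-8")


# Hypothetical sample: typographical apostrophe, quotes and a dash,
# as a browser might submit them for a page labelled ISO-8859-1.
sample = b"It\x92s \x93quoted\x94 \x96 caf\xe9"

strict = sample.decode("latin-1")            # strict ISO-8859-1 reading
lenient = to_utf8(sample).decode("utf-8")    # Windows-1252 reading
```

With the strict ISO-8859-1 reading, the byte 0x92 ends up as the invisible control character U+0092, which is why the browser shows nothing useful. With the Windows-1252 reading it becomes U+2019, the typographical apostrophe. In PHP the analogous change would presumably be to pass Windows-1252 as the source charset to the conversion function, but that is untested here.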
