>> I have a couple of emails that were generated using MS
>> Outlook which contain some html entities like smart quotes
>> and the funny "-" character which just appear as "?"
>> characters in the archive.

> MS Outlook has a nasty habit of mislabeling the charset of its
> messages with iso-8859-1 instead of MS's extension to it that
> contain the characters being used.
=v= Some older versions send out mail that doesn't specify a charset,
so many apps assume the text is ASCII (which is how the standard
works), though of course it's Windows-1252.

=v= Those particular characters in Windows-1252 violate charset
standards anyway. Even worse, MS products such as Outlook and Word
insert these standard-violating "smart quotes" in the wrong places.
Sometimes they're backwards (i.e. a quote will start with a "curly
close quote" and end with a "curly open quote"), and usually an
apostrophe is turned into a "curly single close quote," which is
just wrong.

=v= Someone wrote a routine that looks for these encodings and turns
them into ASCII equivalents. You lose some fanciness, but what good
is fanciness when it's just wrong? This has a much higher probability
of turning out correctly than translating them into iso-8859-1 or
UTF-8 (or even HTML entities). The code is called "demoroniser" and
is available in Perl:

    http://www.fourmilab.ch/webtools/demoroniser/

It has been widely ported. For example, it's in CPAN's TextToHTML
Perl module and is part of Macromedia's ColdFusion web product.

    <_Jym_>
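
For illustration only, here is a minimal sketch of the kind of
substitution described above. It is written in Python rather than
Perl and is not taken from the demoroniser source; the function name
asciify_cp1252 and the replacement table are assumptions made up for
this example. It treats the input as raw bytes, swaps the
Windows-1252 punctuation in the 0x80-0x9F range for ASCII stand-ins,
and passes everything else through as iso-8859-1.

```python
# Hypothetical replacement table (an assumption for this sketch, not
# the demoroniser's own list): Windows-1252 byte -> ASCII stand-in.
CP1252_TO_ASCII = {
    0x82: ",",    # single low quote
    0x84: ",,",   # double low quote
    0x85: "...",  # ellipsis
    0x91: "'",    # curly single open quote
    0x92: "'",    # curly single close quote / apostrophe
    0x93: '"',    # curly double open quote
    0x94: '"',    # curly double close quote
    0x95: "*",    # bullet
    0x96: "-",    # en dash
    0x97: "--",   # em dash
}


def asciify_cp1252(raw: bytes) -> str:
    """Replace Windows-1252 'smart' punctuation with ASCII stand-ins."""
    out = []
    for byte in raw:
        if byte in CP1252_TO_ASCII:
            out.append(CP1252_TO_ASCII[byte])
        elif 0x80 <= byte <= 0x9F:
            # Windows-1252 extension byte with no entry in the table.
            out.append("?")
        else:
            # ASCII, or a byte that means the same thing in iso-8859-1.
            out.append(chr(byte))
    return "".join(out)


if __name__ == "__main__":
    sample = b"\x93It\x92s fine\x94 \x96 mostly\x85"
    print(asciify_cp1252(sample))
```

Working on bytes rather than decoded text sidesteps the mislabeled
charset entirely: the routine never has to trust the message header.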
