On Sat, Sep 07, 2002 at 09:05:13PM -0400, Rick Dillon wrote:
> Hello.
>
> I am currently populating html pages with content from MS Excel. I am
> using a Java program that literally places the Excel content directly
> into the output code (which is saved as html). It appears that Excel
> is using Unicode characters, which is causing strange glyphs when the
> html is viewed in a browser. Is there a Perl Way to parse the output
> and replace the Unicode characters with ASCII, or UTF-8 equivalents?
I don't know the answer to this for sure (but my guess from your
description is that Excel is using a 16 bit representation of Unicode,
and your browser expects an 8 bit encoding of some form).

If so, and Excel is only placing Unicode code points in the range 0-255
in your HTML page, then I think something as simple as

  s/\0(.)/$1/mg

in any perl (probably even perl4) would work. But this is a cheap hack,
and likely to break.

If your data from Excel really has Unicode code points above 255, or
may do in the future, then really there's no reliable way to fix your
HTML file once it has a mix of 1 byte and 2 byte characters in it.
Either your Java program should do the conversion to 8 bit (the
encoding to UTF8 is not hard; perl's utf8.h says:

/* The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF    80..BF
   U+0800..U+0FFF       E0        A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000..U+D7FF,
the 90..BF in U+10000..U+3FFFF, and the 80..8F in U+100000..U+10FFFF.
The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings:
it is technically possible to UTF-8-encode a single code point in
different ways, but that is explicitly forbidden, and the shortest
possible encoding should always be used (and that is what Perl does).

*/

and the relevant part of utf8.c for code points between 0x80 and 0x10000:

    if (uv < 0x800) {
        *d++ = (U8)(( uv >>  6)         | 0xc0);
        *d++ = (U8)(( uv        & 0x3f) | 0x80);
        return d;
    }
    if (uv < 0x10000) {
        *d++ = (U8)(( uv >> 12)         | 0xe0);
        *d++ = (U8)(((uv >>  6) & 0x3f) | 0x80);
        *d++ = (U8)(( uv        & 0x3f) | 0x80);
        return d;
    }

) or alternatively your Java program should output the HTML file
entirely in 16 bit, and then use something else (eg perl) to convert
that to UTF8 or whatever your browser likes.

Converting the representation of Unicode from 16 bit UCS-2 to UTF8 is
just byte shuffling, so any perl can do it. Offhand, I don't know if
there are modules on CPAN already to do it, but I'd be surprised if
there were none - try http://search.cpan.org/

> And do I need to upgrade to perl 5.6 to do this?

If you are considering upgrading from something like 5.005, is there
any reason not to consider going straight to 5.8.0? The Unicode support
in 5.8.0 is much better than in 5.6.1, and it also fixes many of the
bugs still present in 5.6.1. (Nothing is perfect - a few new bugs have
been reported in 5.8.0, but generally it does seem stable and of good
quality.)

Nicholas Clark
-- 
Even better than the real thing:   http://nms-cgi.sourceforge.net/
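
P.S. Since you asked for "a Perl Way": here is a rough sketch of the
byte shuffling I mean, reading big-endian UCS-2 on STDIN and writing
UTF-8 on STDOUT. It assumes big-endian input with no byte order mark
and no surrogate pairs (ie nothing above U+FFFF) - those assumptions
are mine, not facts about your data - and it is untested:

#!/usr/bin/perl -w
# Sketch: convert 16 bit big-endian UCS-2 on STDIN to UTF-8 on STDOUT.
# Assumes no BOM and no surrogate pairs (nothing above U+FFFF).
use strict;

binmode STDIN;
binmode STDOUT;

local $/;                          # slurp the whole input
my $ucs2 = <STDIN>;

my $utf8 = '';
for my $uv (unpack 'n*', $ucs2) {  # 'n' = unsigned 16 bit, big-endian
    if ($uv < 0x80) {              # 1 byte: plain ASCII
        $utf8 .= chr $uv;
    }
    elsif ($uv < 0x800) {          # 2 bytes
        $utf8 .= chr(($uv >>  6)         | 0xC0)
               . chr(($uv & 0x3F)        | 0x80);
    }
    else {                         # 3 bytes: the rest below U+10000
        $utf8 .= chr(($uv >> 12)         | 0xE0)
               . chr((($uv >> 6) & 0x3F) | 0x80)
               . chr(($uv & 0x3F)        | 0x80);
    }
}
print $utf8;

A real version would want to look for a leading U+FEFF to detect the
byte order, and decide what to do about surrogates.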
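
P.P.S. If you do go to 5.8.0, the Encode module that ships with it
should make this a one-liner - something like
Encode::from_to($data, "UTF-16BE", "utf8"), if I remember the interface
correctly.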