Chris Darroch <[EMAIL PROTECTED]> writes:

> I use the HTML::Entities module quite a bit and have really
> appreciated its support for Unicode characters > 256 with Perl 5.8.
>
> I do have one particular issue that crops up for me, and I thought
> it might affect others as well, so I'm including a crude set of
> patches with my "fix".  In short, I have to support HTML documents
> authored by a wide variety of people, and over time they've
> accumulated numeric character references to the troublesome set
> of characters between 128 and 159, mostly due to authors working
> on Windows platforms.  The same documents now may also have
> character references to the Unicode code points for those characters.
>
> Here's a simple example: "two &#151; em &#8212; dashes".
>
> Now, in my particular situation, I sometimes want to decode
> these entities to the same code point, so that, for example, I can
> match strings against each other.  At first I thought I might
> get away with this:
>
>   $a = Encode::encode('utf8', $a);   # force no utf8 flag
>   HTML::Entities::decode_entities($a);
>   $a = Encode::decode('cp1252', $a) unless (Encode::is_utf8($a));
>
> But while that will turn "&#151;" into U+2014, it turns
> "&#151;&#8212;" into U+0097 U+2014, which doesn't help.
>
> So, I whacked into place a decode_entities_cp1252() function
> that decodes any numeric character references in the 128-159
> range (except for a couple of undefined ones) to the UTF-8
> equivalents.  I'm positive there are nicer, more elegant, and
> probably more flexible ways to do this, but lacking additional
> time to experiment, this is where I stopped.

To me it feels wrong to add such a kludge to HTML::Entities.  It just
seems to be the wrong level to do such manipulations.  I would suggest
that you just post-process the string that decode_entities() returns
to fixup the Windows mess using tr///; example:

  sub cp1252_fixup {
      # replaces the additional WinLatin-1 chars in the 0x80 - 0x9F range
      # with the corresponding Unicode character
      my $str = shift;
      $str =~ tr/\x80-\x9f/\x{20AC}\x{FFFD}\x{201A}\x{192}\x{201E}\x{2026}\x{2020}\x{2021}\x{2C6}\x{2030}\x{160}\x{2039}\x{152}\x{FFFD}\x{17D}\x{FFFD}\x{FFFD}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{2DC}\x{2122}\x{161}\x{203A}\x{153}\x{FFFD}\x{17E}\x{178}/;
      $str;
  }

  my $str = "Here's a simple example: two &#151; em &#8212; dashes";
  use HTML::Entities;
  $str = cp1252_fixup(HTML::Entities::decode($str));

  use Data::Dump;
  print Data::Dump::dump($str), "\n";

Dan: Would it make sense to make Encode provide something like
cp1252_fixup or is there already a way to do this with Encode?

Regards,
Gisle
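
On the closing question about Encode: the following is only a minimal
sketch, not something Encode provides under this name.  It assumes the
cp1252 table that ships with Encode, and it uses a hypothetical helper,
cp1252_fixup_via_encode(), that re-decodes each stray character
individually.  Unlike the tr/// version, it leaves the five positions
that cp1252 does not define (0x81, 0x8D, 0x8F, 0x90, 0x9D) untouched
instead of mapping them to U+FFFD.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode or HTML::Entities).
# Each character in the 0x80 - 0x9F range that cp1252 defines is turned
# back into a single byte and decoded as cp1252; all other characters,
# including the five undefined positions, are left as they are.
sub cp1252_fixup_via_encode {
    my $str = shift;
    $str =~ s{([\x80\x82-\x8c\x8e\x91-\x9c\x9e\x9f])}
             {Encode::decode('cp1252', pack('C', ord($1)))}ge;
    return $str;
}

# The string mixes a stray 0x97 with a real U+2014, as in the example
# from the original report after decode_entities() has run.
my $fixed = cp1252_fixup_via_encode("two \x{97} em \x{2014} dashes");
printf "U+%04X U+%04X\n",
    ord(substr($fixed, 4, 1)), ord(substr($fixed, 9, 1));
# prints "U+2014 U+2014"
```

The per-character decode is slower than a single tr/// pass, so this is
mainly useful if you would rather reuse Encode's table than maintain
the mapping by hand.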