Re: HTML::Entities and WinLatin1 NCRs [PATCH]

Gisle Aas Tue, 07 Mar 2006 04:53:45 -0800

Chris Darroch <[EMAIL PROTECTED]> writes:

>    I use the HTML::Entities module quite a bit and have really
> appreciated its support for Unicode characters > 256 with Perl 5.8.
> 
>    I do have one particular issue that crops up for me, and I thought
> it might affect others as well, so I'm including a crude set of
> patches with my "fix".  In short, I have to support HTML documents
> authored by a wide variety of people, and over time they've
> accumulated numeric character references to the troublesome set
> of characters between 128 and 159, mostly due to authors working
> on Windows platforms.  The same documents now may also have
> character references to the Unicode code points for those characters.
> 
>    Here's a simple example: "two &#151; em &#8212; dashes".
> 
>    Now, in my particular situation, I sometimes want to decode
> these entities to the same code point, so that, for example, I can
> match strings against each other.  At first I thought I might
> get away with this:
> 
> $a = Encode::encode('utf8', $a);  # force no utf8 flag
> HTML::Entities::decode_entities($a);
> $a = Encode::decode('cp1252', $a) unless (Encode::is_utf8($a));
> 
>    But while that will turn "&#151;" into U+2014, it turns
> "&#151;&#8212;" into U+0097 U+2014, which doesn't help.
> 
>    So, I whacked into place a decode_entities_cp1252() function
> that decodes any numeric character references in the 128-159
> range (except for a couple of undefined ones) to the UTF-8
> equivalents.  I'm positive there are nicer, more elegant, and
> probably more flexible ways to do this, but lacking additional
> time to experiment, this is where I stopped.


To me it feels wrong to add such a kludge to HTML::Entities.  It just
seems to be the wrong level to do such manipulations.  I would suggest
that you just post-process the string that decode_entities() returns
to fix up the Windows mess using tr///, for example:

    sub cp1252_fixup {
        # replaces the additional WinLatin-1 chars in the 0x80 - 0x9F range
        # with the corresponding Unicode character
        my $str = shift;
        $str =~ tr/\x80-\x9f/\x{20AC}\x{FFFD}\x{201A}\x{192}\x{201E}\x{2026}\x{2020}\x{2021}\x{2C6}\x{2030}\x{160}\x{2039}\x{152}\x{FFFD}\x{17D}\x{FFFD}\x{FFFD}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{2DC}\x{2122}\x{161}\x{203A}\x{153}\x{FFFD}\x{17E}\x{178}/;
        $str;
    }
    
    
    my $str = "Here's a simple example: two &#151; em &#8212; dashes";
    
    use HTML::Entities;
    $str = cp1252_fixup(decode_entities($str));
    
    use Data::Dump;
    print Data::Dump::dump($str), "\n";

Dan: Would it make sense to make Encode provide something like
cp1252_fixup or is there already a way to do this with Encode?
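
As one possible angle on that question, here is a minimal sketch that
derives the table from Encode's own cp1252 decoder instead of spelling it
out by hand.  The names cp1252_fixup_via_encode and %cp1252_for are
hypothetical and not part of Encode or HTML::Entities.  What Encode
returns for the five undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) is
not checked here and may differ from the U+FFFD used in cp1252_fixup.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical lookup table: each code point in 0x80 - 0x9F maps to
# whatever Encode's cp1252 decoder yields for the byte of the same value.
my %cp1252_for = map {
    my $byte = chr($_);
    ( $byte => Encode::decode('cp1252', $byte) )
} 0x80 .. 0x9f;

# Hypothetical helper: same idea as cp1252_fixup, but table-driven.
sub cp1252_fixup_via_encode {
    my $str = shift;
    $str =~ s/([\x80-\x9f])/$cp1252_for{$1}/g;
    return $str;
}

# The string as it would look after entity decoding: U+0097 and U+2014.
my $fixed = cp1252_fixup_via_encode("two \x{97} em \x{2014} dashes");

# Print the code point of every non-ASCII character that remains.
printf "U+%04X\n", ord($_) for $fixed =~ /([^\x00-\x7f])/g;
```

With the sample string, both dashes should come out as the same code
point, so the two forms compare equal afterwards.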

Regards,
Gisle
