On Wed, Apr 02, 2003 at 10:25:05PM -0600, [EMAIL PROTECTED] wrote:

> Anyway, I got all the named entities (the numbered ones aren't a problem), 
> created a hash:
> %html_entities = (
> "quot" => 1,
> "amp" => 1,
> "lt" => 1,
> "gt" => 1,
> "nbsp" => 1,
> ... [ 200 more entities ]
> 
> and came up w/:
> sub clean_html
> {
>   my $string = shift;
>   my @ents = split(/&/, $string);

The use of split seems an odd choice.  I would try doing this as a
substitution with /e.  Then, using a lookahead, all that needs to be
replaced is the ampersand.  Here's one way to do it:

s/&(?=(\w{2,7});|#(\d{3});|)/
  my $replace;
  if ($1 and exists $html_entities{$1}) {
    # named entity
    $replace = '&';
  } elsif ($2) {
    # numeric entity
    $replace = '&';
  } else {
    # not an entity
    $replace = '&amp;';
  }
  $replace;
/ge;
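
For anyone who wants to run this, here is a minimal self-contained sketch that wraps the substitution in a sub. Some assumptions: the sub name clean_html is borrowed from the quoted code, the five-entry hash stands in for the full table, and the '&amp;' in the last branch is a guess at the intent (escape only those ampersands that do not start an entity). The defined() checks are added only to keep "use warnings" quiet.

```perl
use strict;
use warnings;

# Stand-in for the full entity table (the real one has ~200 more entries).
my %html_entities = map { $_ => 1 } qw(quot amp lt gt nbsp);

# Escape every ampersand that does not begin a known named entity
# or a three-digit numeric entity.
sub clean_html {
    my $string = shift;
    $string =~ s{&(?=(\w{2,7});|#(\d{3});|)}{
        my $replace;
        if (defined $1 and exists $html_entities{$1}) {
            $replace = '&';        # known named entity, leave it alone
        } elsif (defined $2) {
            $replace = '&';        # numeric entity, leave it alone
        } else {
            $replace = '&amp;';    # bare ampersand, escape it
        }
        $replace;
    }ge;
    return $string;
}

print clean_html('AT&T &lt; R&amp;D &#169; &bogus;'), "\n";
# prints: AT&amp;T &lt; R&amp;D &#169; &amp;bogus;
```

Note that \d{3} only recognizes numeric entities with exactly three digits, so something like &#8212; would be escaped by this sketch as well.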


Ronald

P.S.  If you wanted to golf it, perhaps something like this, with
%html_entities renamed to %h:

$"='|';s/&([EMAIL PROTECTED];|#\d{3};)/&amp;/g;

Of course, @h would be even better in this case.  :)
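
The golfed line is damaged in the archive, so the following is not a reconstruction of it. It is only a sketch of the general trick it appears to rely on: with $" (the list separator) set to '|', an array interpolated into a regex becomes an alternation. The array @h, its contents, and the negative lookahead are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical array of entity names, standing in for @h.
my @h = qw(quot amp lt gt nbsp);

my $text = 'AT&T &lt; R&amp;D &#169; &bogus;';
{
    local $" = '|';   # list separator used when @h is interpolated
    # Negative lookahead: match only an ampersand that is NOT followed
    # by a known entity name or a three-digit numeric entity.
    $text =~ s/&(?!(?:@h);|#\d{3};)/&amp;/g;
}
print "$text\n";
# prints: AT&amp;T &lt; R&amp;D &#169; &amp;bogus;
```

With a hash instead of an array, the names would first have to be pulled out with keys, which is presumably why @h would be shorter.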
