On Wed, Apr 02, 2003 at 10:25:05PM -0600, [EMAIL PROTECTED] wrote:
> Anyway, I got all the named entities (the numbered ones aren't a problem),
> created a hash:
> %html_entities = (
> "quot" => 1,
> "amp" => 1,
> "lt" => 1,
> "gt" => 1,
> "nbsp" => 1,
> ... [ 200 more entities ]
>
> and came up w/:
> sub clean_html
> {
> my $string = shift;
> my @ents = split(/&/, $string);
The use of split seems an odd choice. I would try doing this as a
substitution with /e. Then, using a lookahead, all that needs to be
replaced is the ampersand. Here's one way to do it:
    s/&(?=(\w{2,7});|#(\d{3});|)/
        my $replace;
        if ($1 and exists $html_entities{$1}) {
            # named entity: leave the ampersand alone
            $replace = '&';
        } elsif ($2) {
            # numeric entity: leave the ampersand alone
            $replace = '&';
        } else {
            # not an entity: escape the bare ampersand
            $replace = '&amp;';
        }
        $replace;
    /ge;
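
To try the lookahead substitution on its own, here is a minimal, self-contained sketch. The five-entry hash is only a stand-in for the full entity table, and the sub name escape_bare_ampersands and the sample string are made up for illustration. It assumes the goal is to turn bare ampersands into &amp; while leaving real entities untouched.

```perl
use strict;
use warnings;

# Assumption: a tiny stand-in for the full table of named entities.
my %html_entities = map { $_ => 1 } qw(quot amp lt gt nbsp);

# Hypothetical helper name, not from the original post.
# Only the ampersand itself is matched; the lookahead merely inspects
# what follows it, so the rest of the string is never rewritten.
sub escape_bare_ampersands {
    my ($string) = @_;
    $string =~ s{&(?=(\w{2,7});|#(\d{3});|)}{
        (defined $1 && exists $html_entities{$1}) ? '&'
      : defined $2                                 ? '&'
      :                                              '&amp;'
    }ge;
    return $string;
}

print escape_bare_ampersands('AT&T &lt; &#169; &bogus; &amp;'), "\n";
# prints: AT&amp;T &lt; &#169; &amp;bogus; &amp;
```

The empty last alternative in the lookahead means the pattern matches at every ampersand, so the decision about what to emit is made entirely in the replacement expression.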
Ronald
P.S. If you wanted to golf it, perhaps something like this, with
%html_entities renamed to %h:
$"='|';s/&([EMAIL PROTECTED];|#\d{3};)/&amp;/g;
Of course, @h would be even better in this case. :)
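
For anyone unfamiliar with the $" trick, here is a rough, un-golfed sketch of the idea. It is not a reconstruction of the one-liner above, part of which is obscured in the archive. It assumes the entity names live in an array @h (hypothetical here) and uses a negative lookahead, so the substitution only fires on ampersands that do not start a known entity.

```perl
use strict;
use warnings;

# Assumption: @h is a hypothetical array holding the entity names.
my @h = qw(quot amp lt gt nbsp);

my $text = 'AT&T &lt; &#169; &bogus; &amp;';
{
    # $" is the separator used when an array is interpolated into a
    # string or pattern; setting it to | turns @h into an alternation.
    local $" = '|';
    $text =~ s/&(?!(?:@h);|#\d{3};)/&amp;/g;
}
print "$text\n";
# prints: AT&amp;T &lt; &#169; &amp;bogus; &amp;
```

With $" set to |, the pattern that Perl actually compiles contains (?:quot|amp|lt|gt|nbsp), so no /e and no hash lookup are needed.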