We have a database where names and addresses have been 'polluted' by partial HTML escaping. It's a mess: some characters are always escaped (<, back ticks and a few others), some are only sometimes escaped, and ampersands are sometimes escaped. The chickens come home to roost in the Java PDF generation process, where a name like "M&M Electric" chokes the (yes, by hand) parser - it turns it into "M&M; Electric" and complains there's no &M; entity defined. Bleah.
It uses htmldoc (http://www.easysw.com/htmldoc/index.html) to turn HTML into a PDF, and as it's the Java data extract and HTML generation that chokes, I thought I'd try to do it in Perl. That way I can still create the HTML and use htmldoc, which seems like a pretty slick package.

Anyway, I got all the named entities (the numbered ones aren't a problem), created a hash:

    %html_entities = (
        "quot" => 1,
        "amp"  => 1,
        "lt"   => 1,
        "gt"   => 1,
        "nbsp" => 1,
        ... [ 200 more entities ]

and came up w/:

    sub clean_html {
        my $string = shift;
        my @ents = split(/&/, $string);
        my $ent;
        my $new_str = shift(@ents);    # for strings starting w/ an amp

        foreach $ent ( @ents ) {
            print "Ent: $ent\n" if $verbose > 3;
            if ( $ent =~ /^(\w{2,7});/ ) {
                print STDERR "got a possible ent, $1\n" if $verbose;
                my $val = lc($1);
                if ( $html_entities{$val} ) {
                    $ent =~ s/^/\&/;        # valid, leave alone
                } else {
                    print "Nope: $val\n" if $verbose > 3;
                    $ent =~ s/^/\&amp;/;    # unknown name, escape the amp
                }    # if html_entity
            } elsif ( $ent =~ /^#\d+;/ ) {  # test $ent, not $_; any decimal ref
                $ent =~ s/^/\&/;            # valid, leave alone
            } else {
                $ent =~ s/^/\&amp;/;        # bare amp, escape it
            }
            $new_str .= $ent;
        }
        $new_str .= "&amp;" if $string =~ /&$/;    # ending amp
        return $new_str;
    }    # sub clean_html

The idea was just to fix/replace the regular '&' w/ "&amp;", regular being any that don't mark a valid named/numbered entity. I didn't necessarily want to golf it, but it certainly isn't the most wieldy sort of mess. I tried CGI's escapeHTML/unescapeHTML, but due to the mix of valid and invalid entities, as in:

    L&L Electric&quot;s Work

well, they didn't work.

a

Andy Bach, Sys. Mangler
Internet: [EMAIL PROTECTED]
VOICE: (608) 261-5738  FAX 264-5030

"Believe nothing, no matter where you read it or who has said it, not even if I have said it, unless it agrees with your own reason and your own common sense." Buddha
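
For comparison, the same idea (escape any '&' that doesn't start a known named or numeric entity) can probably be done in one substitution instead of the split/rejoin loop. What follows is only a sketch under some assumptions: %html_entities here is a five-entry stand-in for the full table, fix_amp is a made-up helper name, names are lowercased before lookup as in the loop version (strictly speaking entity names are case sensitive), and hex references like &#x26; are treated as valid too.

```perl
use strict;
use warnings;

# Stand-in for the real 200+ entry table (assumption: same shape, name => 1).
my %html_entities = map { $_ => 1 } qw(quot amp lt gt nbsp);

# Hypothetical helper: decide what one '&' (plus optional "name;") becomes.
# $ref is the text between '&' and ';', or undef for a bare ampersand.
sub fix_amp {
    my ($ref) = @_;
    return '&amp;'  unless defined $ref;          # bare '&'
    return "&$ref;" if $ref =~ /^#/;              # numeric reference, keep
    return "&$ref;" if $html_entities{ lc $ref }; # known named entity, keep
    return "&amp;$ref;";                          # unknown name, escape the '&'
}

sub clean_html {
    my ($string) = @_;
    # Match every '&', optionally followed by something shaped like
    # an entity body and a ';'. If the optional part fails, $1 is undef.
    $string =~ s/&(?:(#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)?/fix_amp($1)/ge;
    return $string;
}

print clean_html('M&M Electric'), "\n";              # M&amp;M Electric
print clean_html('L&L Electric&quot;s Work'), "\n";  # L&amp;L Electric&quot;s Work
```

Leading and trailing ampersands need no special casing here, since each '&' is handled where it is found rather than being used as a split delimiter.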
