IIRC, HTML lets you omit the semicolon at the end of an entity under certain circumstances, e.g., when the entity is followed by whitespace. Do some of the entries take advantage(?!) of this? If so, you may want to match (?:;|\b) instead of plain old semicolon.
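For what it's worth, here is a minimal sketch of how a (?:;|\b) terminator could slot into a single substitution. It is not the code from the original post: the name escape_bare_amps, the $loose flag, and the five-entry table standing in for the full entity list are all made up for illustration. Like the original, it lowercases the name before the lookup.

```perl
use strict;
use warnings;

# Assumption: a short stand-in for the full table of named entities.
my %html_entities = map { $_ => 1 } qw(quot amp lt gt nbsp);

# Escape every '&' that does not start a numeric reference or a known
# named entity. With $loose true, a named entity may end at a word
# boundary instead of a semicolon, i.e. (?:;|\b).
sub escape_bare_amps {
    my ($string, $loose) = @_;
    my $end = $loose ? qr/(?:;|\b)/ : qr/;/;

    $string =~ s{&(#\d+;|\w+$end)?}{
        my $tail = defined $1 ? $1 : '';        # text following the '&', if any
        (my $name = $tail) =~ s/;\z//;          # entity name without the ';'
        my $valid = $tail =~ /^#/               # numeric reference
                 || ($tail ne '' && $html_entities{lc $name});
        ($valid ? '&' : '&amp;') . $tail;       # keep valid ones, escape the rest
    }ge;

    return $string;
}
```

Called as escape_bare_amps('M&M Electric'), this should give 'M&amp;M Electric'. The loose form cuts both ways: escape_bare_amps('fish &amp chips', 1) leaves the string alone, while the strict form turns it into 'fish &amp;amp chips', and only the data can say which of those is right.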
I wonder if ambiguities might pop up in your data. "am&pm Convenience Store"?

LP^>

On Wed, Apr 02, 2003 at 10:25:05PM -0600, [EMAIL PROTECTED] wrote:
> We have a database where names and addresses have been 'polluted' by
> partial html escaping. It's a mess, in that some entities are escaped
> (&lt; back ticks and some others) some are sometimes escaped and
> ampersands are sometimes escaped. The chickens come home to roost in the
> java pdf generation process, a name like "M&M Electric" chokes the (yes,
> by hand) parser - it turns it into "M&M; Electric" and complains there's
> no &M; entity defined. Bleah.
>
> It uses htmldoc (http://www.easysw.com/htmldoc/index.html) to turn html
> into a pdf and as it's the java data extract and html generation that
> chokes, I thought I'd try to do it in perl. That way I can still create
> the html and use htmldoc, which seems like a pretty slick package.
>
> Anyway, I got all the named entities (the numbered ones aren't a problem),
> created a hash:
> %html_entities = (
>     "quot" => 1,
>     "amp" => 1,
>     "lt" => 1,
>     "gt" => 1,
>     "nbsp" => 1,
>     ... [ 200 more entities ]
>
> and came up w/:
> sub clean_html
> {
>     my $string = shift;
>     my @ents = split(/&/, $string);
>     my $ent;
>     my $new_str = shift(@ents); # for strings starting w/ an amp
>     foreach $ent ( @ents ) {
>         print "Ent: $ent\n" if $verbose > 3;
>         if ( $ent =~ /^(\w{2,7});/ ) {
>             print STDERR "got a possible ent, $1\n"
>                 if $verbose;
>             my $val = lc($1);
>             if ( $html_entities{$val} ) {
>                 $ent =~ s/^/\&/; # valid, leave alone
>             } else {
>                 print "Nope: $val\n"
>                     if $verbose > 3;
>                 $ent =~ s/^/\&amp;/;
>             } # if html_entity
>
>         } elsif ( $ent =~ /^#\d{3};/ ) {
>             $ent =~ s/^/\&/; # valid, leave alone
>         } else {
>             $ent =~ s/^/\&amp;/;
>         }
>         $new_str .= $ent;
>     }
>     $new_str .= "&amp;" if $string =~ /&$/; # ending amp
>     return $new_str;
>
> } # sub clean_html
>
> The idea was just to fix/replace the regular '&' w/ "&amp;", regular being
> any that don't mark a valid named/numbered entity. I didn't necessarily
> want to golf it but it certainly isn't the most wieldy sort of mess. I
> tried CGI (un)escapeHTML but due to the mix of valid and invalid, as in:
> L&L Electric&quot;s Work
> well, they didn't work.
>
> a
>
> Andy Bach, Sys. Mangler
> Internet: [EMAIL PROTECTED]
> VOICE: (608) 261-5738 FAX 264-5030
>
> "Believe nothing, no matter where you read it or who has said it, not even
> if I have said it, unless it agrees with your own reason and your own
> common sense." Buddha
