At 21:07 +0100 on 16.01.2003, Detlef Lindenthal wrote:
###   What is the best way to search and replace all text outside
###   the HTML tags in an .html document?
###   Would HTML::Parser be of any use for this task?
###   All hints are welcome.

###   Detlef


###   In the following approach I only capitalize
###   all text. (Actually I have to do some kind
###   of spell checking on it.)
###   What exceptions are not covered by this regex?
###   What about speed and efficiency?

$_ = join "", <DATA>;
s,([^<]*)(<.*?>),uc($1).$2,ges;
print

__DATA__
<html><head></head><body>
hallo world
</body></html>


###   This produces:

<html><head></head><body>
HALLO WORLD
</body></html>
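
On the question of what the regex does not cover: a quick way to see it is to run the same substitution over a few awkward inputs. The sketch below wraps your s,,, in a sub (upcase_outside_tags is just a name I made up) and feeds it three invented test strings; it is meant as an illustration of the failure modes, not as a complete list.

```perl
use strict;
use warnings;

# The substitution from the question, wrapped in a sub so it can be
# applied to several inputs. The sub name is hypothetical.
sub upcase_outside_tags {
    my ($html) = @_;
    $html =~ s,([^<]*)(<.*?>),uc($1).$2,ges;
    return $html;
}

# Made-up inputs, one per problem case.
my @cases = (
    '<p>one</p> two',               # text after the last tag
    '<a title="a > b">x</a>',       # ">" inside an attribute value
    '<script>var x = 1;</script>',  # script content is not prose
);
print upcase_outside_tags($_), "\n" for @cases;
```

Text after the last tag is never matched because the pattern requires a following tag, a ">" inside an attribute value or a comment ends the "tag" too early, and script or style content is treated like ordinary text.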

I'd use HTML::TokeParser for this task:

use strict;
use warnings;
use HTML::TokeParser;

my $p = HTML::TokeParser->new("index.html") or die "Can't open: $!";
while (my $token = $p->get_token) {
    my $tokentype = $token->[0];
    if ($tokentype eq 'T') {        # 'T' marks a text token
        my $text = $token->[1];     # the text itself
        print "Text: -$text- \n";
    }
}

This produces (for your data saved as file "index.html"):

Text: -
hallo world
-

Note that newlines are part of the text.
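
To get from printing the text to actually replacing it, one option is to rebuild the document token by token: transform the 'T' tokens and copy everything else through as its original source text. What follows is only a sketch. transform_text is a made-up helper name, and the field positions assume the token layout documented for HTML::TokeParser (source text in [4] for 'S', in [2] for 'E' and 'PI', in [1] for 'T', 'C' and 'D'). Text inside <script> and <style> is flagged in $token->[2] and left alone here.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns $html with $transform applied to every
# text token; tags, comments and declarations are copied unchanged.
sub transform_text {
    my ($html, $transform) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Can't parse document";
    my $out = '';
    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'T') {
            # $token->[2] is true for script/style content: keep it as is.
            $out .= $token->[2] ? $token->[1] : $transform->($token->[1]);
        }
        elsif ($type eq 'S') {
            $out .= $token->[4];    # original text of the start tag
        }
        elsif ($type eq 'E' or $type eq 'PI') {
            $out .= $token->[2];    # original text of end tag / PI
        }
        else {
            $out .= $token->[1];    # 'C' (comment) and 'D' (declaration)
        }
    }
    return $out;
}

my $html = "<html><head></head><body>\nhallo world\n</body></html>\n";
print transform_text($html, sub { uc $_[0] });
```

Text tokens are not entity-decoded, so a transform like uc would also turn &amp; into &AMP;. A real spell checker would need to skip or decode entities before touching the text.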


HTH,
Thomas.
