At 21:07 +0100 on 16.01.2003, Detlef Lindenthal wrote:
### What is the best way to search and replace all text outside
### the HTML tags in an .html document?
### Would HTML::Parser be of any use for this task?
### All hints are welcome.
### Detlef
### In the following approach I only capitalize
### all text. (Actually I have to do some kind
### of spell checking on it.)
### What exceptions are not covered by this regex?
### What about speed and efficiency?
$_ = join "", <DATA>;
s,([^<]*)(<.*?>),uc($1).$2,ges;
print;
__DATA__
<html><head></head><body>
hallo world
</body></html>
### This produces:
<html><head></head><body>
HALLO WORLD
</body></html>
I'd use HTML::TokeParser for this task:
require HTML::TokeParser;
my $p = HTML::TokeParser->new("index.html") || die "Can't open: $!";
while (my $token = $p->get_token) {
    my $tokentype = $token->[0];
    # 'T' marks a text token; its text is in the second element
    if ($tokentype eq 'T') {
        my $text = $token->[1];
        print "Text: -$text- \n";
    }
}
This produces (for your data saved as file "index.html"):
Text: -
hallo world
-
Note that newlines are part of the text.
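To get from listing the text to actually replacing it, one option is to rebuild the document token by token. This is only a minimal sketch: rewrite_text and its callback argument are names I made up, and it relies on the documented token layout of HTML::TokeParser, where every non-text token carries its original source text as its last element.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: apply $fix to every text token and copy
# all markup through unchanged.
sub rewrite_text {
    my ($html, $fix) = @_;
    # A reference to a scalar is parsed as the literal document.
    my $p = HTML::TokeParser->new(\$html) or die "Can't parse document";
    my $out = '';
    while (my $token = $p->get_token) {
        if ($token->[0] eq 'T') {
            my ($text, $is_data) = @{$token}[1, 2];
            # $is_data is true for script/style content; leave that alone.
            $out .= $is_data ? $text : $fix->($text);
        }
        else {
            # S, E, C, D and PI tokens: original text is the last element.
            $out .= $token->[-1];
        }
    }
    return $out;
}

print rewrite_text(
    "<html><head></head><body>\nhallo world\n</body></html>\n",
    sub { uc $_[0] },
);
```

For your sample data this should print the same result as your regex. One caveat: as far as I know the text tokens arrive with entities undecoded, so a plain uc would also turn &amp; into &AMP;. A real spell-checking callback would probably need to take care of that.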
HTH,
Thomas.