Just an FYI:
Note that get returns the page as a single string, so it can be assigned directly to a scalar.
$req = HTTP::Request->new(GET => 'http://www.bradenton.com/mld/bradenton/rss/9837290.htm');
That eliminates the need for the join.
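As a rough sketch of that idea (not run against the live site): LWP::Simple's get hands back the whole page as one string, or undef on failure, so the result can go straight into a scalar and be matched there. The helper names fetch_page and first_paragraph are made up for illustration, and the sample markup is a cut-down stand-in for the real page.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch a URL into a single scalar.
# LWP::Simple::get returns the whole body as one string, or undef on failure.
sub fetch_page {
    my ($url) = @_;
    require LWP::Simple;    # loaded lazily so the rest runs without LWP
    my $html = LWP::Simple::get($url);
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Hypothetical helper: return the markup between the body-content marker
# and the first closing </p>, or undef if there is no match.
sub first_paragraph {
    my ($html) = @_;
    return $html =~ /<!-- begin body-content -->\s*(.*?)<\/p>/is ? $1 : undef;
}

# With a URL on the command line, fetch it; otherwise use a small sample.
my $html = @ARGV
    ? fetch_page($ARGV[0])
    : qq{<span class="body-content"><!-- begin body-content -->\n}
      . qq{<p>First paragraph,\nsplit over two lines.</p>\n<p>Second.</p>\n};

my $para = first_paragraph($html);
print defined $para ? "$para\n" : "No match\n";
```

Run it with the article URL as the only argument to try it on the real page. The capture still contains the opening <p> and any inline tags, which would need stripping afterwards.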
Basil
"$Bill Luebkert" <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
06/10/2004 07:15 PM
Gary Nielson wrote:
> I am trying to get the first paragraph of an article from an html document.
> I am trying to do this by getting the document from the web, using 'join' to
> make many lines one line, and then trying to isolate the text I want. Is
> this workable?
>
> Here's an example of the area of a longer html document that I am trying to
> parse. (The dateline classes do not appear in all articles. I figure I can
> get rid of the remaining tags later in the script.)
>
> </div>
> <span class="body-content"><!-- begin body-content -->
> <p><b><span class="dateline">SARASOTA</span><span
> class="dateline-separator"> - </span></b>As the search for Carlie Brucia
> intensified, hundreds of leads helped sketch the portrait of the suspect who
> authorities say abducted the 11-year-old from a car wash parking lot in
> February.</p>
> <p>According to the 615 pages of tips and leads released by the State
> Attorney's Office on Tuesday
>
> Here's my script, which returns 'no match':
>
> use LWP::Simple;
> my @lines = get( "http://www.bradenton.com/mld/bradenton/rss/9837290.htm" )
> or die $!;
>
> $line = join "", @lines if defined @lines;
> if ($line =~ /<\!-- begin body-content -->(.*)\/p>/i)
> {
> print $1;
> } else
> {
> print 'no match';
> }
>
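Barring the use of an HTML parser:
use strict;
use warnings;
use LWP::Simple;
# get returns undef on failure, so test the result itself.
my @lines = get "http://www.bradenton.com/mld/bradenton/rss/9837290.htm";
die "get failed\n" unless defined $lines[0];
my $line = join '', @lines;
# /s lets . match newlines; .*? stops at the first </p> after the marker.
if ($line =~ /<!-- begin body-content -->(.*?)<\/p>/is) {
print "$1\n";
} else {
print "No match\n";
}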
__END__
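For anyone wondering why the original pattern reported 'no match': without /s, "." does not match a newline, and in the quoted markup the marker comment and the closing </p> sit on different lines. The script above adds /s and a non-greedy .*? to deal with that. The sketch below compares the three variants on sample markup modeled on the snippet in the question (shortened here); the capture helper is made up for illustration.

```perl
use strict;
use warnings;

# Sample markup modeled on the snippet in the original question (shortened).
my $html = <<'HTML';
<span class="body-content"><!-- begin body-content -->
<p><b><span class="dateline">SARASOTA</span></b>As the search
intensified, hundreds of leads helped.</p>
<p>According to the 615 pages of tips</p>
HTML

# Hypothetical helper: try one pattern and return the capture, or undef.
sub capture {
    my ($re) = @_;
    return $html =~ $re ? $1 : undef;
}

my $marker = qr/<!-- begin body-content -->/;

# 1. No /s: "." stops at the newline after the marker, so nothing matches.
my $no_s    = capture(qr/$marker(.*)<\/p>/i);
# 2. /s with greedy ".*": runs on to the LAST </p> in the document.
my $greedy  = capture(qr/$marker(.*)<\/p>/is);
# 3. /s with non-greedy ".*?": stops at the FIRST </p>.
my $minimal = capture(qr/$marker(.*?)<\/p>/is);

printf "without /s : %s\n", defined $no_s ? "match" : "no match";
printf "greedy     : %d characters captured\n", length $greedy;
printf "non-greedy : %d characters captured\n", length $minimal;
```

On a full article the greedy form would swallow everything up to the last </p> on the page, which is why the non-greedy quantifier matters as much as the /s modifier.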
--
,-/- __ _ _ $Bill Luebkert Mailto:[EMAIL PROTECTED]
(_/ / ) // // DBE Collectibles Mailto:[EMAIL PROTECTED]
/ ) /--< o // // Castle of Medieval Myth & Magic http://www.todbe.com/
-/-' /___/_<_</_</_ http://dbecoll.tripod.com/ (My Perl/Lakers stuff)
_______________________________________________
ActivePerl mailing list
[EMAIL PROTECTED]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
