Charles,

Thank you so much. It works like a charm, and upon looking at the code, it
seems so easy to declare beginnings and endings. I have been looking over
the man pages for HTML::TokeParser and confess to not knowing a lot about
hashes like %rss. I am trying to figure out how to add the domain name
before each article that is listed as a link in the RSS file.

For example, in the RSS file that is generated, the link is listed
(accurately, as it appears on the web page HTML::TokeParser pulls from) as:

<link>9874341.htm</link>

But to work correctly it really needs to be
http://www.$pub.com/mld/$pub/rss/link.htm...
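
A rough sketch of the transformation being asked for, assuming the base URL
follows the pattern above. The absolute_link() helper is hypothetical (it is
not part of XML::RSS or HTML::TokeParser) and only joins strings:

```perl
use strict;
use warnings;

# Hypothetical helper: turn the bare file name found in the page
# into a full URL for the given publication.
sub absolute_link {
    my ( $pub, $href ) = @_;

    # Leave the link alone if it already carries a scheme.
    return $href if $href =~ m{^https?://}i;

    # Assumed base, taken from the pattern described above.
    return "http://www.$pub.com/mld/$pub/rss/$href";
}

print absolute_link( 'bradenton', '9874341.htm' ), "\n";
# prints http://www.bradenton.com/mld/bradenton/rss/9874341.htm
```

In the script quoted below, the same effect could probably be had by
building the link as $url . $token->[1]{href} when the item is created.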

Any advice, suggestions appreciated. You've already helped more than I
expected. Thanks so much.

Gary

-----Original Message-----
From: Charles K. Clarkson [mailto:[EMAIL PROTECTED]]
Sent: Saturday, October 09, 2004 1:24 PM
To: 'Gary Nielson'; [EMAIL PROTECTED]
Subject: RE: :RSS and description text

Gary Nielson <[EMAIL PROTECTED]> wrote:

: Clearly, the time to look for this is when going through
: the html line by line also looking for the headline and
: link,

    Actually, (IMO), it is clear that most markup cannot
be easily parsed with a simple line by line algorithm.
Using HTML::TokeParser makes more sense (to me).


: but I'm having trouble finding examples of 1) how
: XML::RSS looks for descriptions and 2) how to include
: them in $rss->add_item. Any help much appreciated.

    XML::RSS doesn't "look" for descriptions. It is used to
build an RSS file to be used by an aggregator. You have found
two examples which are named RSS, but which are not RSS pages.
It looks like you are extracting information from the HTML to
create the RSS files yourself.

    In the code example you use something like this.

 $rss->add_item(
     title => $title,
     link  => "$url$1",
 );

    To add a description, we would use this.

 $rss->add_item(
     title       => $2,
     link        => "$url$1",
     description => 'whatever',
 );

    In the code below I use %rss to hold values and then
feed it to add_item().
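
    For anyone new to hashes, here is a minimal sketch of why that
works: a hash passed to a sub is flattened into the same key/value list
you would otherwise type by hand. The show_item() sub is a hypothetical
stand-in for add_item(), and the values are made-up sample data.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $rss->add_item(), which takes key => value pairs.
sub show_item {
    my %item = @_;    # the flat list is read back as key/value pairs
    return join ', ', map { "$_=$item{$_}" } sort keys %item;
}

# Made-up sample data.
my %rss = (
    title => 'Sample headline',
    link  => '9874341.htm',
);

# Keys can be added after the hash is created.
$rss{description} = 'Sample description';

# Passing %rss hands the sub the same list as writing the pairs out by hand.
print show_item( %rss ), "\n";
# prints description=Sample description, link=9874341.htm, title=Sample headline
```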


[snipped helpful example]

    Here's the test I did. I assumed that we would always
want the text between the next two br tags.

    I didn't try this the first time. I first tried using
the span tags. Using this module made that a few keystrokes,
while using regexes would have taken considerably more time
to change.

    Note that we no longer store the file locally. We
don't really want that. The RSS file is enough. You'll
need to change this to create the RSS file instead of
printing it.
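
    One way that change might look, sketched with core Perl only.
The write_feed() helper and the "<pub>.rss" file name are assumptions
for illustration, and the placeholder string stands in for whatever
$rss->as_string returns.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );

# Hypothetical helper: write one publication's feed to "<dir>/<pub>.rss".
# $xml stands in for the string returned by $rss->as_string.
sub write_feed {
    my ( $dir, $pub, $xml ) = @_;
    my $file = "$dir/$pub.rss";

    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} $xml;
    close $fh or die "Cannot close $file: $!";

    return $file;
}

# A temporary directory keeps this example from leaving files behind.
my $dir  = tempdir( CLEANUP => 1 );
my $file = write_feed( $dir, 'bradenton', "<rss>placeholder</rss>\n" );
print "Wrote $file\n";
```

    XML::RSS also documents a save() method that takes a file name,
which may be the shorter route.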


#!/usr/bin/perl

use strict;
use warnings;

use XML::RSS;
use LWP::Simple;
use HTML::TokeParser;

my @pub = ( 'bradenton', 'centredaily' );

foreach my $pub ( @pub ) {

    print "Processing publication: $pub....\n";
    my $url = "http://www.$pub.com/mld/$pub/rss/";

    # create an rss object.
    my $rss = XML::RSS->new( version => '0.91' );

    $rss->channel(
        title => "$pub.com",
        link  => "http://www.$pub.com/",
    );

    print "Getting content from url: $url\n";
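    # LWP::Simple's get() returns undef on failure and does not set $!
    my $lines = get( $url );
    die "Could not fetch $url\n" unless defined $lines;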

    my $page = HTML::TokeParser->new( \$lines );
    while ( my $token = $page->get_tag( 'a' ) ) {

        # check for correct link
        next unless
                exists $token->[1]{class}
                   and $token->[1]{class} eq 'digest-headline';

        # link and title
        my %rss = (
            link    => $token->[1]{href},
            title   => $page->get_trimmed_text( '/a' ),
        );

        # description
        $page->get_tag( 'br' );
        $rss{description} = $page->get_trimmed_text( 'br' );

        $rss->add_item( %rss );
    }
    print $rss->as_string, "\n";
}

__END__


HTH,

Charles K. Clarkson
-- 
Mobile Homes Specialist
254 968-8328

