> I wrote to Newsblaster and received this reply:
[snip]
This is great, exactly what we need! (and I don't have to add it to
my ToDo list to send off an email to them myself =). Keep it up!
> So, maybe they'll get the PDA-friendly version started up again!
Incidentally, I found another site, cnsnews.com, which has news
summaries of the top stories going on in the world. They have two versions
of their pages, one for cell phones and one for AvantGo users:
http://www.cnsnews.com/cell.asp # cellular, no images
http://www.cnsnews.com/avantgo.asp # PDA, minor images
I fired off my perl spider at it, and noticed that it dies quickly
while parsing the URLs. Further inspection reveals that they're actually
using _backslashes_ in their relative links to the story details. Sigh. One
quick regex solved that in my case, but I'm not sure how to do that in the
Python (or JPluck) code at this point.
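For the Python side, something like this might work (untested sketch; the function name is made up, and I'm assuming you can get at the raw page text before link extraction -- it only touches href/src attribute values so it won't clobber backslashes elsewhere in the page):

```python
import re

def fix_backslash_urls(html):
    """Replace backslashes with forward slashes inside href/src
    attribute values, leaving the rest of the page untouched."""
    def repl(m):
        return m.group(1) + m.group(2).replace("\\", "/") + m.group(3)
    return re.sub(r'((?:href|src)=")([^"]*)(")', repl, html,
                  flags=re.IGNORECASE)
```

That's roughly the Python equivalent of my one-line Perl fix, just scoped a little tighter.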
Also, if you look closely, their top article is in one font face
and the remaining articles are in another, because they either added too
many font tags to the lower stories or forgot to add one to the top story.
CSS would fix the inconsistency, shrink the page size (no
<font face="Foo" size="2">Blah</font> tags at every paragraph), and make
the markup more portable. Since these pages are targeted at PDA and cell
devices, using font faces here makes no sense at all; they should be
removed entirely (I do that with some other perl code as well, prior to
storing the content and parsing it for Plucker, with the code below).
$content =~ s,\\,/,sg;          # turn invalid \ into / in URLs

# Drop all the 'face="whatever"' attributes and values from
# the <font></font> tags, leaving just <font> with optional
# color and size attributes intact, so we can parse them later
use HTML::TokeParser;
my $p = HTML::TokeParser->new(\$content);

# index of the raw-text element in each get_token() token type
my %verb = ( S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2 );

my $nff_content = '';
while ( my $t = $p->get_token ) {
    if ( $t->[0] eq 'S' and $t->[1] eq 'font' ) {
        my $attr = $t->[2];
        delete $attr->{face};
        my $attributes = join( " ",
            map { qq{$_="$attr->{$_}"} } keys %$attr );
        $nff_content .= $attributes ? "<font $attributes>" : "<font>";
    }
    else {
        $nff_content .= $t->[ $verb{ $t->[0] } ];
    }
}
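Since I said I don't know how to do this in the Python code: here's a rough sketch of the same face-stripping idea using Python's standard html.parser (my own guess at how it might look, not anything from the actual Plucker/JPluck source -- class and function names are made up, and it drops doctypes):

```python
from html.parser import HTMLParser

class FontFaceStripper(HTMLParser):
    """Rebuild the document, dropping face="..." from <font> tags
    but keeping color/size so they can be parsed later."""
    def __init__(self):
        # convert_charrefs=False so entities pass through unchanged
        super().__init__(convert_charrefs=False)
        self.out = []
    def handle_starttag(self, tag, attrs):
        if tag == "font":
            attrs = [(k, v) for k, v in attrs if k != "face"]
        self.out.append("<%s%s>" %
                        (tag, "".join(' %s="%s"' % kv for kv in attrs)))
    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
    def handle_data(self, data):
        self.out.append(data)
    def handle_entityref(self, name):
        self.out.append("&%s;" % name)
    def handle_charref(self, name):
        self.out.append("&#%s;" % name)
    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

def strip_font_faces(html):
    p = FontFaceStripper()
    p.feed(html)
    p.close()
    return "".join(p.out)
```

Same idea as the Perl token loop above: pass everything through verbatim, except <font> start tags, which get rebuilt without their face attribute.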
As long as we have sites that continue to just make up their own HTML
standards, instead of using the oft-published ones, we'll always be dealing
with this mess of unscrewing bad HTML prior to parsing it. I'm almost
tempted to make an entire perl module to do it, and I'm pretty close as it
is with the quick 100 lines of "ScrubHTML" I whipped up.
d.
_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list