> For *books* I usually edit them manually since often, as you have
> said, one just needs to weed out several screenfuls of JS-code from
> the top and bottom of the page.

One perl regex will strip all of that out: 

$content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;

I use this (and about a dozen other "cleanser" regexes) regularly before
I parse content with my perl spider; it cuts the in-memory footprint of
the content before I pack it into something I can use elsewhere.
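
To give a rough idea, here are a couple more cleansers in the same
style (illustrative examples only, not my exact set):

# strip HTML comments, which some tools scatter everywhere
$content =~ s!<\!--.*?-->!!gs;

# collapse runs of &nbsp; and other whitespace into a single space
$content =~ s!(?:&nbsp;|\s)+! !g;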

> Unfortunately, many such books are just "not really HTML", for
> example, text inside PRE tags, or text w/o P tags, just many
> non-breakable spaces and BRs. Such "HTML files" are mostly useless,
> anyway.

That's also easy to fix, once you have a general baseline of how badly
"web developers" can mangle content. Most of it can be cleaned up
automatically, and whatever can't be fixed either blows up the parser or
gets ignored.
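
As a rough sketch of that kind of cleanup for the &nbsp;-and-BR soup
described above (illustrative passes, not Plucker's actual parser):

# unwrap PRE-only "books" and turn runs of <br> into paragraph breaks
$content =~ s!</?pre[^>]*>!!gi;
$content =~ s!&nbsp;! !gi;
$content =~ s!(?:<br\s*/?>\s*){2,}!</p>\n<p>!gi;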

I've been thinking about adding an auto-report function to my script:
any page which does not cleanly convert, validate, or parse due to
broken or invalid nesting (or whatever else) triggers an email to
[EMAIL PROTECTED] with a full report of what is broken, and lets them
decide whether or not to fix it.
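
As a very rough sketch of that hook (the contact address comes from
wherever the site publishes one; none of this exists in Plucker today):

# hypothetical report hook: mail the parse errors back to the site contact
sub report_broken_page {
    my ($url, $contact, @errors) = @_;
    open my $mail, '|-', '/usr/sbin/sendmail -t'
        or die "can't run sendmail: $!";
    print $mail "To: $contact\n";
    print $mail "Subject: broken markup on $url\n\n";
    print $mail join("\n", @errors), "\n";
    close $mail;
}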

It shouldn't be Plucker's job to fix the Web. 

RSS/RDF/XML feeds are another problem altogether. 

Many aggregators try to "fix" invalid XML and parse it anyway, which
directly violates the XML and RSS specifications. If a feed is not
well-formed, the aggregator should immediately error out.
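
That check is cheap to do; a minimal sketch using XML::Parser (expat),
which dies on the first error instead of guessing what was meant:

use XML::Parser;

# reject anything that is not well-formed, as the spec requires
eval { XML::Parser->new->parse($feed_content) };
die "feed is not well-formed XML: $@" if $@;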

Broken XML shouldn't be "fixed" silently by the aggregator. Doing so
does not encourage the feed maintainer to fix their broken feed, it puts
us right back where we are now with broken HTML content, and it's also
against the XML specification.



-- 
David A. Desrosiers
desrod gnu-designs com
http://gnu-designs.com

"Erosion of civil liberties... is a threat to national security."
