> For *books* I usually edit them manually since often, as you have
> said, one just needs to weed out several screenfuls of JS code from
> the top and bottom of the page.
One perl regex will strip all of that out:

    $content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;

I use this (and about a dozen other "cleanser" regexes) regularly before I parse content with my perl spider, to reduce the in-memory footprint of the content before I pack it into something I can use elsewhere.

> Unfortunately, many of such books are just "not really HTML", for
> example, text inside PRE tags, or text w/o P tags, just many
> non-breakable spaces and BRs. Such "HTML files" are mostly useless,
> anyway.

Also easy to fix, if you have a general baseline of how badly "web developers" can screw up content. Most of it is easily fixed, and what can't be fixed either blows up the parser or is ignored.

I've been thinking about adding an auto-report function to my script, so that pages which do not cleanly convert, validate, or parse, due to broken or invalid nesting or whatever, send an email to [EMAIL PROTECTED] with a full report of what is broken, and let them decide whether to fix it or not. It shouldn't be Plucker's job to fix the Web.

RSS/RDF/XML feeds are another problem altogether. Many aggregators try to "fix" invalid XML and parse it anyway, which directly violates the XML and RSS specifications. If a feed is not well-formed, it should immediately error out. Broken XML shouldn't be "fixed" silently by the aggregator. Doing so does not encourage the feed maintainer to fix their broken feed, and we end up exactly where we are now with broken HTML content; it is also against the XML specification.

--
David A. Desrosiers
desrod gnu-designs com
http://gnu-designs.com

"Erosion of civil liberties... is a threat to national security."

_______________________________________________
plucker-list mailing list
plucker-list@rubberchicken.org
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
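For readers who don't use perl, the cleanser regex above can be sketched in Python. This is a rough translation (the function name and sample input are my own, not the author's code): the `gis` modifiers on the perl substitution map to a global `re.sub` with the `re.I` (case-insensitive) and `re.S` (dot matches newline) flags, and `\1` is the backreference that forces the closing tag to match the opening one.

```python
import re

# Equivalent of perl's s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis :
# strip <script>...</script> and <style>...</style> blocks, including
# their contents, case-insensitively and across newlines.
CLEANSER = re.compile(r'<(s(?:cript|tyle))[^>]*>.*?</\1>', re.I | re.S)

def strip_script_and_style(content: str) -> str:
    # re.sub is global by default, like perl's /g modifier.
    return CLEANSER.sub('', content)

html = ('<p>keep</p>'
        '<script type="text/javascript">alert("x");</script>'
        '<style>p { color: red }</style>'
        '<p>me</p>')
print(strip_script_and_style(html))  # -> <p>keep</p><p>me</p>
```

As with the perl original, this is a lossy heuristic, not an HTML parser; it is fine for weeding out screenfuls of JS before further processing, but pathological markup (e.g. an unclosed script tag) can defeat it.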