David A. Desrosiers wrote:
> One perl regex will strip all of that out:
>
>   $content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;
>
> I use this (and about a dozen other "cleanser" regexes) regularly before
> I parse content with my perl spider to reduce the in-memory footprint of
> content before I pack it into something I can use elsewhere.
After your first message I checked out sitescooper and the other tools
you mentioned. Then I started searching for more "cleansers" and found
this:
http://shelldorado.com/scripts/names.html (look for striphtml)
What "cleanser regexes" do you actually use, David? I would be very
glad, if you posted your regexes, in order to get an idea of good ones.
I have started learning shell programming just a couple of weeks ago, so
I am not considering myself a pro, but still a beginner! That's why I
wouldnt be too unhappy, if you sent your script(s).
What is your experience? Is it more efficient to have a single script
containing all of your regexes, which decides by itself what needs to be
done? Or is it better to have separate scripts for different types of
web pages?
How do you handle frames? Surely you must find the content page, but
what if that page is missing any navigational links? Let's assume the
navigation is kept in a separate navigation frame. I am asking because I
am interested in many more web pages than the two I pointed to in my
first message. For example, striphtml from Shelldorado gives me rather
mediocre results on this one:
http://www.skolnet.de/DATEN/schg.tgz
> I've been thinking about adding an auto-report function to my script, so
> that pages which do not cleanly convert, validate, or parse due to broken
> or invalid nesting or whatever send an email to [EMAIL PROTECTED] with a
> full report of what is broken, and let them decide whether to fix it or
> not.
That would be great! But as long as most "coders" stay fixated on
Internet Explorer, we will keep getting badly coded websites.
Marius
_______________________________________________
plucker-list mailing list
[email protected]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list