David A. Desrosiers wrote:

One perl regex will strip all of that out:
$content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;

I use this (and about a dozen other "cleanser" regexes) regularly before
I parse content with my perl spider to reduce the in-memory footprint of
content before I pack it into something I can use elsewhere.

After your first message I checked out sitescooper and the other tools you mentioned. Then I started searching for more "cleansers" and found this:

http://shelldorado.com/scripts/names.html  (look for striphtml)

What "cleanser regexes" do you actually use, David? I would be very glad if you posted your regexes, so I can get an idea of what good ones look like. I started learning shell programming only a couple of weeks ago, so I consider myself a beginner rather than a pro. That's why I wouldn't be too unhappy if you sent your script(s) as well.

What is your experience? Is it more efficient to have a single script containing all of your regexes, which decides by itself what needs to be done? Or is it better to have separate scripts for different types of webpages?

How do you handle frames? Surely you must find the content page, but what if that page lacks any navigational links? Let's assume the navigation sits in a separate navigation frame. I am asking because I am interested in many more webpages than the two I pointed to in my first message. For example, striphtml from Shelldorado gives me rather average results on this one:

http://www.skolnet.de/DATEN/schg.tgz
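For the frames question above, one approach is to pull the src attribute out of each <frame> tag in the frameset page, so the spider can fetch and cleanse each frame (navigation and content) separately. A minimal sed sketch; the file frameset.html and its frame names are a made-up example, not one of the pages mentioned above:

```shell
#!/bin/sh
# List the src URLs of all <frame> tags in a saved frameset page,
# one per line, so each frame can be fetched on its own.
frame_urls() {
    # case-insensitive match on <frame ... src="...">, one tag per line assumed
    sed -n 's/.*<[Ff][Rr][Aa][Mm][Ee][^>]*[Ss][Rr][Cc]="\([^"]*\)".*/\1/p' "$1"
}

cat > frameset.html <<'EOF'
<frameset cols="20%,80%">
  <frame name="nav" src="nav.html">
  <frame name="main" src="content.html">
</frameset>
EOF

frame_urls frameset.html
```

The <frameset> line itself is skipped because it carries no src attribute; the two frame URLs come out one per line, ready for a fetch loop.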

I've been thinking about adding an auto-report function to my script, so
that pages which do not cleanly convert or validate or parse due to broken
or invalid nesting or whatever... send an email to [EMAIL PROTECTED] with
a full report of what is broken, and let them decide whether to fix it or
not.
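The auto-report idea could be sketched roughly like this in shell. The crude "validation" here (comparing counts of opening and closing tags) and the example URL are stand-ins; a real version would run the page through a proper validator and look up the site's webmaster address before piping the report to mail(1):

```shell
#!/bin/sh
# Build a report for a page whose markup looks broken. The tag-count
# check is only a toy stand-in for real validation.
report_broken() {
    url=$1
    file=$2
    opens=$(grep -o '<[A-Za-z][^>/]*>' "$file" | wc -l | tr -d ' ')
    closes=$(grep -o '</[A-Za-z][^>]*>' "$file" | wc -l | tr -d ' ')
    if [ "$opens" -ne "$closes" ]; then
        echo "Subject: broken markup on $url"
        echo ""
        echo "$url has $opens opening but only $closes closing tags."
        # a real script would pipe this report to mail(1)
    fi
}

printf '<html><p>hello</html>\n' > broken.html
report_broken "http://www.example.com/" broken.html
```

A page that balances its tags produces no report at all, so only genuinely suspicious pages would generate mail.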

That would be great! But as long as most "coders" stay fixated on Internet Explorer, we will keep getting badly coded websites.


Marius

_______________________________________________
plucker-list mailing list
[email protected]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
