David A. Desrosiers wrote:
> One perl regex will strip all of that out:
>
>   $content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;
>
> I use this (and about a dozen other "cleanser" regexes) regularly before
> I parse content with my perl spider to reduce the in-memory footprint of
> content before I pack it into something I can use elsewhere.
After your first message I checked out sitescooper and the other tools
you mentioned. Then I started searching for more "cleansers" and found
this:
http://shelldorado.com/scripts/names.html (look for striphtml)
What "cleanser regexes" do you actually use, David? I would be very
glad, if you posted your regexes, in order to get an idea of good ones.
I have started learning shell programming just a couple of weeks ago, so
I am not considering myself a pro, but still a beginner! That's why I
wouldnt be too unhappy, if you sent your script(s).
What is your experience? Is it more efficient to have a single script
containing all of your regexes, which decides by itself what needs to be
done? Or is it better to have separate scripts for different types of
web pages?
How do you handle frames? Surely you must find the content page, but
what if that page is missing any navigational links? Let's assume the
navigation is kept in a separate navigation frame. I am asking because I
am interested in many more web pages than the two I pointed to in my
first message. For example, striphtml from Shelldorado gives me rather
mediocre results on this one:
http://www.skolnet.de/DATEN/schg.tgz
> I've been thinking about adding an auto-report function to my script, so
> that pages which do not cleanly convert, validate, or parse due to broken
> or invalid nesting or whatever send an email to [EMAIL PROTECTED] with a
> full report of what is broken, and let them decide whether to fix it or
> not.
That would be great! But as long as most "coders" stay fixated on
Internet Explorer, we will keep getting badly coded websites.
Marius
_______________________________________________
plucker-list mailing list
[email protected]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list