Tyler Clendenin
GSL Solutions
----- Original Message -----
From: Rob Rohan
To: CF-Talk
Sent: Monday, February 09, 2004 11:17 AM
Subject: CFMX - best way to strip content from html page
Hey there hi there ho there,
I was wondering what others have used to strip the content out of web
pages? I am working on a system that collects pages and archives them;
however, only the content needs to be stored (i.e. not the navigation,
images, extra page fodder).
The sites it is archiving are vast, so it would have to be a rather generic
solution. I have seen this kind of thing before, but only for single
specific sites. Does anyone know a good method to do it generically?
I was leaning toward one of these, but I am open to whatever:
* run the collected html through tidy (or jtidy) then (somehow) use xslt
* (somehow) use a regular expression on the collected html
If anyone has done this before, please let me know of pitfalls or
recommendations. BTW, I have time, not money, so any paid solutions are
right out.
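For what it's worth, here is a rough sketch of a third angle on the
"generic" part, in addition to the two ideas above: split the page into
text blocks, then drop blocks that are very short or mostly link text,
since navigation tends to be both. It is written in Python with only the
standard library, purely as an illustration. It is not CFMX, tidy, or XSLT,
and the names (ContentExtractor, extract_content) and the thresholds
(min_words, max_link_ratio) are made up for the example and would need
tuning against real pages.

```python
from html.parser import HTMLParser

# Tags whose contents are never treated as content (an assumption; tune per corpus).
SKIP_TAGS = {"script", "style"}
# Tags that are treated as boundaries between text blocks.
BLOCK_TAGS = {"p", "div", "td", "tr", "table", "li", "ul", "ol",
              "h1", "h2", "h3", "h4", "h5", "h6", "br"}


class ContentExtractor(HTMLParser):
    """Collects (text, link_chars) pairs, one per block of the page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []        # finished blocks
        self._text = []         # text pieces of the block being built
        self._link_chars = 0    # characters of the current block inside <a>
        self._skip_depth = 0    # > 0 while inside a skipped tag
        self._in_link = 0       # > 0 while inside <a>

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_words=8, max_link_ratio=0.3):
    """Return the blocks that look like content, separated by blank lines."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        if len(text.split()) < min_words:
            continue                      # too short: titles, labels, footers
        if link_chars / len(text) > max_link_ratio:
            continue                      # mostly links: navigation
        kept.append(text)
    return "\n\n".join(kept)
```

Calling extract_content(page_html) on a collected page returns the
surviving blocks as plain text. The obvious pitfall is that short but
genuine paragraphs are thrown away along with the navigation, so the
thresholds are a trade-off rather than a fix.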
Thanks
--
Vale,
Rob
Luxuria immodica insaniam creat.
Sanam formam viatae conservate!
http://www.rohanclan.com
http://treebeard.sourceforge.net
http://ashpool.sourceforge.net

